Using Deep Learning to Find Basketball Highlights

Hudl stores petabytes of video. In that video there are a lot of awesome plays. Figuring out which plays are the most interesting and sifting through the uninteresting footage is a huge challenge. To solve this problem, we leveraged deep learning, Amazon Mechanical Turk, and crowd noise. The result: basketball highlights!

At Hudl, we would love to be able to watch every video uploaded to our servers and highlight the most impressive plays. Unfortunately, time constraints make this an impossible dream. There is, however, a group of people who have watched every game: the fans. Rather than polling these fans to find the best plays from every game, we decided to use their response to identify highlight-worthy plays.

More specifically, we will train classifiers that can recognize the difference between highlight-worthy (signal) and non-highlight-worthy (background) clips. The input to the classifiers will be the audio and video data and the output will be a score that represents the probability that the clip is highlight-worthy.

Training Sample

To select a sample of events to train our classifier on, we created a sample of 4153 clips, each 10 seconds long, from basketball games. No more than two clips come from the same basketball game and most are from different games played by different teams. This prevents the classifier from overfitting to a specific audience or arena. About half of the clips contain a successful 3-point shot. The other half are a semi-random selection of footage from basketball games.

We used Amazon Mechanical Turk (mTurk) to separate the plays with the most cheering from those with no cheering or no successful shot. To separate highlight-worthy clips from non-highlight-worthy clips, we sent each clip to two or three separate Turkers with the following instructions:

The distribution of average scores for the 4153 clips is shown below:

Clips that were unanimously scored as "3" were selected as our cheering "signal" while clips that were unanimously scored as "0" are considered to be background. This choice was made to provide maximum separation between signal and background. Moving forward, using a multi-class classifier that incorporates clips with a "1" or a "2" could improve the performance of the classifier when it is used on entire games. For the time being, however, we use a sample of 887 signal clips and 1320 background clips.
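
As a rough illustration of that selection, assuming the Turker responses are stored as one row per clip per worker (the table layout and column names here are hypothetical, not our actual schema):

```python
import pandas as pd

# Hypothetical layout: one row per Turker response, with a clip_id and a 0-3 score.
scores = pd.DataFrame({
    "clip_id": [1, 1, 2, 2, 3, 3],
    "score":   [3, 3, 0, 0, 3, 1],
})

per_clip = scores.groupby("clip_id")["score"].agg(["min", "max"])
signal_ids     = per_clip.index[per_clip["min"] == 3]   # unanimously scored 3
background_ids = per_clip.index[per_clip["max"] == 0]   # unanimously scored 0
```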

Pre-Processing Audio

Before sending our audio data to a deep learning algorithm, we wanted to process it to make it more intelligible than a series of raw audio amplitudes. We decided to convert our audio into an audio image that would reduce the data present in a 44,100 Hz wav file without losing the features that make it possible to distinguish cheering. To create an audio image, we went through the following steps (a rough code sketch follows the list):

  1. Convert the stereo audio to mono by dropping one of the two audio channels.
  2. Use a fast Fourier transform to convert the audio from the time-domain to the frequency-domain.
  3. Use 16 octave bands to slice the data into different frequency bins.
  4. Convert each frequency bin back to the time-domain.
  5. Create a 2D map using the frequency bins as the Y-axis, the time as the X-axis, and the amplitude as the Z-axis.
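
Here is a minimal sketch of those five steps using numpy and scipy. The band edges, time-bin count, and file handling are illustrative assumptions rather than our production settings.

```python
import numpy as np
from scipy.io import wavfile

def audio_image(path, n_bands=16, n_time_bins=128):
    rate, samples = wavfile.read(path)
    if samples.ndim == 2:                          # 1. stereo -> mono (drop a channel)
        samples = samples[:, 0]
    samples = samples.astype(np.float64)

    spectrum = np.fft.rfft(samples)                # 2. time-domain -> frequency-domain
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / rate)

    # 3. octave-band edges, here 16 bands ending at the Nyquist frequency
    top = rate / 2.0
    edges = [top / 2 ** i for i in range(n_bands, -1, -1)]

    image = np.zeros((n_bands, n_time_bins))
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        band = np.where((freqs >= lo) & (freqs < hi), spectrum, 0)
        band_signal = np.fft.irfft(band, n=len(samples))    # 4. back to time-domain
        # 5. mean amplitude per time bin becomes one row of the 2D map
        chunks = np.array_split(np.abs(band_signal), n_time_bins)
        image[i] = [c.mean() for c in chunks]
    return image
```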

Video Classifiers

Although audio cheering seems like an obvious way to identify highlights, it is possible that the video could also be used to separate highlights. We decided to train three visual classifiers using raw frames from the video. The first classifier is trained on video frames taken from the 2-second mark of each 10-second clip, the second on frames taken from the 5-second mark, and the third on frames taken from the 8-second mark. Representative frames from an example clip at 2, 5, and 8 seconds are shown below (left to right).

Deep Learning Framework

Because our 2D audio maps can be visualized as images, we decided to use Metamind as our deep learning engine. Metamind provides an easy-to-use Python API that lets the user train accurate image classifiers. Each classifier accepts an audio image as input and outputs a score that represents the probability that the prediction is correct.

Results

To train our classifiers we split the 887 signal clips and 1320 background clips into train and test samples: 85% of the clips are used to train the classifiers while 15% are reserved to test them (a minimal split is sketched after the list below). In total, we trained four classifiers:

  1. Audio Image
  2. Video Frame (2 seconds)
  3. Video Frame (5 seconds)
  4. Video Frame (8 seconds)
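
A minimal sketch of that split, assuming scikit-learn and hypothetical clip and label lists; the stratified split keeps the signal-to-background ratio the same in the train and test samples.

```python
from sklearn.model_selection import train_test_split

# Hypothetical inputs: one entry per clip, label 1 = signal, 0 = background.
clips  = ["clip_%04d.wav" % i for i in range(2207)]   # 887 signal + 1320 background
labels = [1] * 887 + [0] * 1320

train_clips, test_clips, y_train, y_test = train_test_split(
    clips, labels, test_size=0.15, stratify=labels, random_state=0)
```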

Signal Background Separation

To test how well each classifier fared, we consider its predictions on the reserved test set. Because the classifiers were not trained on these clips, overfitting cannot explain the performance observed on the test set. The predictions for signal and background for each of the four classifiers are shown in the plots below. The X-axis is the predicted probability of being signal (i.e. the output variable of the classifier) and the Y-axis is the number of clips predicted to have that probability. The red histogram indicates background clips and the blue histogram indicates signal clips.
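
The plots themselves are straightforward to reproduce; the sketch below uses randomly generated beta-distributed scores purely as stand-ins for the real classifier outputs.

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-in arrays: one predicted signal probability per test clip.
signal_scores     = np.random.beta(5, 2, size=130)
background_scores = np.random.beta(2, 5, size=200)

bins = np.linspace(0, 1, 26)
plt.hist(background_scores, bins=bins, color="red",  alpha=0.5, label="background")
plt.hist(signal_scores,     bins=bins, color="blue", alpha=0.5, label="signal")
plt.xlabel("Predicted probability of being signal")
plt.ylabel("Number of clips")
plt.legend()
plt.show()
```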

Receiver Operating Characteristic

The receiver operating characteristic (ROC) curve is a graphical way to illustrate the performance of a binary classifier as the discrimination threshold is changed. In our case, the discrimination threshold is the value of the classifier output above which a clip is declared to be signal. We can change this value to improve our true positive rate (the fraction of signal clips we correctly classify as signal) or reduce our false positive rate (the fraction of background clips we incorrectly classify as signal). For example, by setting our threshold to 1, we would classify no clips as signal and thereby have a 0% false positive rate (at the expense of a 0% true positive rate). Alternatively, we could set our threshold to 0 and classify all clips as signal, giving us a 100% true positive rate (at the expense of a 100% false positive rate).
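
In code, the two rates at a given threshold look roughly like this (a sketch with hypothetical score and label arrays; sweeping the threshold from 1 down to 0 traces out the ROC curve):

```python
import numpy as np

def rates_at_threshold(scores, labels, threshold):
    """True/false positive rates if clips scoring at or above `threshold` are called signal."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    called_signal = scores >= threshold
    tpr = called_signal[labels == 1].mean()   # fraction of signal clips kept
    fpr = called_signal[labels == 0].mean()   # fraction of background clips let through
    return tpr, fpr
```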

The ROC curve for each of the four classifiers is shown below. A single number that represents the strength of a classifier is the ROC area under the curve (AUC). This integral represents how well a classifier is able to differentiate between signal and background across all working points. The curves shown are the average of bootstrapped samples and the fuzzy band around each curve represents the ways in which the ROC curve could reasonably fluctuate.
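
One way to get the AUC and a rough uncertainty band is bootstrap resampling with scikit-learn; this is a sketch of the idea, not our exact procedure.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc(scores, labels, n_boot=1000, seed=0):
    """Mean ROC AUC and spread from bootstrap resamples of the test set."""
    rng = np.random.default_rng(seed)
    scores, labels = np.asarray(scores), np.asarray(labels)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(labels), size=len(labels))
        if labels[idx].min() == labels[idx].max():   # resample must contain both classes
            continue
        aucs.append(roc_auc_score(labels[idx], scores[idx]))
    return np.mean(aucs), np.std(aucs)
```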

Combining Classifiers

Because each of these classifiers provides different information, it's possible that their combination could perform better than any single classifier alone. To combine two classifiers we train a third classifier that takes, as features, the probabilities from the original classifiers and returns a single probability.

To visualize the performance of these combined classifiers we make a 2D plot with each axis representing one of the input probabilities. Each test clip is plotted as a point in this 2D space and is colored blue if it is signal or red if it is background. The prediction of the combined classifier is plotted in the background as a 2D color map, where the color represents the combined classifier's predicted probability of being signal or background.
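
The sketch below shows one way to build such a combination and the 2D color map, using a logistic regression on two probability features; the synthetic data and the choice of model are stand-ins, not our actual combiner.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

# Stand-in test-set inputs: probabilities from two base classifiers plus true labels.
rng = np.random.default_rng(0)
audio_p, video_p = rng.random(300), rng.random(300)
is_signal = (0.7 * audio_p + 0.3 * video_p) > 0.55

# Train the combining classifier on the two probabilities.
combiner = LogisticRegression().fit(np.column_stack([audio_p, video_p]), is_signal)

# Color map of the combined prediction over the 2D input space.
xx, yy = np.meshgrid(np.linspace(0, 1, 200), np.linspace(0, 1, 200))
zz = combiner.predict_proba(np.column_stack([xx.ravel(), yy.ravel()]))[:, 1].reshape(xx.shape)
plt.pcolormesh(xx, yy, zz, cmap="coolwarm_r", vmin=0, vmax=1, shading="auto")

# Each test clip as a point: blue for signal, red for background.
plt.scatter(audio_p[is_signal], video_p[is_signal], c="blue", s=8, label="signal")
plt.scatter(audio_p[~is_signal], video_p[~is_signal], c="red", s=8, label="background")
plt.xlabel("Audio classifier probability")
plt.ylabel("Video classifier probability")
plt.legend()
plt.show()
```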

Combined Classifier Performance

We create the ROC curves as before in order to evaluate the performance of these combined classifiers. As expected, the combined classifiers that include the audio classifier perform the best, and the ROC AUC improves from 0.96 for audio alone to 0.97 for audio plus video. This is not a dramatic improvement, but it demonstrates that there are gains to be had from adding visual information. When two visual classifiers are combined, the ROC AUC increases from ~0.79 to ~0.83. This increase indicates that there is additional information to be gained from utilizing different times in the video.

Final Combination

A final combination of all four classifiers was performed, but this ultimate combination was no better than the pairwise combination of audio and video. This indicates that further improvements to our classification would need to come from tweaks to the pre-processing of the data or to the classifiers themselves, rather than from simply adding more video classifiers to the mix.

Full Game Testing

Although we have evaluated our classifiers on test data, that testing was performed in a very controlled setting: the backgrounds we used are not necessarily representative of the clips present across an entire game. Furthermore, our ability to separate signal from background is useless if our top predictions in a specific game are not, in fact, among the top plays in that game.

To evaluate our classifier in the wild, we split four games into overlapping 10-second clips. Overlapping clips means that we make clips for 0 to 10 seconds, 5 to 15 seconds, 10 to 20 seconds, and so on (a windowing sketch follows the list below). These clips are then passed through the audio classifier. Our goal in doing this is to answer the following three questions:

  1. What is the distribution of probabilities for clips in a whole game?
  2. How many of our "top picks" are highlight-worthy?
  3. Does our signal probability rating represent the true probability of a clip being signal?
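
A minimal sketch of that windowing, with a 10-second clip length and a 5-second stride; the function name and parameters are illustrative.

```python
def overlapping_windows(game_length_s, clip_length_s=10, stride_s=5):
    """Start/end times (in seconds) for overlapping clips covering a full game."""
    windows = []
    start = 0
    while start + clip_length_s <= game_length_s:
        windows.append((start, start + clip_length_s))
        start += stride_s
    return windows

# e.g. a 32-minute game -> (0, 10), (5, 15), (10, 20), ...
clips = overlapping_windows(32 * 60)
```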

Probability Distributions

The probability distribution of clips for the four test games is found below.

As seen in the images above, the majority of clips are classified as background.

Top Picks

An animated gif for the top clip from each of the four test games is shown below. In addition, the top five clips from each game and their probabilities are shown and the content of each clip is discussed.

Of these, we would consider the made shots to be signal, which gives us 14 signal out of 20 total clips. Additionally, the top play of each game is signal.

To understand our expectations, we use the Poisson binomial distribution. The mean is the sum of all 20 probabilities and the standard deviation is the square root of the sum of p(1 - p) over the 20 probabilities. This indicates that we should expect 14.7 +/- 2.1 signal events. Our 14 observed signal events are consistent with this expectation, as seen in the distribution below.
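
The arithmetic is simple enough to spell out; the probabilities below are stand-ins for the actual scores of the 20 top picks.

```python
import numpy as np

# Stand-in signal probabilities for the 20 top picks.
probs = np.array([0.95, 0.90, 0.88, 0.85, 0.80] * 4)

mean = probs.sum()                           # expected number of signal clips
std  = np.sqrt((probs * (1 - probs)).sum())  # Poisson binomial standard deviation
print("expected signal clips: %.1f +/- %.1f" % (mean, std))
```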

Next Steps

There are many additional steps that could improve the performance of the highlight classifier, and a number of challenges must be solved before using the classifier is practical on a large scale.

Some of these are performance-related: fast processing and creation of the audio images.

Others are more practical: we need to be able to determine which team a highlight is for so we don't suggest that a player tag a highlight of them getting dunked on.

Additionally, there are improvements to the classifiers themselves: these include increasing the size of the training sample or performing more preprocessing of the data to make signal/background discrimination easier.

The last and perhaps most important step is the optimization of the video classifiers: right now these video classifiers provide minimal value when combined with the audio classifiers, but this value could be increased substantially if we were to standardize the location of the "cheering" within each clip. This would help us to distinguish between impressive successful shots, free throws, and plays that occur away from the basket.

It’s an exciting time for the product team at Hudl and we’re constantly coming up with innovative new projects to tackle. If you are interested in working with us to solve the next set of problems, check out our job postings!