Data Science on Firesquads: Classifying Emails with Naïve Bayes

At Hudl, each squad on the product team takes two weeks each year to help out the coach relations team in an ongoing rotation known as Firesquads. This year, for Firesquads, the data science squad built a Naïve Bayes classifier to automate the task of categorizing emails.

At Hudl, we take great pride in our Coach Relations team and the world-class support they provide for our customers. To help them out and to foster communication between the product team and Coach Relations, we have an ongoing rotation known as Firesquad. Each squad on the product team takes a two-week Firesquad rotation, during which we build tools and fix bugs that will help Coach Relations provide support more efficiently and more painlessly.

Introduction

This year, for our Firesquad rotation, we on the Data Science squad wanted to help automate the classification of support emails. The short-term goal was to reduce the time Coach Relations needs to spend when answering emails. Longer term, this tool could allow us to automatically detect patterns and raise alarms when specific support requests are occurring at an abnormal rate.

Road-mapping

At a practical level, we had two challenges to solve:

  1. Train an effective classifier.
  2. Build an infrastructure that reads emails from Zendesk, classifies them, and writes those classifications back to Zendesk.

To solve these challenges, we used the following technologies.

Modeling: Apache Spark’s MLlib

Although there are many machine learning implementations that could be used to classify emails, few of them can train on large datasets as efficiently as Apache Spark’s MLlib. Given the large number of emails in our training sample and the even larger number of features that we anticipated using, the choice of MLlib was quite natural.

Data Pipeline: Amazon’s Kinesis

Amazon Kinesis is a cloud-based service for processing and streaming data at a large scale. Although we do not currently have a large influx of support emails that would necessitate such a solution, we decided to use Amazon’s Kinesis because of its scalability and ease of use. In addition, learning to use Kinesis would level up our team for processing large-scale data in real time.

The Classifier

The task of classifying emails is not a new one. Spam filters are a classic example of this task. Rather than reinvent the wheel, we decided to use a tried-and-true approach: a Naïve Bayes classifier using word n-grams as features.

Naïve Bayes

Bayes’ theorem (shown below) indicates that the probability that a certain email is of class C_k, given that it has a certain set of features x_1, ..., x_n, is equal to the prior probability of that class (how often that class occurs in all the emails) times the probability of those features occurring in an email given that the email is of class C_k, divided by the probability of those features occurring.
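In LaTeX notation, the equation referenced above can be reconstructed as:

    P(C_k \mid x_1, \dots, x_n) = \frac{P(C_k) \, P(x_1, \dots, x_n \mid C_k)}{P(x_1, \dots, x_n)}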

Using the chain rule allows us to rewrite Bayes’ theorem as:
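Reconstructing that expansion in the same notation:

    P(C_k \mid x_1, \dots, x_n) = \frac{P(C_k)\, P(x_1 \mid C_k)\, P(x_2 \mid C_k, x_1) \cdots P(x_n \mid C_k, x_1, \dots, x_{n-1})}{P(x_1, \dots, x_n)}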

We’re unlikely to have any way to estimate these complex conditional probabilities, so we make the naïve assumption that features are conditionally independent:
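That is, for every feature x_i, we assume:

    P(x_i \mid C_k, x_1, \dots, x_{i-1}) = P(x_i \mid C_k)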

This makes the problem much more tractable and allows us to simplify the initial classification probability to the product of simple single-feature conditional probabilities, as shown below:
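Reconstructed in LaTeX notation, that simplified form is:

    P(C_k \mid x_1, \dots, x_n) \propto P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k)

The predicted category for an email is then simply the class C_k that maximizes this product.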

Data

To start, we exported all email data from Zendesk from 2012 to 2015. After removing unlabeled emails and emails with out-of-date labels, we had 150,000 emails to use in building and evaluating the model. 80% of these emails were used to train the model while 20% were reserved as a test set for performance evaluation.
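As a minimal sketch of that split (assuming the labeled emails have already been loaded into a Spark RDD of (category, email_text) pairs; the variable names are illustrative):

    # Split the labeled emails into an 80% training set and a 20% test set.
    # `labeled_emails` is assumed to be an RDD of (category, email_text) pairs.
    train_rdd, test_rdd = labeled_emails.randomSplit([0.8, 0.2], seed=42)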

Model

A flowchart showing the steps we took to build the classifier is shown above. We first tokenized each email by creating n-grams of one, two, three, four, and five words. After this, we removed all “stop words” such as “the” or “and.” To find out which tokens are most important for differentiating between categories, we went through each email category and calculated the signal-to-background (S/B) ratio. The S/B ratio for a given category and token is defined as the number of emails containing that token that are in the category divided by the number of emails containing that token that are not in that category. For a specific category, call it A, this can be written: P(A | token)/(1 - P(A | token)). We want the S/B ratio to be fairly high so that we only use tokens that have strong discriminating power; in our case we require S/B > 4.
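A rough sketch of this tokenization and token-selection step is below (plain Python rather than the actual Spark job; the stop-word list, helper names, and exact ordering of the steps are illustrative assumptions):

    from itertools import chain
    from collections import defaultdict

    STOP_WORDS = {"the", "and", "a", "an", "to", "of"}  # illustrative subset

    def ngrams(words, n):
        """Return the n-grams (as space-joined strings) in a list of words."""
        return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

    def tokenize(email_text):
        """Tokenize an email into 1- to 5-word n-grams, dropping stop words."""
        words = [w for w in email_text.lower().split() if w not in STOP_WORDS]
        return set(chain.from_iterable(ngrams(words, n) for n in range(1, 6)))

    def select_tokens(labeled_emails, min_sb=4.0):
        """Keep (category, token) pairs whose signal-to-background ratio exceeds min_sb.

        `labeled_emails` is an iterable of (category, email_text) pairs."""
        in_category = defaultdict(int)  # (category, token) -> emails in that category containing the token
        containing = defaultdict(int)   # token -> total number of emails containing the token
        for category, text in labeled_emails:
            for token in tokenize(text):
                in_category[(category, token)] += 1
                containing[token] += 1
        selected = set()
        for (category, token), signal in in_category.items():
            background = containing[token] - signal
            if background == 0 or float(signal) / background > min_sb:
                selected.add((category, token))
        return selected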

Test Set Performance

The classifier was trained with Apache Spark MLlib’s implementation of a Naïve Bayes classifier, using the 80% training sample described above. Its overall accuracy, evaluated on the remaining 20% of emails that had been reserved as a test set, was 87.9%. The confusion matrix for the entire test set is displayed below.
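For concreteness, the training and test-set evaluation might look roughly like the following sketch (it assumes the train/test RDDs from the earlier split; `to_features`, which maps an email to a non-negative token-count vector, and `category_index`, which maps category names to numeric labels, are hypothetical helpers):

    from pyspark.mllib.classification import NaiveBayes
    from pyspark.mllib.regression import LabeledPoint

    # `to_features(text)` and `category_index` are assumed helpers (see above).
    def to_labeled_point(category, text):
        return LabeledPoint(category_index[category], to_features(text))

    train_points = train_rdd.map(lambda pair: to_labeled_point(*pair))
    test_points = test_rdd.map(lambda pair: to_labeled_point(*pair))

    # Train the MLlib Naive Bayes model (lambda_ is the additive-smoothing parameter).
    model = NaiveBayes.train(train_points, lambda_=1.0)

    # Evaluate accuracy on the held-out 20% test set.
    predictions = test_points.map(lambda lp: (model.predict(lp.features), lp.label))
    accuracy = predictions.filter(lambda pl: pl[0] == pl[1]).count() / float(test_points.count())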

As the confusion matrix shows, the classifier performs very well on the top labels but very poorly on labels that do not have many examples. This is largely because there is not enough data to learn discriminating features for these low-occupancy categories. So many low-occupancy categories exist largely because the email labeling used in Zendesk was recently updated for certain labels. With future data, the classifier can be retrained and its performance on many labels should improve dramatically.

Systematic Uncertainties

Although our classifier performs well on the 20% test set, this test set is not, in fact, representative of the current label distribution. The distribution of the top five labels over time is shown below:

To see how this changing label distribution would affect the accuracy, we calculate the accuracy for a given month by multiplying the precision for each label by the number of emails with that label, summing over labels, and dividing by the total number of emails in that month. Mathematically, this is represented by the following equation:
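Reconstructed in LaTeX notation from the definitions that follow:

    a_m = \frac{\sum_i p_i \, n_{i,m}}{\sum_i n_{i,m}}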

where a_m is the accuracy for month m, p_i is the precision for label i, and n_{i,m} is the number of emails in month m for label i. The figure below shows the distribution of calculated accuracies for each month in 2014 and 2015.

We expect that the mean of these accuracies will be similar to the mean we will see in the future when this classifier is put into production. In addition, we can calculate our systematic uncertainty on this mean by finding the difference between this mean and the 34.1% and 65.9% quantiles. This gives us an expected accuracy of 75.6% +5.8%/-6.8%.

Deployment

As mentioned previously, we chose to implement this classifier with the combination of Apache Spark’s MLlib and Amazon’s Kinesis. The use of both of these tools in tandem allows us to effortlessly scale the pipeline to handle widely varying loads.

The data pipeline consists of six principal steps, shown in the above flowchart:

  1. Emails collected by Zendesk are batched and JSON-formatted by our Zendesk/Kinesis Interface, written in Google’s Go language.
  2. The Zendesk/Kinesis Interface then implements a Kinesis Producer and publishes the email records to the input Kinesis shard.
  3. The Email Classifier, having loaded the latest model from Amazon’s S3 [A.], connects to the input shard and reads the latest records. It then formats the emails into a Spark RDD and classifies them in parallel.
  4. With the emails classified, the Spark job formats JSON records containing the email ID and the predicted label. It batches these into batches of 500 records and publishes them to a different, output Kinesis shard (a rough sketch of this step follows the list).
  5. The Zendesk/Kinesis Interface then receives these output records.
  6. With the labeled emails, the Zendesk/Kinesis Interface then modifies the webmail form by pre-populating the category selection with the predicted label.
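Here is a minimal sketch of step 4’s publishing side, assuming the output records are already (email ID, predicted label) pairs; the stream name and helper name are illustrative, and only the 500-record batch size and the JSON fields come from the description above:

    import json

    import boto3

    kinesis = boto3.client("kinesis")
    OUTPUT_STREAM = "classified-emails"  # illustrative stream name

    def publish_classifications(classified, batch_size=500):
        """Publish (email_id, predicted_label) pairs to the output Kinesis stream in batches of 500."""
        records = [
            {
                "Data": json.dumps({"email_id": email_id, "predicted_label": label}).encode("utf-8"),
                "PartitionKey": str(email_id),
            }
            for email_id, label in classified
        ]
        for start in range(0, len(records), batch_size):
            kinesis.put_records(StreamName=OUTPUT_STREAM, Records=records[start:start + batch_size])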

We can now scale this infrastructure by simply adding additional Kinesis shards when we are I/O limited, or by adding Spark executors if processing becomes a bottleneck.

Finally, as time progresses and we receive more emails and feedback from Coach Relations, we can retrain the existing model, or create new models altogether, and simply upload them to Amazon’s S3. The newest model is then selected and implemented automatically, allowing us to continually optimize and improve.
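A minimal sketch of how the newest model might be located on S3 (the bucket name, key prefix, and helper name are illustrative assumptions, and loading the model itself is omitted):

    import boto3

    s3 = boto3.client("s3")

    def latest_model_key(bucket="email-classifier-models", prefix="models/"):
        """Return the S3 key of the most recently uploaded model object, or None if there is none."""
        response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
        objects = response.get("Contents", [])
        if not objects:
            return None
        return max(objects, key=lambda obj: obj["LastModified"])["Key"]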

Conclusions and Next Steps

Moving forward, we would like to improve the email classification performance so that the classifier performs better on categories outside the top six. One way to do this is to gather more labeled email data and use it in training. We will gradually accumulate more labeled emails as the Coach Relations team answers more emails, so this will happen naturally. A second way is to use a more advanced classification scheme that does not make the naïve conditional-independence assumption used by Naïve Bayes. To this end, we have begun testing some recurrent neural networks.

Stay tuned, and if you want to help us build recurrent neural networks or other awesome classifiers, contact us!