Populating Fulla with SQL Data and Application Logs

This is the second in a series of posts on Fulla, Hudl’s data warehouse. This post discusses our methods to update Fulla daily with data from our production SQL databases and our application logs.

This is the second post about Hudl’s internal data warehouse, Fulla. The first post introduced Fulla, discussed what we wanted from a data warehouse, and detailed our path to Fulla v2. This post will cover our ETL (extract-transform-load) pipeline to get Hudl’s production SQL and application log data into Fulla on a regular schedule.

Production SQL Data

We have three major sources of data that go into Fulla. The first is our production relational databases: one SQL Server and one MySQL. Each is about 100 GB, which makes this source easily the smallest of the three.

To keep Fulla in sync with our relational databases, we execute nightly batch jobs to refresh the data. For each database, we spin up an Elastic MapReduce (EMR) cluster to run a Sqoop job against a read replica. The data is saved in Avro format on S3. Once a table has been exported to S3, we truncate the table in Fulla and upload the newest data. Given the similarity between tables in SQL Server and MySQL and tables in Redshift, as well as the comparatively small size of our relational databases, this has been the easiest part of our ETL pipeline.
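
To make that concrete, here is a minimal sketch of one night’s refresh for a single table. Everything in it (connection strings, bucket paths, table and role names) is a placeholder, and the real pipeline is orchestrated by Luigi rather than a flat script:

    import os
    import subprocess

    import psycopg2  # Redshift speaks the PostgreSQL wire protocol

    # 1. Export one table from the read replica to S3 as Avro via Sqoop.
    subprocess.check_call([
        "sqoop", "import",
        "--connect", "jdbc:mysql://replica.example.internal/hudl",
        "--username", "etl_reader",
        "--password-file", "/etc/sqoop/password",
        "--table", "teams",
        "--as-avrodatafile",
        "--target-dir", "s3://example-fulla/sqoop/teams/",
    ])

    # 2. Truncate the Redshift copy and reload it from the fresh Avro files.
    conn = psycopg2.connect(host="fulla.example.redshift.amazonaws.com",
                            port=5439, dbname="fulla", user="etl",
                            password=os.environ["FULLA_PASSWORD"])
    with conn, conn.cursor() as cur:
        cur.execute("TRUNCATE mysql_prod.teams;")
        cur.execute("""
            COPY mysql_prod.teams
            FROM 's3://example-fulla/sqoop/teams/'
            IAM_ROLE 'arn:aws:iam::123456789012:role/fulla-copy'
            FORMAT AS AVRO 'auto';
        """)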

Application Logs

Our second source of data is our logs. Hudl made a decision a while back to Log All the Things. As a result, we generate a few hundred million application log messages per day during our busy seasons. We recently moved to Sumo Logic (from Splunk) to collect and review our log data. Sumo Logic is great for real-time queries, but it slows down when doing multi-day queries over a broad set of logs. Further, Sumo Logic only stores data for the last 60 days, making it impossible to do year-over-year comparisons.

Sumo Logic sends rolling archives of our logs to an S3 bucket, so getting the raw data isn’t difficult. However, our logs aren’t strictly structured, so mapping an individual log message onto a columnar database like Redshift is non-trivial. We decided to perform a few ETL steps before loading. First, we chose to keep only application logs and drop other logs (e.g. Nginx) for which the volume outweighed the utility. Second, we didn’t want to load just the raw log message into Redshift, as that would force Fulla users to do significant regex parsing and substring extraction themselves. To make it easier for Hudlies to query log messages, we decided to handle some initial parsing before loading into Redshift.

For example, a typical log message looks like the following:

2015-10-11 11:22:33,444 [INFO] [Audit] [request_id=c523a1e3] App=Hudl,Func=View,Op=Clip,Ip=6.10.16.23,AuthUser=12345,Team=555,Attributes=[Clip=4944337215,TimeMs=10]

It includes a timestamp, a hostname, a log level, and a bunch of key-value pairs. Many of these key-value pairs appear in almost all log messages, such as App, Func, Op, and AuthUser. We analyzed a set of log messages to find the 100 most commonly used key-value pairs, then wrote a job with a regex pattern for each of those key-value pairs to extract the keys and values from the message. We’ve also allowed users to request additional key-value pairs to parse out. Now, in addition to the common log values of time, host, and even the raw message, our logs table in Fulla has over 140 other columns that Hudlies can query. Wide tables aren’t a concern in Redshift: it’s a columnar database, so reads fetch only the requested fields rather than the entire row.
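
A rough sketch of that extraction step, using a toy key list (the real job covers the 100 most common keys plus any that users request):

    import re

    # Keys promoted to their own columns; the real list has 100+ entries.
    KEYS = ["App", "Func", "Op", "Ip", "AuthUser", "Team"]

    # One pattern per key: the key name, "=", then everything up to the
    # next comma or closing bracket.
    PATTERNS = {key: re.compile(r"\b%s=([^,\]]+)" % key) for key in KEYS}

    def parse_log(message):
        """Return a dict of extracted columns, with None for absent keys."""
        row = {"raw_message": message}
        for key, pattern in PATTERNS.items():
            match = pattern.search(message)
            row[key.lower()] = match.group(1) if match else None
        return row

    line = ("2015-10-11 11:22:33,444 [INFO] [Audit] [request_id=c523a1e3] "
            "App=Hudl,Func=View,Op=Clip,Ip=6.10.16.23,AuthUser=12345,"
            "Team=555,Attributes=[Clip=4944337215,TimeMs=10]")
    print(parse_log(line)["op"])  # -> Clip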

As an added benefit, it has been relatively easy to transition a query from Sumo Logic to Fulla. Hudlies are Sumo Logic power users, whether it’s someone from our awesome coach support team looking for an accidentally deleted playbook or a Product Manager checking the impact of a new feature. Rather than forcing them to memorize two ways of accessing data, they can use most of the same logic. This has been a strong factor in our quick adoption rate: over 180 Hudlies have signed up for Fulla, representing nearly half of all employees.

Lessons Learned

The transition to Fulla v2 has gone pretty well, and we’ve learned a lot along the way. Below are a few of our more helpful tips.

Automate, Automate, Automate.

The data in your data warehouse should update automatically, without someone needing to push a button. Your pipelines will need tweaking and require a fair bit of babysitting at first, but they will stabilize eventually. Think of the pieces that are shared across pipelines, and see how you can abstract them to be used in new pipelines.

For example, we have a script to launch an EMR cluster. You can specify a number of nodes, utilize spot pricing, and run any number of arbitrary jobs. To create a new job, you just need to use a defined format: have a run.sh script that will prepare your environment and run the job, then zip up the directory and post it to S3. Our scheduler spins up new clusters every night to run the jobs, and they terminate when they’re finished. This has made it really easy to add new components to our ETL pipeline and has saved us a lot of money by avoiding long-running EMR clusters that sit waiting for a task.
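
A condensed sketch of what such a launcher can look like with boto3. The bucket, bootstrap script, and role names are placeholders, and the instance types and bid prices are purely illustrative:

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    # Launch a transient cluster on spot instances that runs a job bundle
    # from S3 and terminates as soon as the work is done.
    response = emr.run_job_flow(
        Name="fulla-nightly-sqoop",
        ReleaseLabel="emr-4.2.0",
        Instances={
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m3.xlarge",
                 "InstanceCount": 1, "Market": "SPOT", "BidPrice": "0.10"},
                {"InstanceRole": "CORE", "InstanceType": "m3.xlarge",
                 "InstanceCount": 4, "Market": "SPOT", "BidPrice": "0.10"},
            ],
            "KeepJobFlowAliveWhenNoSteps": False,  # terminate when idle
        },
        Steps=[{
            "Name": "run-job-bundle",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                # script-runner executes our (hypothetical) bootstrap script,
                # which unzips the bundle from S3 and calls its run.sh.
                "Jar": "s3://elasticmapreduce/libs/script-runner/script-runner.jar",
                "Args": ["s3://example-etl/bootstrap/run_bundle.sh",
                         "s3://example-etl/jobs/sqoop-nightly.zip"],
            },
        }],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    print(response["JobFlowId"])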

Use a Workflow Management Tool

We make heavy use of Spotify’s Luigi in our pipelines. It is great for managing dependencies when executing long chains of tasks. Make sure you take full advantage of it by composing your jobs into the smallest units that make sense. For example, our Sqoop job on our SQL Server exports data from 80 different tables. Rather than doing a single sqoop import-all-tables task in Luigi, we do a sqoop import task for each of the 80 tables. If our spot instances get terminated after exporting 79 of 80 tables, we can quickly pick up where we left off rather than starting over at the beginning. Luigi sees that we’re only missing data in S3 for one of the tables, so it does a sqoop import on that table, then proceeds to cleaning and loading to Redshift.
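
A pared-down sketch of that per-table pattern, with hypothetical class, helper, and bucket names. Luigi considers a task complete when its output already exists, which is exactly what makes the resume behavior work:

    import luigi
    from luigi.contrib.s3 import S3Target

    TABLES = ["teams", "users", "clips"]  # the real list has ~80 entries

    class SqoopTableImport(luigi.Task):
        """Export a single table to S3; skipped if its output exists."""
        table = luigi.Parameter()
        date = luigi.DateParameter()

        def output(self):
            return S3Target("s3://example-fulla/sqoop/%s/%s/_SUCCESS"
                            % (self.date.isoformat(), self.table))

        def run(self):
            # run_sqoop_import is a hypothetical helper that shells out
            # to sqoop, as in the earlier sketch.
            run_sqoop_import(self.table, self.output().path)

    class AllTableImports(luigi.WrapperTask):
        """Fan out one import task per table instead of one import-all job."""
        date = luigi.DateParameter()

        def requires(self):
            return [SqoopTableImport(table=t, date=self.date) for t in TABLES]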

There are a number of other workflow management tools out there, like Airbnb’s Airflow. Find one that works for you (or build your own, if you’re adventurous).

Know Your Data

As you’re putting together your data warehouse, it’s important to know how your data is stored and how your users will access it. This will come into play not just in choosing SORTKEY and DISTKEY settings for Redshift, but in how your ETL pipeline is structured.
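
For instance, a logs table that is mostly filtered by time and joined on user might be declared like this. This is a hypothetical illustration rather than our actual schema:

    import os
    import psycopg2

    # Hypothetical DDL: distribute rows by user so per-user joins stay
    # node-local, and sort by time so date-range scans skip blocks.
    DDL = """
    CREATE TABLE logs (
        log_time    TIMESTAMP,
        host        VARCHAR(64),
        app         VARCHAR(32),
        auth_user   BIGINT,
        raw_message VARCHAR(65535)
    )
    DISTKEY (auth_user)
    SORTKEY (log_time);
    """

    conn = psycopg2.connect(host="fulla.example.redshift.amazonaws.com",
                            port=5439, dbname="fulla", user="etl",
                            password=os.environ["FULLA_PASSWORD"])
    with conn, conn.cursor() as cur:
        cur.execute(DDL)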

As one example, I noted that we truncate all the tables from our relational databases and load entirely fresh data every morning. It’s possible to update tables from a relational database in Redshift without doing this, by updating only the rows that changed. We briefly discussed streaming the binary log from MySQL and the transaction log from SQL Server, saving the updates to a staging table, and merging on a more frequent basis. However, we ultimately decided this wasn’t worth the engineering time. Our relational databases are not that big, and they don’t contain our most frequently requested data. Running the Sqoop job and copying the data into Redshift takes about two hours with a 5-node EMR cluster. This ends up costing us less than $3 per day, as we use spot instances with ephemeral EMR clusters.

Our third post will discuss our initial performance issues with our design and the steps we took to improve the Fulla experience for our users.