Hello Fulla

Over the last year, the Data Engineering squad has been building a data warehouse called Fulla. Recently, the squad rethought our entire data warehouse stack. We’ve now released Fulla v2, and Hudlies are querying data like never before, giving us a better understanding of our customers and our product.

Every night, Fulla gets a fresh copy of most of our production data, which comes from SQL Server, MySQL, and a handful of MongoDB clusters. We also parse all the logs from our web servers and append them to a logs table. When we say “Fulla” at Hudl, most people think of re:dash, an open source query execution app. However, Fulla is really the whole stack: our ETL pipeline, Redshift, and re:dash.
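To make that concrete, here’s the kind of query Fulla enables once production tables and web logs sit side by side in Redshift. This is only an illustrative sketch: the endpoint, credentials, and the table and column names (weblogs, teams, etc.) are hypothetical, not our actual schema.

```python
# Illustrative only: join a (hypothetical) production table with the logs table
# to see which teams were most active over the last week.
import psycopg2  # Redshift speaks the Postgres wire protocol

conn = psycopg2.connect(
    host="fulla.example.us-east-1.redshift.amazonaws.com",  # placeholder endpoint
    port=5439, dbname="fulla", user="analyst", password="...",
)

query = """
    SELECT t.name, COUNT(*) AS page_views
    FROM weblogs w                       -- parsed web server logs
    JOIN teams t ON t.id = w.team_id     -- nightly copy from SQL Server
    WHERE w.logged_at > DATEADD(day, -7, GETDATE())
    GROUP BY t.name
    ORDER BY page_views DESC
    LIMIT 20;
"""

with conn, conn.cursor() as cur:
    cur.execute(query)
    for name, views in cur.fetchall():
        print(name, views)
```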

There are two big challenges that make exporting data tricky at Hudl:

  1. A few years ago we bet the farm on MongoDB. Many old data models continued living in SQL Server (users, teams, schools, etc.), but new models were sent to MongoDB.
  2. More recently, we started moving to a microservices architecture.

Both moves have been great for development at Hudl. But for serious statistical analyses, we need all the data in one place. In the early days of data science at Hudl, exporting data was a highly manual (and fairly janky) process. It involved finding the router or primary node for the Mongo collection we cared about, running mongoexport to an external drive attached to the server, and copying the data to S3. Then we would write a SQL query to get the rest of the business data we cared about and ship that to S3. If we wanted log data, we had to use the Splunk API to write a query, which felt a lot like draining the Atlantic with a coffee stirrer. Needless to say, we spent most of our time moving data around, and not much time doing the more interesting things data scientists love to do.
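For a sense of how manual that was, the Mongo half of the process boiled down to something like the sketch below. The host, database, collection, and bucket names are all made up for illustration; the point is the dump-then-upload shuffle, not the specifics.

```python
# Rough sketch of the old manual export: dump a Mongo collection with
# mongoexport, then push the file to S3. All names here are placeholders.
import subprocess
import boto3

# Step 1: run mongoexport against the router/primary for the collection we care about
subprocess.run([
    "mongoexport",
    "--host", "mongo-primary.internal:27017",   # hypothetical host
    "--db", "highlights",                       # hypothetical database
    "--collection", "clips",                    # hypothetical collection
    "--out", "/mnt/external/clips.json",        # external drive attached to the server
], check=True)

# Step 2: copy the dump up to S3 so downstream jobs can reach it
s3 = boto3.client("s3")
s3.upload_file("/mnt/external/clips.json", "example-data-bucket", "exports/clips.json")
```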

We quickly realized we needed a data warehouse. We use S3 to feed Spark batch jobs, so our initial thought was to build a Hive warehouse on top of S3 instead of HDFS. We thought, “If Netflix is doing it, how hard could it be?” As it turns out, very hard. Because we really wanted to use S3, we picked EMR as our Hadoop implementation. I won’t go in depth about this part of our journey, but here are a few problems we never found a good solution to:

  • Serialization/Deserialization
  • Latency
  • Multi-tenancy
  • Cluster maintenance

EMR shines as an engine for batch jobs. It’s extremely easy to use and we love Amazon’s Spark on YARN implementation. But as a persistent Hive warehouse, results were mixed at best. Could we have gotten it to work? Possibly. But it was a big headache and we were eager to move away from it.
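For anyone curious what “a Hive warehouse on top of S3” actually means in practice, it amounts to defining external tables whose data lives in a bucket instead of HDFS. A minimal sketch is below, using PyHive against an EMR master node; the table, columns, and bucket are made up, and this is not our real schema.

```python
# Sketch of a Hive external table backed by S3 (names are placeholders).
# Assumes HiveServer2 is reachable on the EMR master node.
from pyhive import hive

conn = hive.connect(host="emr-master.internal", port=10000, username="hadoop")
cur = conn.cursor()

cur.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS weblogs (
        logged_at  TIMESTAMP,
        user_id    BIGINT,
        team_id    BIGINT,
        path       STRING
    )
    PARTITIONED BY (dt STRING)
    STORED AS PARQUET
    LOCATION 's3://example-warehouse-bucket/weblogs/'
""")
```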

Enter Redshift. In June, we spent a few days with our AWS Solutions Architect learning about Redshift and spinning up a proof-of-concept cluster. It was love at first sight. Switching to Redshift solved all of the problems we faced with Hive. The one tradeoff is that Redshift is stricter about schemas, but after using it for a few months, I’m no longer convinced that constraint is a negative.
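As an example of that strictness: in Redshift every table is defined up front with concrete column types (plus distribution and sort keys), and data is typically bulk-loaded from S3 with COPY. The sketch below shows the general shape; the endpoint, table, bucket, and IAM role are placeholders, not our actual setup.

```python
# Sketch of loading data into Redshift: explicit schema up front, then COPY from S3.
# All names here are placeholders for illustration.
import psycopg2

conn = psycopg2.connect(
    host="fulla.example.us-east-1.redshift.amazonaws.com",  # placeholder endpoint
    port=5439, dbname="fulla", user="etl", password="...",
)

with conn, conn.cursor() as cur:
    # Schema is declared up front: types, distribution key, sort key.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS weblogs (
            logged_at  TIMESTAMP     NOT NULL,
            user_id    BIGINT,
            team_id    BIGINT,
            path       VARCHAR(2048)
        )
        DISTKEY (team_id)
        SORTKEY (logged_at)
    """)
    # Bulk load a day's worth of parsed logs from S3.
    cur.execute("""
        COPY weblogs
        FROM 's3://example-data-bucket/weblogs/2015-10-01/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-copy'
        FORMAT AS JSON 'auto'
    """)
```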

This is the first post in a three-part series on Fulla. The switch from Hive to Redshift has taken us from 4–5 diehard users to more than 80 Hudlies querying our data. That gives us a better view of the company, and we believe it will help promote a data-driven culture at Hudl, so we want to share how we built it. The next post will give an overview of our ETL pipeline and describe how we process our logs so they can be queried in Redshift. After that, we’ll post about how we tail Mongo oplogs to keep our copies of production data fresh and clean.