Over the last year, the Data Engineering squad has been building a data warehouse called Fulla. Recently, the squad rethought our entire data warehouse stack. We’ve now released Fulla v2, and Hudlies are querying data like never before, giving us a better understanding of our customers and our product.
Every night, Fulla gets a fresh copy of most of our production data, which comes from SQL Server, MySQL, and a handful of MongoDB clusters. We also parse all the logs from our web servers and append them to a logs table. When we say “Fulla” at Hudl, most people think of re:dash, an open source query execution app. In reality, though, Fulla is the combination of our ETL pipeline, Redshift, and re:dash.
There are two big challenges that make exporting data tricky at Hudl:
- A few years ago we bet the farm on MongoDB. Many old data models continued living in SQL Server (users, teams, schools, etc.), but new models were sent to MongoDB.
- More recently, we started moving to a microservices architecture.
Both moves have been great for development at Hudl. But for serious statistical analyses, we need all the data in one place. In the early days of data science at Hudl, exporting data was a highly manual (and fairly janky) process. It involved finding the router or primary node for the Mongo collection we cared about, running mongoexport to an external drive attached to the server, and copying the data to S3. Then, we would write a SQL query to get the rest of the business data we cared about and ship that to S3. If we wanted log data, we had to use the Splunk API to write a query, which felt a lot like draining the Atlantic with a coffee stirrer. Needless to say, we spent the large majority of our time moving data around, and not much time doing the more interesting things data scientists love to do.
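For context, the manual process looked roughly like this. This is a sketch, not our actual scripts; the host, database, collection, and bucket names below are placeholders.

```shell
# 1. Export the Mongo collection from its primary node (or a router, for
#    sharded clusters) to an external drive attached to that server.
mongoexport --host mongo-primary.example.com --db exampledb \
    --collection clips --out /mnt/external/clips.json

# 2. Copy the export up to S3.
aws s3 cp /mnt/external/clips.json s3://example-data-bucket/exports/clips.json

# 3. Separately, dump the relational business data (users, teams, schools)
#    with an ad-hoc SQL query, and ship that file to S3 the same way.
```

Every step here was run by hand, per collection, which is why it didn’t scale past a handful of datasets.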
We quickly realized we needed a data warehouse. We use S3 to feed Spark batch jobs so our initial thought was to build a Hive warehouse on top of S3 instead of HDFS. We thought, “If Netflix is doing it, how hard could it be?” As it turns out, very hard. Because we really wanted to use S3, we picked EMR as our Hadoop implementation. I won’t go in depth about this part of our journey, but here are a few problems we never found a good solution to:
- Cluster maintenance
EMR shines as an engine for batch jobs. It’s extremely easy to use and we love Amazon’s Spark on YARN implementation. But as a persistent Hive warehouse, results were mixed at best. Could we have gotten it to work? Possibly. But it was a big headache and we were eager to move away from it.
Enter Redshift. In June, we spent a few days with our AWS Solutions Architect learning about Redshift and spinning up a proof-of-concept cluster. It was love at first sight. Switching to Redshift solved all of the above problems we faced with Hive. The one tradeoff is that Redshift is stricter about schemas, but after using it for a few months, I’m no longer convinced that this constraint is a negative.
This is the first in a three-part series on Fulla. The switch from Hive to Redshift has taken us from 4-5 diehard users to more than 80 Hudlies querying our data. This gives us a better view of the company, and we believe it’s going to promote a data-driven culture at Hudl, so we want to share how we built it. The next post will give an overview of our ETL pipeline and describe how we process our logs so they can be queried in Redshift. After that, we’ll post about how we tail Mongo oplogs to keep our copies of production data fresh and clean.