The Low-Hanging Fruit of Redshift Performance

This post is a case study about how we fixed some Redshift performance issues we started running into as more and more people used it. A lot of the information I present is documented in Redshift's best practices, but some of it isn't easily findable or wasn't obvious to us when we were just getting started. I hope that reading this post will save you some time if you are just getting started with Redshift and want to avoid some of the pitfalls that we ran into!

This post is part three of our blog series about Fulla, Hudl's data warehouse. Part one talked about moving from Hive to Redshift and part two talked about loading in our application logging data. Part three was supposed to be about tailing the MongoDB oplog, but I decided to take a brief detour and tell you about some performance issues we started to run into after we hashed out our ETL cycle for logging data.

To begin, let's rewind about a month. We had about 30 people using our Redshift cluster regularly, a few of whom were "power users" who used it constantly. Redshift was proving itself to be a great tool to learn about our users and how they use Hudl. However, with increased usage, bottlenecks became a lot more noticeable and increasingly painful to deal with. Queries that used to take 30 seconds were taking minutes. Some longer queries were taking more than an hour. While this might be reasonable for large analytical batch jobs, it's hardly acceptable for interactive querying.

We quickly learned that the biggest source of latency for us was our logs table. Some brief background on the logs table: it contains all of our application logs with common fields parsed out. Most of our application logs contain key-value pairs that look something like:

... Operation=View, Function=Highlight, App=Android, User=123456...

We parse out about 200 of the most common key-value pairs into their own columns in the logs table, which makes queries like the sketch below possible. I won't go any further into how we populate the logs table since that's already been covered in part two of this series. Instead I'm going to talk about some problems with the table.
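To give a feel for how those parsed columns get used, here's a minimal, hypothetical query. The operation, app, and messagetime column names are assumptions based on the key-value pairs shown above, not the real schema.

    -- Hypothetical example: daily highlight views from the Android app.
    -- Column names mirror the parsed key-value pairs shown above.
    SELECT TRUNC(messagetime) AS day, COUNT(*) AS views
    FROM logs
    WHERE operation = 'View'
      AND app = 'Android'
    GROUP BY 1
    ORDER BY 1;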

Problems with the Logs Table

It's a super useful table!
It contains information about how our users are actually using our products. Everyone wants to use it in almost every query. This is actually a really good problem to have. It means we're providing people with something useful. However, it exacerbates some of our other problems.

It’s huge
It has more than 30 billion rows. This is a problem for several reasons. It requires queries to sift through much more data than they need to. It also makes it prohibitively slow to vacuum, which brings me to the next problem…

It was completely un-vacuumed
In Redshift, vacuuming does two things: 1) sorts the table based on a predefined sortkey and 2) reclaims unused disk space left after delete or update operations (here's a good SO answer on why this is necessary). The logs table is generally only appended to, so we don't really need to worry about reclaiming deleted space; however, the sorting aspect is drastically important for us. The logs table had a sortkey, but the table was 0% sorted, so most queries against the table had to do large table scans. We were also worried that it would take weeks to actually vacuum the table and that it would use all the cluster's resources in the process.
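If you want to check how badly a table needs a vacuum, Redshift's built-in svv_table_info system view reports how unsorted each table is. Something along these lines, assuming the table is named logs:

    -- Check what percentage of the table is unsorted.
    SELECT "table", unsorted, tbl_rows
    FROM svv_table_info
    WHERE "table" = 'logs';

    -- Sort the table and reclaim space from deleted rows.
    VACUUM logs;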

None of the columns had encodings
This was problematic because it meant we weren't using our disk space and memory efficiently. As a result, the table took up 20 TB of our 32 TB cluster, which meant we could never make a copy of the table within the cluster to fix our problems.
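For what it's worth, Redshift can tell you both what encodings a table currently has and what it would recommend. A quick sketch, again assuming the table is named logs:

    -- See what encoding (if any) each column currently has.
    SELECT "column", "type", encoding
    FROM pg_table_def
    WHERE tablename = 'logs';

    -- Ask Redshift to recommend encodings based on a sample of the table's data.
    ANALYZE COMPRESSION logs;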

It had a really wide column
We store the raw log message in a column alongside all the parsed-out fields. If someone wanted a field from a log that we didn't parse out, they had to extract it from the message field using regex or SQL wildcards. It was made even worse by the fact that the message column was of type VARCHAR(MAX), which means that 64K of memory had to be allocated for every single record that was read into memory. Even if the actual log message was only 1K, it still had to allocate 64K of memory for it. As a result, all of the queries that touched the message field had to write to disk a lot when they ran out of memory. It was just… just awful.
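To make that concrete, this is roughly what pulling an unparsed field out of the raw message looked like. ClipCount here is a made-up field name used purely for illustration:

    -- Hypothetical example: extracting a field we never parsed out.
    -- Every row scanned this way pays the full cost of the wide message column.
    SELECT REGEXP_SUBSTR(message, 'ClipCount=[0-9]+') AS clip_count_pair
    FROM logs
    WHERE message LIKE '%ClipCount=%';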

Let’s Fix It!

As I mentioned above, our humble two-node Redshift cluster only has 32 TB of space, so there's no way we could copy the 20 TB table within the same cluster. Additionally, we wanted to fix this problem without hindering performance and with minimal downtime, since some people use it for their full-time jobs. With all that in mind, here's the plan we came up with:

1. Spin up a new Redshift cluster from one of the snapshots that are taken regularly. The AWS Redshift console makes this super easy to do in about three clicks:

2. Drop the logs table on the new cluster. Re-create the table with messagetime as the sortkey and with the message column truncated to VARCHAR(1024). We chose VARCHAR(1024) because we realized >99% of the logs we were storing were shorter than that. What's more, the logs that were longer were usually long stack traces, which were typically not useful in the kind of offline analysis we use Fulla for. The new definition looked roughly like the sketch below.
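A sketch of the re-created table. The real table has around 200 parsed columns, so only a few illustrative ones (with made-up names) are shown here:

    -- Illustrative schema only, not the real table definition.
    CREATE TABLE logs (
        messagetime  TIMESTAMP,
        operation    VARCHAR(64),
        app          VARCHAR(64),
        user_id      BIGINT,
        message      VARCHAR(1024)   -- truncated from VARCHAR(MAX)
    )
    SORTKEY (messagetime);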

3. Use the copy command to load the most recent month's logs into the table from S3. Because the table is still empty at this point, COPY will automatically set sensible column encodings for us.
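The load was something along these lines; the bucket path and IAM role ARN are placeholders, and the exact format options depend on how your log files are stored in S3:

    -- A hedged sketch of the load; paths and the role ARN are placeholders.
    COPY logs
    FROM 's3://our-log-bucket/logs/2015/06/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
    GZIP
    COMPUPDATE ON;  -- let Redshift pick column encodings (only works on an empty table)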

4. Use the copy command to load the rest of the logs from S3. This took about 18 hours total. It's somewhat interesting how disk usage increases during a COPY and decreases afterwards (shown below). I assume this is because the files that were pulled from S3 get cleaned up after they're loaded into the table.

5. Run vacuum logs;. This took about 4 days total. After we do all that, the new cluster is ready to use! We just need to switch over from the old cluster. Luckily, Redshift makes it possible to do this with minimal downtime and without needing to update the connection string on all of your clients! We renamed the old cluster from "fulla" to "fulla-old", then renamed the new cluster from "fulla-new" to "fulla". Overall, the switchover took about 10 minutes total and people were ready to start using the shiny new logs table.
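If you're curious how far along a multi-day vacuum is, you can watch it from another session with the built-in svv_vacuum_progress view:

    -- Shows which table is being vacuumed, the current phase, and a rough
    -- estimate of the time remaining.
    SELECT table_name, status, time_remaining_estimate
    FROM svv_vacuum_progress;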

After

Some brief stats from after the migration: between the column encodings, the truncated message column, and having the table vacuumed, the total table size went from 20 TB to 5 TB. The number of "slow queries" we were detecting per day dropped significantly, as can be seen in my fancy Excel graph below:

During this time, overall usage has been steadily increasing as we've been educating more people on how to use Fulla:

Despite this increase in usage, cluster performance has improved greatly!

Next Steps

Our improvements are not yet complete. The logs table is still the biggest bottleneck of working with our Redshift cluster. We're putting a process in place to extract product-specific and team-specific tables from the log data rather than having everyone use one giant logolith (seewhatididthere.jpg). Another impending problem is that the logs table takes about 12 hours to vacuum. We kick off a vacuum job each night after we load in the logs from the previous day. To get that time down, we're planning on breaking off logs from previous years and creating a view over multiple tables, roughly as sketched below.
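The view-over-tables idea would look something like this; the table names are hypothetical, and it assumes the current table has first been renamed to logs_2015:

    -- Per-year tables unioned behind a view, so queries keep hitting "logs"
    -- while the nightly vacuum only has to touch the current year's table.
    CREATE VIEW logs AS
        SELECT * FROM logs_2015
        UNION ALL
        SELECT * FROM logs_2014
        UNION ALL
        SELECT * FROM logs_2013;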

Main Takeaways

  1. Vacuum your Redshift tables. Vacuum early and vacuum often.
  2. Make sure you have proper column encodings. The best way to do this is to let the copy command automatically set them for you if possible.
  3. Avoid VARCHAR(MAX). When you have VARCHAR fields, be really conservative with their size. Don't just default to VARCHAR(MAX) to avoid thinking about your data. There are real disadvantages to having wide fields.

If you are working with Redshift and this article helped you, please let me know! We're learning more about it every day and I'd love to swap tips and tricks with you. Stay tuned for part 4 of this series where I tell you all about tailing the MongoDB oplog.