Queuing Up Heavy Tasks to an Autoscaling Worker Farm

An autoscaling farm of AWS EC2 instances sits behind our front-facing web application, working on heavy, long-running tasks like video transcoding, thumbnail generation, and computer vision processing. It’s a battle-tested combination of queues, worker instances, and an orchestration service called Lifeguard that easily hammers through thousands of these CPU-bound jobs per minute.

Queuing Up Heavy Tasks to an Autoscaling Worker Farm

Video is the number one reason people use Hudl. Coaches review video from prior games and analyze scout footage of the team they’re going to play next. College recruiters watch potential recruits, and athletes build highlight reels of their best moments.

Our application is always receiving video, essentially processing uploads 24/7. However, football season brings an extreme increase in upload volume. On a typical football season Friday night we receive film from over 10,000 different football teams – often of several different camera angles, and from both teams playing a game – amounting to around 8000 hours of video in a 12-hour window. At peak, we steadily bring in over 30 hours of video per minute.

The raw video all goes up into Amazon S3, but that’s not the end of it. Most of that footage needs some work. We might transcode it to different qualities and formats, generate thumbnails, or pass it through some computer vision processing to gather some data.

The video we get is typically broken up into “clips” of around 5 – 20 seconds each (think one clip = one football play). But doing all that work even on a short clip can take seconds to minutes, and also takes a good chunk of the CPU to do it. We use an autoscaling “worker farm” of AWS EC2 instances dedicated to handling these sorts of long-running, CPU-bound tasks. The composition of this worker farm and how we use it are the topics discussed herein.

Of Message Queues and Workers

The farm uses a fairly straightforward message-based producer/consumer system to manage the jobs we need to get done. Put together, it looks something like this:

Stepping through it:

1. Client applications upload video onto S3 and talk to our WebApp’s HTTP APIs to let them know about the new video (along with some other details).

2. The API endpoint in the web application builds a job message and queues it in an Amazon SQS queue. The job contains information about what needs to be done to the video, where the files are at on S3, and some identification data needed for post-processing. Job messages are just serialized DTOs, and might look something like:

[DataContract]
public class Mp4InterleaveJob : BaseEncodeJob
{
    [DataMember] public S3File InputFile { get; set; }
    [DataMember] public S3OutputFile OutputFile { get; set; }
    [DataMember] public byte MediaQuality { get; set; }
}

3. An EC2 worker pulls an item off the front of the queue and works on it. This usually involves pulling the video files down from S3 and running them through something like FFmpeg (using an abstraction library of ours) or MP4Box. The resulting files are put back up on S3.

4. When done, the worker pushes a result message into a different SQS queue dedicated to results. The result message is just another DTO and typically contains S3 paths to the new files that the worker created.

5. One of several backend result processors reads from the results queue and finishes things up. The result processors are part of a different producer/consumer system that’s more closely tied to the application code and can connect to our primary databases. During this post-processing work, the result processors are typically updating SQL Server or MongoDB records with the S3 paths and other data from the result message.

This works pretty well for us. The queues protect us from bursts and give us time to react to increased load. But, as mentioned earlier, the rate of messages into the job queue is far from constant. It can go from 33 jobs/minute during the week to 8200 jobs/minute on Friday nights:

Jobs ingested per hour during a typical football season week.

The jobs don’t need to be completed immediately, but we do want reasonably quick turnaround (what “quick” means is up to us, and varies per type of job — usually it’s just a few minutes). However, to account for the ever-changing rate of messages, the number of workers needs to be adjusted frequently. We could always run at maximum capacity, but don’t want to be over-provisioned all week just so we can handle weekends; that’d be a waste of money.

These reactive worker count adjustments shouldn’t be manual. Nobody needs to be on-call at all times monitoring queue lengths and turning digital dials to adjust worker counts. That sort of thing can and should be automatic.

Autoscaling With Lifeguard

Autoscaling (dynamically changing the size of a pool of instances) isn’t a new concept, and most cloud providers these days make it possible in some form. The brains of our autoscaling is a service called Lifeguard. We manage the service ourselves (i.e Lifeguard’s not a built-in AWS offering), but it offers us a bit more flexibility and control beyond what AWS’ native autoscaling groups allow.

Lifeguard is an orchestration service that monitors queue sizes and decides if it needs to create more or terminate existing workers. Our version of the service started as a C# port of an existing Java project with the same name, but has undergone a number of modifications for our workflows. We run it with Mono on an Ubuntu EC2 instance. Here’s how (roughly) it works in a scale-up scenario:

Lifeguard continuously monitors the size of a queue and the number of workers consuming from it, and decides one of three outcomes: do nothing, add more workers, or terminate some workers.
If Lifeguard determines more workers are needed, it makes an API request to EC2 to create more instances. When they’re created, worker instances are passed configuration (e.g. queue identifiers to consume from) via EC2 instance user-data.
The worker pulls down a deployment payload (the latest build from our build server) that contains the code/executables that will consume messages and work on them. It starts this application code up as a service and starts consuming from the queue.

Here’s what the instance scaling looks like over a 10-hour period:

Highlight pool queue length + requested and active workers. Note the shared Y-axis; units are queue size for red, worker count for yellow/blue.

You can see the surging queue sizes (red) and the system’s reaction by requesting (yellow) more workers that are then quickly started up (blue) to handle the load.

You might notice that it takes a while for machines to spin down – for the large spike around 9am, the added workers cranked through it pretty quickly, but then stayed active for about an hour before Lifeguard decided to take them out. This is to maximize the instance’s use with regard to how it’s billed by AWS. If you launch a new AWS instance, you’re billed for a minimum one full hour of usage for that instance, even if you only have it online for a few minutes. Lifeguard keeps any instance it spins up online for the full hour — we already paid for it, so we might as well get our money’s worth. This helps avoid paying extra when there’s lots of spiking traffic in short windows of time.

If you’re running your app in the cloud, most providers already give you the ability to autoscale. Here’s a quick, far-from-exhaustive summary of a few providers with some links if you’re interested in experimenting or exploring a particular platform’s behavior.

AWS EC2 — Can scale instances dynamically using Cloudwatch Metrics or via schedule-based events. CloudWatch metrics for non-EC2 services (e.g. SQS) can fire alarms to custom autoscaling ARNs for flexible scaling systems.
Google App Engine — Not built-in, but APIs can be used to build an orchestration engine. GAE provides a framework example and sample/starter applications with code to help you get started.
Microsoft Azure — Has an Autoscaling Application Block that can be configured with constraints and reactive rules. Can use Performance Counters and other Azure resources like queues to trigger scaling events.
AWS Elastic Beanstalk — Since it’s an AWS service, has built-in scaling that uses EC2’s functionality.
Heroku — Dynos can be scaled using the tiered-pricing Adept Scale add-on. Scales on expected response time after heuristically evaluating past application behavior.

Several of these have a base set of similar functionality. They automatically scale up and down based on a set of observed metrics (e.g. CPU utilization, queue sizes, page response times) and can have configured upper and lower constraints and added/removed instance counts.

Separate Worker Pools, Independently Scaled

Our worker farm is responsible for, at current count, 23 different types of jobs. Jobs can be grouped into pools, with each pool getting its own queues and workers that are managed independently from the others.

Pools of workers and their queues are managed independently.

Each pool is tuned in a way that makes sense for the jobs it’s working on. Some pools spin up more workers more quickly than others. Some pools keep a higher minimum worker count on hand because we expect frequent surges of messages and want to stay ahead of them. As an example, our transcoding pool keeps a 50-worker minimum, and scales up (10 servers at a time) to nearly 1000 on busy Friday nights.

The worker farm has become our go-to system for anything that should be done asynchronously and is long-running or CPU-bound. Here are some of the jobs that go through it:

Thumbnail Generation — For each video we receive, we generate several thumbnails of different sizes. We create around one million thumbnails each week when we’re busy, 99% of them taking five seconds or less to generate.
Fundraising Campaign Video Rendering — Via a workflow in our web application, coaches can create fundraising campaign videos that contain team highlights and an audio message. We combine the audio, video, and a few other elements into a rendered MP4 that the coach and athletes can share with parents, fans, and other people who might contribute to their program. If you’re interested in more about how we build those (using Twilio and FFmpeg), check out this post from earlier this year.
MP4 Concatenation — Coaches have the ability to take groups of their small, 5 – 20 second clips and combine them together into single MP4 or WMV files that can be downloaded. The duration of these jobs varies with the number of clips being concatenated; on average these take between 5 – 10 minutes to complete.
Rendering Highlights — Athletes and coaches use rich video tools in our application to mark great plays, while also adding features like overlaid spot shadows, title slides, music, and slow-motion. We render these into MP4s that are more easily shared and played back on mobile devices. During football season, it’s not uncommon for us to render upwards of 100,000 highlights a day. Our average render time for these is around three minutes.

Additional Behavior

Monitoring. For monitoring and troubleshooting, we (like with all of our applications) throw tons of logs into Splunk, using it to build real-time dashboards and trigger alerts if things aren’t behaving well (e.g. pools aren’t trying to scale, error rates are high, or jobs are running more slowly than usual).

Quick worker code updates. The workers check for code updates between each job, allowing us to upgrade the worker code pretty quickly (and without needing to rebuild the pool or spin up new instances). Because of this, every change we make has to be backwards compatible. We don’t wait for queues to flush before we update code, so if we change message structures the code has to be able to handle the old and new messages (at least for some time).

Priority queues. Each job queue also has a priority queue that gets consumed by the workers first, which lets us push more important or time-sensitive jobs through a worker pool a little more quickly.

Worker status reporting. Workers fire status and health updates to a SimpleDB store that gets watched by Lifeguard. If a worker fails to send an update for a period of time, or is producing a high percentage of errors, Lifeguard will consider it unhealthy, and attempt to terminate it and replace it with a new worker.

Closing Thoughts

We’ve used the farm for critical production traffic for several years now; it’s a solid, reliable piece of our architecture. Autoscaling saves us money, and the isolated pools help us optimize scaling rules to be ideal for the jobs they’re managing. When faced with a problem that needs an asynchronous, compute-heavy solution, the farm’s the first place we go to prototype things out.

Recognize when it makes sense to take work off of your primary application servers and dedicate CPU/GPU time to it. If applicable, leverage built-in autoscaling from your cloud provider or try wiring up their components (instances, queues, monitoring) in a way that makes sense for the problems you’re trying to solve. Make sure to log, monitor, and alert to ensure that systems and applications behave the way you expect.

As we grow and expand, we’ll undoubtedly be adding new pools and pushing more work into our existing job queues, which brings opportunities for improvements and optimizations. We’re excited to see what the next generation of the worker farm looks like for us.

Queuing Up Heavy Tasks to an Autoscaling Worker Farm

Of Message Queues and Workers

Autoscaling With Lifeguard

Separate Worker Pools, Independently Scaled

Additional Behavior

Closing Thoughts

Recommended Reading

Migrating Millions of Users in Broad Daylight

Faster and Cheaper: How Hudl Saved 50% and Doubled Performance

How Our Product Team Works