How We Stay Sane with a Large AWS Infrastructure

We’ve been running hudl.com in AWS since 2009 and have grown to running hundreds, at times even thousands, of servers. As our business grew, we developed a few standards that help us make sense of our large AWS infrastructure.

Names and Tags

We use three custom tags for our instances, EBS volumes, RDS and Redshift databases, and anything else that supports tagging. They are extremely useful for cost analysis, but they also make it easy to filter API calls like DescribeInstances (see the sketch after the list below).

  • Environment — we use a single AWS account for all of our environments, so this tag helps us differentiate resources. We only use four values: test, internal, stage, or prod.
  • Group — this is ad hoc, and typically denotes a single microservice, team, or project. Because there are many projects ongoing at any given time, we discourage abbreviations to improve clarity. Examples at Hudl: monolith, cms, users, teamcity.
  • Role — within a group, this denotes the role the instance plays, like RoleNginx, RoleRedis, or RoleRedshift.
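
For example, here is a minimal boto3 sketch of pulling every running production instance for a given group and role. The tag values (prod, monolith, RoleRabbitMQ) are illustrative, not an exhaustive list of ours:

```python
# Sketch: filter EC2 instances by our Environment/Group/Role tags with boto3.
# The tag values below are illustrative examples.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.describe_instances(
    Filters=[
        {"Name": "tag:Environment", "Values": ["prod"]},
        {"Name": "tag:Group", "Values": ["monolith"]},
        {"Name": "tag:Role", "Values": ["RoleRabbitMQ"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)

for reservation in response["Reservations"]:
    for instance in reservation["Instances"]:
        # The Name tag is just another tag; pull it out for readability.
        name = next((t["Value"] for t in instance.get("Tags", []) if t["Key"] == "Name"), "")
        print(instance["InstanceId"], name, instance["Placement"]["AvailabilityZone"])
```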

We also name our instances. It makes talking about instances easier, which helps when firefighting. We use Sumo Logic for log aggregation, and our _sourceName (i.e., host name) values match up with our EC2 instance names, which makes comparing logs and CloudWatch metrics easier. We pack a lot of information into the name:

At a glance I can tell this is a production instance that supports our monolith. It’s a RabbitMQ server in the ‘D’ availability zone of the us-east-1 region. To account for multiple instances of the same type, we tack on the ‘id’ value; in this case it’s the first of its kind. For servers that are provisioned via Auto Scaling Groups, instead of a two-digit number we use a six-digit hash: short enough that humans can keep it in short-term memory, and long enough to provide uniqueness.
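
To make the pieces concrete, here is a hedged sketch of how such a name might be assembled. The exact field order and abbreviations below are an assumption for illustration, not necessarily our exact format:

```python
# Hypothetical sketch of building a name from the pieces described above;
# the field order and abbreviations are illustrative assumptions.
def instance_name(env, group, role, az, uid):
    """e.g. instance_name('p', 'monolith', 'rabbit', 'use1d', '01')"""
    return "-".join([env, group, role, az, uid])

# A fixed-count server gets a two-digit id; an Auto Scaling Group member
# would get a six-character hash (e.g. 'a3f9c1') instead.
print(instance_name("p", "monolith", "rabbit", "use1d", "01"))
```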

Security Groups & IAM Roles

Note: if you are already familiar with Security Groups and IAM Roles, skip this paragraph. Security groups are simple firewalls for EC2 instances: we can open ports to specific IPs or IP ranges, or we can reference other security groups. For example, we might open port 22 to our office network. IAM Roles are how instances are granted permission to call other AWS web services, and they are useful in a number of ways. Our database instances all run regular backup scripts, and part of each script uploads the backups to S3. IAM Roles let us grant S3 upload ability, but only to our backups S3 bucket, and the instances can only upload, not read or delete.
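
A minimal sketch of that kind of upload-only grant, attached to an instance role with boto3. The bucket, role, and policy names are placeholders, not our real ones:

```python
# Sketch: an upload-only (PutObject, no read/delete) policy for a backups
# bucket, attached to an instance role. All names are placeholders.
import json
import boto3

iam = boto3.client("iam")

backup_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:PutObject"],
            "Resource": "arn:aws:s3:::example-db-backups/*",
        }
    ],
}

iam.put_role_policy(
    RoleName="p-monolith-mongodb",      # follows {environment}-{group}-{role}
    PolicyName="backups-upload-only",
    PolicyDocument=json.dumps(backup_policy),
)
```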

We have a few helper security groups like ‘management’ and ‘chef’. When new instances are provisioned, we create a security group that matches the {environment}-{group}-{role} naming convention. This is how we keep our security groups minimally exposed, and the naming makes them easier to reason about and audit: if we see an “s-” security group referenced from a “p-” security group, we know there’s a problem.
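
That audit is easy to mechanize. Here is a boto3 sketch (assuming the “p-”/“s-” environment prefixes shown above) that flags any production security group referencing a stage one:

```python
# Sketch: flag any "p-" security group whose inbound rules reference an
# "s-" security group. The prefixes are assumed from the examples above.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
groups = ec2.describe_security_groups()["SecurityGroups"]
names_by_id = {g["GroupId"]: g["GroupName"] for g in groups}

for group in groups:
    if not group["GroupName"].startswith("p-"):
        continue
    for permission in group["IpPermissions"]:
        for pair in permission.get("UserIdGroupPairs", []):
            source = names_by_id.get(pair.get("GroupId", ""), pair.get("GroupId", ""))
            if source.startswith("s-"):
                print(f"Problem: {group['GroupName']} allows traffic from {source}")
```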

We keep the {environment}-{group}-{role} convention for our IAM Role names as well. Again, this lets us grant minimal AWS privileges to each instance while making it easy for us humans to be sure we are viewing or editing the correct roles.

Wrap it Up

We’ve adopted these naming conventions and made them just part of how folks provision AWS resources at Hudl. They make it easier to understand how our servers are related to each other and who can communicate with whom on which ports, and they let us filter precisely via the API or from the management console. For very small infrastructures, this level of detail is probably unnecessary. However, as you grow beyond tens and definitely past hundreds of servers, standards like these will keep your engineering teams sane.