Measuring Availability: Instead of Nines, Let’s Count Minutes

It’s hard to find detailed explanations about how companies go about computing and tracking their availability, particularly for complex SaaS websites. Here’s how we do it for our primary web application, hudl.com.

Running sites with high availability is a foregone conclusion for most businesses. Availability is pretty easy to define abstractly, but it’s rarely explained with real examples. Most likely, the first half-dozen Google results you come across when searching for it will spend many words musing about “nines”, equating them to precise minute values, and telling you just how many of these “nines” you might need. It’s harder to find detailed explanations of how companies actually compute and track their availability, particularly for complex SaaS websites. Here’s how we do it for our primary web application, hudl.com.

Calculating Overall Server-Side Availability

We measure our server-side availability per minute by aggregating access logs from our NGINX servers, which sit at the top of our application stack.

Our NGINX logs are similar to the default format, and for availability tracking we’re interested in the status code and elapsed time. We tack on some service information, which I’ll talk about in a bit.

For each minute, we count the number of successful and failed responses. A request is considered unsuccessful if we respond with a 5XX HTTP status or if it takes longer than five seconds to complete.
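As a rough illustration, here’s a minimal Python sketch of that per-minute tally. The log pattern, field names, and timestamp format are assumptions for the example, not our actual NGINX configuration.

```python
import re
from collections import defaultdict
from datetime import datetime

# Hypothetical log layout: "[time] "request" status request_time service".
# The exact fields and format are assumptions for this sketch.
LOG_PATTERN = re.compile(
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<request_time>[\d.]+) (?P<service>\S+)'
)

SLOW_THRESHOLD_SECONDS = 5.0


def tally_minute_counts(log_lines):
    """Count successful and failed requests per minute.

    A request is treated as failed when the status is 5XX or the
    response took longer than five seconds to complete.
    """
    counts = defaultdict(lambda: {"success": 0, "failure": 0})
    for line in log_lines:
        match = LOG_PATTERN.search(line)
        if not match:
            continue
        status = int(match.group("status"))
        elapsed = float(match.group("request_time"))
        timestamp = datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z")
        minute = timestamp.replace(second=0, microsecond=0)

        failed = status >= 500 or elapsed > SLOW_THRESHOLD_SECONDS
        counts[minute]["failure" if failed else "success"] += 1
    return counts
```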

Each individual minute is then categorized by its percentage of successful requests (a minimal sketch of the bucketing follows the list):

  • If less than 90% of requests succeed, the minute is considered down.
  • If at least 90%, but less than 99%, the minute is degraded.
  • Otherwise (>= 99%), the minute is up.
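Here’s a minimal sketch of that bucketing, assuming minutes with no recorded traffic count as up (the post doesn’t cover that case):

```python
def classify_minute(success_count, failure_count):
    """Bucket a minute as 'up', 'degraded', or 'down' by its success rate."""
    total = success_count + failure_count
    if total == 0:
        return "up"  # Assumption: a minute with no traffic counts as up.
    success_rate = success_count / total
    if success_rate < 0.90:
        return "down"
    if success_rate < 0.99:
        return "degraded"
    return "up"
```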

The down/degraded/up buckets help reflect seasonality by weighting the more heavily-accessed features, and also help separate site-wide critical downtime from individual feature or service outages.

Here’s a three-hour period that includes a brief incident, when web servers in our highlights service spiked to 100% CPU for a couple of minutes:

For 2016 we’ve set a goal of no more than 120 individual minutes of downtime and 360 minutes of degraded service. We arrived at these thresholds by looking at previous years and forecasting, while also pushing ourselves to improve. Admittedly, the target could be quantified as a “nines” percentage, but counting minutes is more straightforward and easier to track than a target uptime of 99.977%.
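For reference, that percentage is just the downtime budget divided by the minutes in a (non-leap) year:

```python
MINUTES_PER_YEAR = 365 * 24 * 60          # 525,600
DOWNTIME_BUDGET_MINUTES = 120

uptime = 1 - DOWNTIME_BUDGET_MINUTES / MINUTES_PER_YEAR
print(f"{uptime:.3%}")                    # 99.977%
```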

By the Microservice

Hudl is split into smaller microservices, each serving a few pages and API endpoints. Along with the overall availability described above, we track availability for each of these services independently to help identify contributors to availability issues. Regularly looking into how each service performs lets us know where to focus optimization or maintenance efforts.
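A sketch of the same tally keyed by service, reusing the earlier `classify_minute` helper; the tuple shape is hypothetical and assumes the service name tagged onto each log line above:

```python
from collections import defaultdict


def tally_by_service(parsed_requests):
    """Aggregate success/failure counts per (service, minute).

    `parsed_requests` is assumed to yield (service, minute, failed)
    tuples from the log-parsing step; the shape is illustrative only.
    """
    counts = defaultdict(lambda: {"success": 0, "failure": 0})
    for service, minute, failed in parsed_requests:
        counts[(service, minute)]["failure" if failed else "success"] += 1
    return counts


def service_minute_buckets(parsed_requests):
    """Classify each (service, minute) with the same down/degraded/up rules."""
    return {
        key: classify_minute(c["success"], c["failure"])
        for key, c in tally_by_service(parsed_requests).items()
    }
```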

Here’s an example of a service that performs inconsistently and needs some work:

Stacking these metrics up side by side helps us see which services are higher volume and which are performing relatively better or worse:

Shortcomings

There are a few things our algorithm doesn’t cover that are worth noting.

The argument could be made that 4XX HTTP responses should also be counted as failed requests. We go back and forth on this: we’ve had server-side problems manifest as 404s before, but it’s tough to tell the difference between “good” and “bad” 404s. We’ve opted to exclude them for now.

It doesn’t cover front-end issues (e.g. bad JavaScript) since it’s solely a server-side measurement. We have some error tracking and monitoring for client code, but it’s a part of our system where our visibility is more limited.

The server-side nature of the monitoring also doesn’t cover regional issues like ISP troubles and errors with CDN POPs, which come up from time to time. These issues still affect our customers, and we do what we can to identify problems and help route around them, but it’s another visibility gap for us.

Never Perfect, Always Iterating

We’ve iterated on the algorithm several times over the years to make sure we hold ourselves accountable to our users. We have alerts on availability loss and investigate incidents prudently. If users are having a rough time and it’s not reflected in our availability, we change how we run the numbers. This is what works well for us today, but I expect it to change as we move forward with our systems and our business.