Migrating Millions of Users in Broad Daylight

In August we migrated our core user data (around 5.5MM user records) from SQL Server to MongoDB. The migration was part of an ongoing effort to reduce dependency on our monolithic SQL Server instance (which is a single point of failure for our web application), and also isolate data and operations within our system into smaller microservices.

The migration was a daunting task – our user data sees around 800 operations/second and is critical to most requests to our site. We moved the data during the daytime while still taking full production traffic, maintaining nearly 100% availability for reads and writes during the course of the migration. Our CPO fittingly described it as akin to “swapping out a couple of the plane’s engines while it’s flying at 10,000 feet.” I’d like to share our approach to the migration and some of the code we used to do it.

The Big Picture

Our user records have IDs that are numeric and sequential. Starting at ID 1, users are migrated in ranges of 1000 IDs at a time, moving sequentially upward through all user IDs (from 1 to about 5.5MM). The critical state during the migration is the “point” user ID. All reads and writes for user IDs below the point (i.e. migrated users) go to MongoDB; operations for unmigrated IDs at or above the point continue going to SQL.

For the range being migrated (the point user ID through point + 999), write operations are locked; code along all write paths throws an exception if the write is for a user ID in the locked range. This guarantees that requests are never “lost” by sending a write to SQL for a user whose record has already been copied to MongoDB. Reads for the migrating range still go to SQL and return results, and all users in the migration range are considered “pending” until every record is completely migrated and the point user ID advances to the beginning of the next range. A batch of 1000 users takes about 500ms to migrate, so the probability of a write request colliding with a user being migrated is low.
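
To make the routing rule concrete, here’s a minimal sketch of the decision in C#. The names here (PointRouting, UserStore) are illustrative stand-ins, not our actual code:

```csharp
using System;

public enum UserStore { Sql, MongoDb }

public static class PointRouting
{
    // pointUserId is the first unmigrated ID; while a batch is in flight,
    // the locked range is [pointUserId, pointUserId + 999].
    public static UserStore RouteRead(long userId, long pointUserId) =>
        userId < pointUserId ? UserStore.MongoDb : UserStore.Sql;

    public static UserStore RouteWrite(long userId, long pointUserId, bool batchLocked)
    {
        // Writes to users that are mid-migration fail fast with an exception.
        if (batchLocked && userId >= pointUserId && userId < pointUserId + 1000)
            throw new InvalidOperationException(
                $"User {userId} is locked for migration.");
        return RouteRead(userId, pointUserId);
    }
}
```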

Code Refactoring, Conditional Routing

A lot of prep work went into refactoring application code to introduce conditional behavior that checks user IDs to see if they’ve been migrated or not, invoking SQL or MongoDB where appropriate.

Hudl’s web application code is multi-layered: it has a domain layer (with interfaces like IUserDomain and IUserUpdateDomain) that encapsulates business logic, validation, etc., and the domain implementations call a data layer of DAO classes that work directly with the database(s). Web/MVC components (e.g. controllers and service endpoints) sit above the domain layer. Refactoring this meant:

Introducing the new MongoDB data layer and domain code to support the same behavior as the existing SQL domains.

Adding a “Migration Resolver” (MigrationResolver.cs) that decides which domain implementation to return for the requested user ID(s), and re-routing all higher-layer code (e.g. controllers, APIs, etc.) through the resolver instead of having it interface directly with the domain layer, as it previously did.

Introducing a “Migration State” class (UserMigrationState.cs) as the source of truth for where the migration stands (i.e. the point user ID, and whether a batch is locked for migration). A sketch of these pieces follows.
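
Here’s roughly what the state class and domain interface might look like. This is a simplified sketch with illustrative member names; the real versions handle persistence, thread safety, and a lot more:

```csharp
using System.Collections.Generic;

// Minimal user model and domain interface, standing in for the real ones.
public class User
{
    public long Id { get; set; }
    public string Email { get; set; }
}

public interface IUserDomain
{
    User GetUserById(long userId);
    List<User> GetUsersByIds(IEnumerable<long> userIds);
    User GetUserByEmail(string email);
}

// Source of truth for migration progress, consulted on every request.
public class UserMigrationState
{
    // First unmigrated user ID; everything below it lives in MongoDB.
    public long PointUserId { get; private set; }

    // True while the batch starting at PointUserId is being migrated.
    public bool BatchLocked { get; private set; }

    public bool IsMigrated(long userId) => userId < PointUserId;

    public bool IsLockedForMigration(long userId) =>
        BatchLocked && userId >= PointUserId && userId < PointUserId + 1000;

    public void Set(long pointUserId, bool batchLocked)
    {
        PointUserId = pointUserId;
        BatchLocked = batchLocked;
    }
}
```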

Post-refactor, here are a few examples of how different user operations behaved:

GetUserById(123) — MigrationResolver.cs (line 118)

The resolver checks the state singleton for the provided user ID. If the user is migrated, the resolver returns the MongoDB domain; otherwise it returns the SQL domain.
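
In sketch form, building on the types above (the real resolver does considerably more than this):

```csharp
public class MigrationResolver
{
    private readonly IUserDomain _sqlDomain;
    private readonly IUserDomain _mongoDomain;
    private readonly UserMigrationState _state;

    public MigrationResolver(
        IUserDomain sqlDomain, IUserDomain mongoDomain, UserMigrationState state)
    {
        _sqlDomain = sqlDomain;
        _mongoDomain = mongoDomain;
        _state = state;
    }

    // Route single-user operations to whichever store owns the user.
    public IUserDomain GetDomainFor(long userId) =>
        _state.IsMigrated(userId) ? _mongoDomain : _sqlDomain;
}
```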

GetUsersByIds({ 123, 456, 1200, 13457 }) — HybridUserLookupDomain.cs (line 60)

The resolver returns a hybrid domain that uses the state singleton to split the requested IDs into migrated and unmigrated subsets, then queries the MongoDB domain for the former and the SQL domain for the latter. Results from both are combined, sorted, and returned.
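
A sketch of that split-and-merge, again with illustrative names and using the batch lookup from the interface above:

```csharp
using System.Collections.Generic;
using System.Linq;

public class HybridUserLookupDomain
{
    private readonly IUserDomain _sqlDomain;
    private readonly IUserDomain _mongoDomain;
    private readonly UserMigrationState _state;

    public HybridUserLookupDomain(
        IUserDomain sqlDomain, IUserDomain mongoDomain, UserMigrationState state)
    {
        _sqlDomain = sqlDomain;
        _mongoDomain = mongoDomain;
        _state = state;
    }

    public List<User> GetUsersByIds(IEnumerable<long> userIds)
    {
        // Partition the IDs by which database currently owns them.
        var ids = userIds.ToList();
        var migrated = ids.Where(_state.IsMigrated).ToList();
        var unmigrated = ids.Where(id => !_state.IsMigrated(id)).ToList();

        // Query each store for its subset, then merge and sort so callers
        // see a single coherent result set.
        return _mongoDomain.GetUsersByIds(migrated)
            .Concat(_sqlDomain.GetUsersByIds(unmigrated))
            .OrderBy(u => u.Id)
            .ToList();
    }
}
```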

GetUserByEmail(“rob@example.com”) — HybridUserLookupDomain.cs (line 40)

The resolver returns a hybrid domain that queries MongoDB for the email first; if a user is found, it’s returned; otherwise SQL is queried for the same email.
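
An email gives no hint about which side of the point a user falls on, so the hybrid domain simply tries one store and falls back to the other. Continuing the sketch above:

```csharp
// Inside HybridUserLookupDomain: emails can't be routed by the point,
// so check MongoDB first and fall back to SQL if nothing is found.
public User GetUserByEmail(string email)
{
    return _mongoDomain.GetUserByEmail(email) ?? _sqlDomain.GetUserByEmail(email);
}
```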

Consistent layering and vertical separation between SQL and MongoDB code really helped us. By guaranteeing that everything went through the resolver and that all of our SQL and MongoDB operations were isolated from each other, we could be confident that we didn’t have stray SQL calls scattered throughout the code, reducing the chance that we missed or forgot one as we refactored and introduced our conditional routing. It also let us introduce migration-specific code, like our write-lock behavior, in just a few places, giving us confidence that we had everything covered. If you’re tackling a similar effort, consider some up-front work (independent of the migration effort itself) to improve your code layering if you don’t have it already.

The Migration Job

A custom, interactive console application (UserMigrationJob.cs) serves as the control process for the entire job. It provides commands to migrate single users or manual ranges for testing, and gives clear output about migration state and success. It’s written to cautiously expect success: if it ever encounters a response or state that it’s unfamiliar with, it aborts the migration and leaves the system in a known, stable state, printing out diagnostic information to help us fix the situation and resume after making corrections.

Here’s an early run of the job in our staging environment (the batches take a bit longer to migrate since our staging hardware is less powerful):

Our primary production webservers (about 20 c4.xlarge EC2 instances) serve as the workhorses for the migration; user IDs are split into batches and parallelized across them. Using the webservers is convenient because they already have all of the code needed to interact with both SQL and MongoDB, so there’s no need to copy or duplicate data-layer code into the job itself. It also makes the job easy to parallelize, since that just means splitting a range of user IDs across all the servers.

It does mean throwing some extra load at apps taking production traffic, but that’s mitigated by being prepared with our systems monitoring, logging, and our ability to throttle or stop the job if necessary.

The migration job itself starts at user ID 1 and works upward in batches of 1000. For each iteration migrating a range of users, it does the following (a sketch of the loop follows the list):

  • Splits the batch of 1000 IDs into one sub-batch per webserver (batch size / N IDs each).
  • Makes an HTTP request to set the migration state on each webserver, which sets the point user ID and locks writes for the entire range about to be migrated.
  • Sends one sub-batch to each webserver. The webservers query their batch of users from SQL, insert them into MongoDB, and heavily verify the migration by re-querying both databases and comparing results. Each webserver responds with a status and some result data.
  • Makes an HTTP request to each webserver to move the point user ID and unlock the batch. The point is moved up to the next unmigrated user ID.
  • Repeats.
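
Here’s a rough sketch of that control loop. The client interface and its methods are hypothetical stand-ins for the job’s actual HTTP plumbing:

```csharp
using System;
using System.Collections.Generic;

// Stand-in for the HTTP calls the job makes to each webserver.
public interface IWebserverClient
{
    void SetMigrationState(long pointUserId, bool locked);
    bool MigrateRange(long startId, long endId); // true if verified OK
}

public class UserMigrationJob
{
    private const int BatchSize = 1000;
    private readonly IReadOnlyList<IWebserverClient> _servers;
    private readonly long _maxUserId;

    public UserMigrationJob(IReadOnlyList<IWebserverClient> servers, long maxUserId)
    {
        _servers = servers;
        _maxUserId = maxUserId;
    }

    public void Run()
    {
        for (long point = 1; point <= _maxUserId; point += BatchSize)
        {
            // 1. Set the point and lock writes for the batch on every server.
            foreach (var server in _servers)
                server.SetMigrationState(point, locked: true);

            // 2. Give each server one sub-batch. (The real job issues these
            //    requests concurrently; the sketch runs them in sequence.)
            long subSize = BatchSize / _servers.Count;
            var ok = true;
            for (int i = 0; i < _servers.Count && ok; i++)
            {
                long start = point + i * subSize;
                long end = i == _servers.Count - 1 ? point + BatchSize - 1
                                                   : start + subSize - 1;
                ok = _servers[i].MigrateRange(start, end);
            }

            // 3. On any failure, unlock writes, stop, and wait for a human.
            if (!ok)
            {
                foreach (var server in _servers)
                    server.SetMigrationState(point, locked: false);
                throw new InvalidOperationException(
                    $"Batch starting at user {point} failed; manual intervention needed.");
            }

            // 4. Advance the point past the batch and unlock.
            foreach (var server in _servers)
                server.SetMigrationState(point + BatchSize, locked: false);
        }
    }
}
```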

If at any point a step fails, or the state isn’t in sync across the servers, the job stops itself, abandons the current batch, unlocks writes, and echoes the current state for manual intervention.

Having the migration state mutable and stored in memory on the webservers makes the system a bit vulnerable. The state is critical for routing, and if it’s ever out of sync we’re potentially corrupting data by sending it to the wrong database. To mitigate this, the job persists the state to a database after each batch is migrated, and the webapps are coded to read the state on startup, or to refuse to start entirely if they can’t. The job is also adamant about ensuring the state is consistent, preferring to keep it correct and stop the migration rather than continue when things aren’t as they should be.
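
A minimal sketch of that fail-fast startup behavior, with a hypothetical store interface:

```csharp
using System;

// Hypothetical persistence interface for the migration state.
public interface IMigrationStateStore
{
    // Returns the persisted state, or null if it can't be read.
    PersistedState Load();
}

public class PersistedState
{
    public long PointUserId { get; set; }
    public bool BatchLocked { get; set; }
}

public static class MigrationStateBootstrap
{
    // Called during webapp startup. Routing with stale or missing state
    // could send writes to the wrong database, so refuse to start at all.
    public static PersistedState LoadOrFail(IMigrationStateStore store)
    {
        var state = store.Load();
        if (state == null)
            throw new InvalidOperationException(
                "Could not load user migration state; refusing to start.");
        return state;
    }
}
```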

Go Time

When everything was ready to go, we came in on a Monday morning and turned on the configuration toggle to start sending new user creation to MongoDB, and let that simmer for a little while, keeping an eye on metrics and logs for anything unexpected. During this and the remainder of the migration, we were in close contact with our awesome support team’s tech leads; they were plugged in on where we were at, and were looking out for any odd support calls or behavior that trended with the ranges of IDs for new users or users that had been migrated.

We started the actual migration by moving users 1–9999 on Monday afternoon. Again moving cautiously, we monitored the systems for the rest of the day and overnight before committing to a larger-scale run.

On Tuesday morning we ran the migration on users 10000–99999. This longer, sustained run let us understand how the load of the migration would impact the webservers and databases we were working with. After pushing the migration up to user 499999, we stopped again to monitor and observe. We did uncover a couple of minor bugs in our domain layer at this point, so we spent the remainder of the day coding, testing, and deploying those fixes.

Wednesday after lunch, we pulled the trigger and ran the migration on the remaining users, which took around an hour to complete. Nobody really even noticed we were migrating some of Hudl’s most critical data right out from under them; it was a typical Wednesday afternoon, and that’s exactly how we wanted it.

Wrapping Up, and a Few (of Many) Lessons Learned

After completing the migration, we saw our SQL Server steady-state CPU utilization drop from 25% to 15%. That was a pretty big win for us. Additionally, moving the data and code into its own microservice gave us a great bounded context to work within as we go forward, keeping it isolated and making changes lower-risk and easier to deploy.

One thing that’s really easy to do during migrations is let “little changes” creep in. You see some code that could use a little cleanup, or find a data type or method signature that could be improved a bit, and it’s easy to say, “oh hey, I’ll just fix that now while I’m in there.” My advice: don’t. Every little change adds risk and additional testing to something that’s already inherently full of risk. Make a note of those things and change them after you’re done with the migration. Trust me, it’ll keep you sane.

We started the project without dedicated QA, and that really hurt us in the long run; we crammed in a bunch of testing late in the effort. Try to get QA involved early and all the way through the process. On top of that, having good communication with our support team during the process helped us stay on top of any customer-facing issues that we didn’t notice in our metrics. Bottom line: don’t go too far off the grid. Being heads-down is important, but stay connected to keep the feedback coming.

Finally, make sure you’re monitoring everything you can. For us, that included things like:

  • MongoDB operations
  • CPU (and other system metrics) on the webservers, SQL Server, and the new MongoDB systems
  • User logins, accesses, updates, creates, etc. (for error rates, volume, and performance)
  • Application logs (we use SumoLogic to aggregate them), including many that we’d added specifically for the migration

It was an exciting, challenging (and at times grueling) project. I’m really proud of the team and all the time and effort put in to make it so successful. There’s so much more I could cover here, so if you’d like more detail or insight into something, hit us up on Twitter at @HudlEngineering.