Much like flying, migrating database servers involves hours and hours of boredom punctuated by moments of excitement and despair.
It’s been a long couple of days getting things to work, but we hope that this is going to help simplify and smooth operations going forward.
I’m not going to get into the nitty-gritty of why it took more than twice as long as planned unless you’d like me to. In short, the database got a little corrupted by Zeus unexpectedly hurling thunderbolts at our data centre in Virginia. We thought we’d repaired it, but whenever we tried to move things from the old server to the new one, they failed in unexpected and varying ways. We eventually had to adopt a much more manual, laborious and piecemeal strategy, which is what took much of today.
We haven’t yet had our team postmortem, but I can already see a number of lessons coming out of this:
- We plan to rehearse for and guard against failures better. Having a catastrophic data center failure at the same time we were attempting major maintenance feels like having a tsunami hit you while you’re trying to build the Channel Tunnel. It’s bad luck, but we should have been better prepared.
- It is agonizing having the site down for minutes, let alone hours. But we didn’t lose anyone’s learning data.
- We need to do a better job with status updates. I always worry that as soon as I announce that we’re making progress, something will go wrong immediately afterwards and then we’ll have to eat our words. There’s got to be a balance between staying in close touch without over-promising…
- Getting to talk to all of you on the blog while we twiddled our thumbs was really generative, and huge fun – like the spontaneous night-time street parties that erupted in New York during the colossal power outage a few years ago…
- I’m very optimistic that the database heavy lifting we’ve previously had to do by hand will be much easier now. Over the next few months, I hope this will mean less downtime, improved performance even as we grow, less time spent on system administration, and more time spent making Memrise better!
One last bonus for anyone reading this far. I’ve set up a public Google Hangout called “Chat to the Memrise team”. I’ll be up for another hour or so (till 3 or 3.30am GMT Monday/Tuesday night) – pop in and say hello, harangue me about why it took so long, point out problems, or we can just keep one another company as we water our plants 🙂
One last time – sorry for the downtime, and thank you sincerely for your patience and good cheer!
Greg
Memrise CTO