I’m the Memrise CTO, and I wanted to apologize for, and explain, the recent downtime. I’m really sorry about it. Here’s roughly what happened.
My day started with a phone call at around 5am GMT from Ben, our Content & Community Manager in Beijing, concerned because Memrise.com appeared to be unavailable.
We tried to contact our servers, but they were as surly and silent as a teenager at dinner. It turns out that our data centre was under attack from a massive electrical storm, which caused a big power failure, bringing down all our servers. We’ll talk later about what we’re doing to avoid this happening again.
It took hours and hours for all of our servers to reawaken, but they all did eventually. So far so good.
It was then that Evan (in California) realized that some of the hard disks had been corrupted by being shut down so unceremoniously. This isn’t ever supposed to happen, because we store our data specially on twin mirror-image hard disks, so that if one fails, you always have the brother to fall back on. [In fact, this same tactic was used by Tudor kings, and explains how Henry VIII ended up being married to his late brother’s betrothed.]
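For the curious, the twin-disk tactic is essentially RAID-1 mirroring. Here’s a toy sketch of the idea in Python — not our actual storage stack, just an illustration of why losing one disk normally isn’t fatal:

```python
import os

class MirroredStore:
    """A toy RAID-1-style store: every write goes to two 'disks',
    so if one copy is lost or corrupted, the twin survives.
    (Illustrative only -- real mirroring happens below the filesystem.)"""

    def __init__(self, path_a, path_b):
        self.paths = (path_a, path_b)

    def write(self, data: bytes) -> None:
        # Write the same bytes to both disks before reporting success.
        for path in self.paths:
            with open(path, "wb") as f:
                f.write(data)
                f.flush()
                os.fsync(f.fileno())

    def read(self) -> bytes:
        # Fall back to the twin if the first copy is missing or unreadable.
        for path in self.paths:
            try:
                with open(path, "rb") as f:
                    return f.read()
            except OSError:
                continue
        raise IOError("both mirrors lost")
```

The catch, as we found out, is that a sudden power cut can corrupt both twins at once, mid-write — which mirroring alone can’t protect against.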
And here’s the thing about modern databases: they write out all the work they do to a special log, as a safeguard, so they’re designed to survive pretty much anything. Even so, I was nervous. If we weren’t able to easily repair the corruption, we’d have had to decide whether to keep the site down for longer while we looked for solutions, or rewind back to our most recent backup from yesterday.
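In miniature, the write-ahead-log safeguard looks something like the sketch below — a toy version, assuming a simple key-value store, nothing like the real machinery inside MySQL or Postgres. The key property is that a change hits the log on disk *before* it’s applied, so after a crash the database can replay the log and recover:

```python
import json
import os

class TinyDB:
    """A toy database with a write-ahead log (WAL): changes are appended
    to a log file before being applied, so a restart can replay the log
    and recover everything. (Illustrative only, not a real database.)"""

    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}
        self._replay()

    def _replay(self):
        # On startup, rebuild state by replaying every logged change.
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path) as f:
            for line in f:
                entry = json.loads(line)
                self.data[entry["key"]] = entry["value"]

    def set(self, key, value):
        # Log first, apply second: if we crash between these two steps,
        # the replay on restart brings the change back.
        with open(self.log_path, "a") as f:
            f.write(json.dumps({"key": key, "value": value}) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self.data[key] = value
```

That replay-on-restart step is exactly what our servers were grinding through all morning.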
Fortunately, our repair steps finished running around lunchtime and looked promising, so I’m pretty sure that all the safeguards worked as they were supposed to. Great, devil’s alternative skirted – so we were potentially ready to bring the site back up.
One last dilemma though – we had coincidentally scheduled a major maintenance period for this weekend. We hate the site being down – it’s like dead air to a DJ, it’s that feeling of angry butterflies in your stomach, buzzards in the sky, of black cats crossing your path and peeing on your bare feet. I hate it when the site is down, and I know you do too. But in the end, we decided to go straight into the scheduled day of maintenance, get it all done, and get the site back up permanently as soon as possible.
So, that’s what’s happening right now. We should be done in about 12 more hours (mid-afternoon UK time, early morning California time). I can’t wait. We’ll keep you posted if there are any delays.
UPDATE at 5pm GMT on Sunday: Loading the database onto the new server failed as a result of the corruption we experienced during the power failure on Friday. Very disappointing. We’re really keen to get things up and running again asap, so we’re now concurrently trying to load the database in a few different ways on multiple servers. As soon as any of those strategies finishes, we’ll be ready to go ahead. In short though, it’s going to be hours and hours waiting for this to finish, i.e. morning or afternoon GMT, before we’re live.
UPDATE at 2am GMT on Sunday/Monday: the corrupted database is still stymying us a little, so we’ve switched to a more careful but laborious tactic where we load in the data piecemeal. There are still some roadblocks, but we’re hoping to be ready by the time most of you wake up on Monday.
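For the technically minded, the piecemeal tactic is roughly this: instead of restoring one giant dump, which fails wholesale if any part of it is corrupt, you load each table on its own and set aside the ones that fail, to retry or repair later. A rough sketch (the table names and load function are hypothetical, not our real schema):

```python
def load_piecemeal(tables, load_table):
    """Load tables one at a time so a single corrupt table doesn't
    sink the whole restore. Returns the tables that failed, with
    their errors, for later retry or repair. (Hypothetical sketch.)"""
    failed = []
    for table in tables:
        try:
            load_table(table)
        except Exception as exc:
            # Record the failure and carry on with the rest.
            failed.append((table, exc))
    return failed
```

It’s slower than a single bulk restore, but it means one bad table costs us one table’s worth of work, not the whole night’s.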
UPDATE at 3.30pm GMT on Monday: everything is proceeding, but it’s taking longer and longer as the new database fills up. Rather than make another projection, I’ll just say that we’re doing everything we can to usher things along as fast as possible, with the plan that we’ll be back up sometime today.
UPDATE at 11.45pm GMT on Monday: we’ve managed to load in all the big tables. But we had to do it in a more piecemeal, laborious fashion, and that has left a few glitches we still need to iron out. If they prove easy, we’ll be done soon. If they prove complicated, it’ll be longer.