There and Back Again: How a Seemingly Well-Planned Server Move Crashed, Burned, and Rose from the Ashes


Photo by hisperati

About 8 months ago I acquired a small startup called HitTail. You can read more about the acquisition here.

When the deal closed, the app was in bad shape. Within 3 weeks I had to move the entire operation, including a large database, to new servers. This required my first all-nighter in a while. Here is an excerpt of an email I sent to a friend the morning of September 16, 2011 at 6:47 am:

Subject: My First All Nighter in Years

Wow, am I tired. Worst part is my kids are going to be up in the next half hour. This is going to hurt :-)

But HitTail is on a new server and it seems to be running really well. Feels great to have it within my control. There are still a couple pieces left on the old server, but they are less important and I’ll have them moved within a week.

I’ll write again in a few hours with the whole story. It’s insane how many things went wrong.

What follows is the tale of that long night…

The Setup
I acquired HitTail in late August and it was in bad shape. I had 3 priorities that I wanted to hit hard and fast, in the following order:

  1. Stability – Before I acquired it the site went down every few months, sometimes for several days on end. To stabilize it I needed to move it to new servers, fix a few major bugs, and move 250GB of data (1.2 billion rows).
  2. Plug the Funnel – Conversion rates were terrible; I needed to look at each step of the process and figure out how to improve the percentage of people making it to the next step.
  3. Spread the Word – Market it.

The first order of business was to move to new servers, which involved overnighting 250GB of data on a USB drive to the new hosting facility, restoring from backup, setting up database replication until the databases were in sync, taking everything offline for 2-3 hours to merge two servers into one, testing everything, and flipping the switch.

If only it were that simple.

Thursday Morning, 10am: An Untimely Crash
We had planned to take the old servers offline at 10pm Pacific on Thursday, but around 10am one of them went down. The server itself was working, but the hard drive was failing and it wouldn’t serve web requests.

I received a few emails from customers asking why the server was down and I was able to explain that this should be the last downtime for many months. Everyone was appreciative and supportive. But I spent most of the day trying to get the server to stay alive for 12 more hours, with no luck. Half of the users had no access to it for the final 12 hours on the old servers.

Thursday Afternoon, 2pm: Where’s the Data?
We had meticulously planned to have all the data replicated between the old and new database to ensure they would be in sync when we went to perform the migration. But there was one problem…

Replication had silently failed about 48 hours beforehand, and neither my DBA nor I had noticed. So at 2pm we realized we were literally gigabytes behind on the synchronization. With only 8 hours until the migration window began, we zipped up those gigabytes and started copying them from server to server. The dialog said it would take 5.5 hours – no problem! We had 2.5 hours of leeway.
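In hindsight, a dumb heartbeat check would have caught this within minutes instead of 48 hours. Here’s a minimal sketch of the idea in C#, assuming a hypothetical ReplHeartbeat table that the old (publisher) server updates every few minutes and replication carries to the new one; the server names, table, and threshold are placeholders, not HitTail’s actual setup:

    using System;
    using System.Data.SqlClient;

    // Minimal replication-lag check (a sketch, not HitTail's code). Assumes a
    // hypothetical ReplHeartbeat table that the publisher updates on a schedule
    // and that replication copies to the subscriber. Connection strings and the
    // alert threshold are placeholders.
    class ReplicationLagCheck
    {
        static DateTime LatestHeartbeat(string connectionString)
        {
            using (var conn = new SqlConnection(connectionString))
            using (var cmd = new SqlCommand("SELECT MAX(UpdatedAt) FROM ReplHeartbeat", conn))
            {
                conn.Open();
                return (DateTime)cmd.ExecuteScalar();
            }
        }

        static void Main()
        {
            DateTime publisher  = LatestHeartbeat("Server=old-db;Database=HitTail;Integrated Security=true");
            DateTime subscriber = LatestHeartbeat("Server=new-db;Database=HitTail;Integrated Security=true");

            TimeSpan lag = publisher - subscriber;
            if (lag > TimeSpan.FromMinutes(15))
            {
                // Email or page someone here; the point is that a silent failure
                // becomes a loud one long before it is 48 hours old.
                Console.WriteLine("ALERT: replication is {0} behind", lag);
            }
            else
            {
                Console.WriteLine("Replication lag: {0}", lag);
            }
        }
    }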

Thursday Night, 8pm: Copy and Paste
My DBA checked on the copying a few times in the middle of his night (he’s in the UK), and it eventually failed with 10% remaining. At that point we knew that even if we got everything done according to plan we wouldn’t have data for the most recent 48 hours. Grrr.

In a last-ditch effort to get the data across the wire, we selected the rows from the single table that changes the most and copied them to the new server, which took only around 5 minutes. The problem was that this data should have been processed by our fancy keyword suggestion algorithm, and it hadn’t been.
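If you’ve never had to do this under the gun, the mechanics are simple: something like the sketch below, which uses SqlBulkCopy to stream the recent rows straight from the old server into the new one. The table, columns, and cutoff date are made up for illustration rather than HitTail’s actual schema:

    using System;
    using System.Data.SqlClient;

    // A sketch of the one-table catch-up copy: stream everything written since
    // the last good sync into the new server with SqlBulkCopy. Table name,
    // columns, and the cutoff date are hypothetical.
    class CatchUpCopy
    {
        static void Main()
        {
            var since = new DateTime(2011, 9, 14); // roughly when replication died

            using (var source = new SqlConnection("Server=old-db;Database=HitTail;Integrated Security=true"))
            using (var target = new SqlConnection("Server=new-db;Database=HitTail;Integrated Security=true"))
            {
                source.Open();
                target.Open();

                var selectCmd = new SqlCommand(
                    "SELECT HitId, SiteId, Keyword, CreatedAt FROM Hits WHERE CreatedAt >= @since",
                    source);
                selectCmd.Parameters.AddWithValue("@since", since);
                selectCmd.CommandTimeout = 0;

                using (var reader = selectCmd.ExecuteReader())
                using (var bulk = new SqlBulkCopy(target))
                {
                    bulk.DestinationTableName = "Hits";
                    bulk.BatchSize = 10000;     // commit in chunks
                    bulk.BulkCopyTimeout = 0;   // no timeout on a long copy
                    bulk.WriteToServer(reader); // streams rows across the wire
                }
            }
        }
    }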

Friday Morning, 1am: Panic Sets In
It was about this time I began to panic. Trying to process the data using the existing code was not going to happen – the algorithm is written in a mix of JavaScript and Classic ASP, which couldn’t be executed the same way it is when the app runs under normal conditions. And it wouldn’t run fast enough to process the millions of rows that needed it in any realistic amount of time.

So I did one of the craziest things I’ve done in a while. I spent 4 hours, from approximately 11pm until 3am, writing a Windows Forms app that combined the JavaScript and Classic ASP into a single assembly – with the JavaScript being compiled directly into a .NET DLL that I referenced from C# code. Then I translated the Classic ASP (VBScript) into C# and prayed it would work.

And after 2 hours of coding and 2 hours of execution, it worked.
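The trick is less exotic than it sounds: the JScript.NET compiler (jsc.exe) that ships with the .NET Framework can compile a .js file into a class library, and C# can then reference that DLL like any other assembly. Here’s a toy version of the idea, with made-up names rather than HitTail’s actual algorithm:

    // The JavaScript side: a file like Suggest.js containing a JScript.NET class,
    // for example:
    //
    //   class SuggestEngine {
    //     static function Score(keyword : String) : double {
    //       // ...the original JavaScript logic, mostly unchanged...
    //     }
    //   }
    //
    // compiled into a library with the framework's JScript compiler:
    //
    //   jsc /target:library /out:Suggest.dll Suggest.js
    //
    // The C# side references Suggest.dll (plus Microsoft.JScript.dll) and calls
    // into the compiled JavaScript directly. All names here are illustrative.
    using System;

    class BacklogProcessor
    {
        static void Main()
        {
            // Call straight into the compiled JavaScript as if it were any .NET type.
            double score = SuggestEngine.Score("some long tail keyword");
            Console.WriteLine("Score: {0}", score);

            // The Classic ASP (VBScript) half of the algorithm was hand-translated
            // into ordinary C# methods, then the whole thing was pointed at the
            // unprocessed rows and left to churn for a couple of hours.
        }
    }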

Friday Morning, 3am: It Gets Worse…
During this time the DBA was trying to merge two databases – copying 75 million rows of data from one DB to another. This was supposed to take 3-5 hours based on tests the DBA had run.

But it was taking 20x longer. The disks were insanely slow because we were running during their backup window and some of the data was on a shared SAN. By 3am, when we should have been wrapping up the entire process, we were 10% done with the 75 million rows.

By 6am we had to make a call. It was already 9am on the East Coast and customers would surely be logging in soon if they hadn’t already. I was exhausted and at my wits’ end with the number of unexpected failures, and I asked my DBA what our options were.

After some discussion we decided to re-enable the application so users could access it while we continued copying the 75 million rows, throttled with a forced sleep so the app could run with surprisingly little performance impact. It took 4 days for all of the data to copy, but we didn’t receive a single complaint in the meantime.
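The throttled copy itself was nothing fancy. Roughly, it looked like the sketch below, assuming the rows have a sequential key and the old table is reachable from the same connection; the table names, columns, and tuning numbers are all made up:

    using System;
    using System.Data.SqlClient;
    using System.Threading;

    // Sketch of the throttled copy: move rows in small keyed batches and sleep
    // between batches so the live app keeps most of the disk and CPU.
    // Table names, columns, and tuning numbers are hypothetical.
    class ThrottledCopy
    {
        static void Main()
        {
            const int batchSize = 50000;
            long lastId = 0;

            using (var conn = new SqlConnection("Server=new-db;Database=HitTail;Integrated Security=true"))
            {
                conn.Open();

                while (true)
                {
                    // 1. Find the upper key of the next batch in the old table.
                    var boundCmd = new SqlCommand(@"
                        SELECT MAX(HitId) FROM (
                            SELECT TOP (@batch) HitId
                            FROM dbo.Hits_Old
                            WHERE HitId > @lastId
                            ORDER BY HitId) AS nextBatch;", conn);
                    boundCmd.Parameters.AddWithValue("@batch", batchSize);
                    boundCmd.Parameters.AddWithValue("@lastId", lastId);

                    object bound = boundCmd.ExecuteScalar();
                    if (bound == null || bound == DBNull.Value) break; // nothing left to copy

                    long upper = Convert.ToInt64(bound);

                    // 2. Copy that key range across.
                    var copyCmd = new SqlCommand(@"
                        INSERT INTO dbo.Hits (HitId, SiteId, Keyword, CreatedAt)
                        SELECT HitId, SiteId, Keyword, CreatedAt
                        FROM dbo.Hits_Old
                        WHERE HitId > @lastId AND HitId <= @upper;", conn);
                    copyCmd.Parameters.AddWithValue("@lastId", lastId);
                    copyCmd.Parameters.AddWithValue("@upper", upper);
                    copyCmd.CommandTimeout = 0;
                    copyCmd.ExecuteNonQuery();

                    lastId = upper;

                    // The "forced sleep" that trades copy speed for app responsiveness.
                    Thread.Sleep(TimeSpan.FromSeconds(5));
                }
            }
        }
    }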

Lucky for us, no one noticed. And the app running on the new hardware was 2-3x more responsive than on the old setup.

Friday Morning, 6:45am: Victory
After some final testing I logged off at 6:45am Friday morning with a huge sigh of relief. I dashed off the email you read in the intro, then headed to bed to be awoken by my kids an hour later. Needless to say, I had a big glass of wine the following night.

Building a startup isn’t all the fun you read about on Hacker News and TechCrunch…sometimes it’s even better.

7 comments

#1 Andrew Youderian on 06.06.12 at 3:37 pm

Wow, what a nightmare! Glad you were able to get everything moved over properly. Moving servers – even for small, basic apps – is usually error prone. I can’t imagine the process you went through.

I’m a recent reader and subscriber to your podcast! I’ve been enjoying ‘Startups for the Rest of Us’, and just finally got around to giving you a 5-star rating. Hope you guys are still ahead in your contest!

Missed Microconf this year, but have plans to make it next year if you guys do it again….

#2 Tom Mollerus on 06.06.12 at 4:29 pm

Wow, that was a huge effort by you and your DBA. Was there a reason you didn’t perform the migration during a weekend, when your customers might have been less affected by any downtime?

Rob Reply:

Weekends have more backups running, so the disk would have been even slower. Also, given that the site had been going down every 2-3 months, some planned downtime for upgrades was not a big deal to the users who remained.

BTW – I’m glad we decided to do it that Thursday evening instead of waiting until the weekend, given that the server crashed Thursday morning.

#3 Progress Paddy on 06.07.12 at 7:04 am

This sounds scary.

It’s at times like those that I’m sure other people in your shoes would have given up and allowed a lot of data to be lost.

#4 David Urmann on 06.08.12 at 12:13 pm

Sounds like a Nightmare. I have a few words of advice to add –
#1) Assume from the very start of the problem that it’s going to be worse than you expect and start thinking of all the options on the table.
#2) Never give up – work the problem from every angle. If you have the staff, work it from different angles at the same time if possible.
#3) Have a backup plan already in place – if a mirror of everything is set up on another server, you can change the DNS and at worst you’re only down for propagation time.

Great post.

#5 emile on 06.11.12 at 5:54 pm

After all of this it looks like fun, but in the middle it’s like hell on Earth. I have been in similar situations. I’m glad you got it done. Good job.

#6 Jeff Huckaby on 06.20.12 at 9:57 am

I feel your pain. We had a 2TB RAID 5 array with a disk failure. We had the maintenance window scheduled for that night when another disk failed, causing a full RAID failure. There was about 1TB of data we had to pull from backups. Not fun.

I’ve also had a physical data center move where the provider failed to switch IPs to the new facility. Everything was in place, all 15 servers up and running, but no network connectivity. We had to go to the old facility and put in a proxy to the new one until the network guys got the routing sorted.

In all cases, keep your clients informed. We use Tumblr, Facebook and Twitter for when our own operations are out of whack.