Behind the Scenes of the Bloglines Datacenter Move (Part 1):
One week ago, we moved the Bloglines service from the AT&T datacenter in Redwood City, CA to MCI in Bedford, Massachusetts. This was a challenging and complex undertaking that required months of preparation by many groups. Now that the dust has settled, over the next couple of days I'll explain some of the process involved.
We had been at AT&T since Bloglines first went on-line in June, 2003, and had been very happy with them. AT&T is a tier 1 colocation facility. They aren't the cheapest, but we never had to worry about power outages or other issues that can crop up with other facilities. After we were acquired by Ask Jeeves in February, we started talking about moving the Bloglines service to the main Ask facility, which is in Massachusetts. This made sense for a number of reasons: it would be easier for operations, it would be easier for us to quickly expand in the future, and it would be easier for us to tie into other parts of Ask Jeeves.
Once the decision was made to move, we had two tasks: figure out how many machines to build out in Bedford, and figure out how to do the move with the minimum amount of downtime. In my experience, estimating how much hardware you'll need at some point in the future can be difficult, especially when you're growing quickly and don't have much history to go on. I believe in the concept of overwhelming firepower (when in doubt, double or triple it), so we overestimated everything. In the end, the new system has three times the number of machines that we were running in Redwood City, and each of those machines is probably twice as fast as any of the old boxes. Once operations had the configurations, they set about ordering, installing, and configuring the machines. That left us with having to figure out how to move the site across the country while minimizing downtime.
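As a rough illustration of what that overestimate adds up to, here is the arithmetic in a few lines of Python. It assumes "twice as fast" translates into a clean 2x throughput multiplier per machine, which is a simplification and not something we measured this way.

```python
def relative_capacity(machine_multiplier: float, speed_multiplier: float) -> float:
    """Aggregate capacity of the new site relative to the old one."""
    return machine_multiplier * speed_multiplier


# Figures from the post: three times the machines, each roughly twice as fast.
headroom = relative_capacity(3, 2)
print(f"New site has roughly {headroom:.0f}x the capacity of the old one")
```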
Behind the Scenes of the Bloglines Datacenter Move (Part 2):
The simplest (and safest) way to move a site is to take it completely down, copy all the data to the new machines, and then bring the site back up at the new datacenter. We could have done that, but the downtime would have run to days, and we didn't want that. Actually, an even simpler way to move a site is to physically take the machines and move them to the new datacenter. Going across the country, that still would have required probably 24 hours of downtime, factoring in the time to pull the machines from Redwood City, pack them, put them on an airplane, unpack them, reinstall them in Bedford, and reconfigure them for their new network environment. And after a journey like that, chances are some of the machines wouldn't come back up. So our only real option was to create a system that would copy at least a large amount of our data to the new datacenter in the background, while Bloglines was still live and operating.
The Bloglines back-end consists of a number of logical databases. There's a database for user information, including what each user is subscribed to, what their password is, etc. There's also a database for feed information, containing things like the name of each feed, the description for each feed, etc. There are also several databases which track link and guid information. And finally, there's the system that stores all the blog articles and related data. We have almost a billion blog articles in the system, dating back to when we first went on-line in June, 2003. Even compressed, the blog articles make up the largest chunk of data in the Bloglines system, by a large margin. By our calculations, if we could transfer the blog article data ahead of time, the other databases could be copied over in a reasonable amount of time, limiting our downtime to just a few hours.
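The kind of back-of-the-envelope calculation described above can be sketched as follows. Every number here is a hypothetical placeholder: the database sizes, the link speed, and the usable fraction of the link are all assumptions for illustration, not the real Bloglines figures.

```python
def transfer_hours(size_gb: float, link_mbps: float, efficiency: float = 0.7) -> float:
    """Hours needed to copy size_gb over a link_mbps link.

    efficiency is the assumed fraction of the link that is actually usable.
    """
    if link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_mbps must be positive and efficiency in (0, 1]")
    megabits = size_gb * 8 * 1000  # GB -> megabits, decimal units
    seconds = megabits / (link_mbps * efficiency)
    return seconds / 3600


# Hypothetical sizes in GB; the real sizes are not given in this post.
databases = {"users": 50, "feeds": 20, "links_and_guids": 80, "articles": 4000}
LINK_MBPS = 100  # hypothetical cross-country bandwidth

everything = transfer_hours(sum(databases.values()), LINK_MBPS)
without_articles = transfer_hours(
    sum(size for name, size in databases.items() if name != "articles"),
    LINK_MBPS,
)
print(f"copy everything during downtime: {everything:.1f} hours")
print(f"articles copied ahead of time:   {without_articles:.1f} hours")
```

With placeholder numbers like these, the gap between the two results is the whole argument: the article data dominates, so moving it ahead of time is what turns days of downtime into hours.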
We don't use a traditional database to store blog articles. Instead we use a custom replication system based on flat files and smaller databases. It works well and scales using cheap hardware. One possibility for transferring all this data was to use the Unix utility rdist. We had used rdist back at ONElist to do a similar datacenter move, and it worked well. Instead, we decided to extend the replication system so that it would replicate all the blog articles to the new datacenter in the background, while keeping everything sync'ed up. This was obviously a tricky bit of programming, but we decided it was the best way to accomplish the move, and it would give us functionality that we would need later (keeping multiple datacenters sync'ed up, for example).
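To give a feel for the general idea of copying in the background while the site keeps taking writes, here is a toy sketch. It is not the Bloglines replication code. It swaps in a generic technique, an append-only write log shipped in batches with a cursor, and the names `ArticleStore` and `Replicator` are invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class ArticleStore:
    """Toy article store: a dict of articles plus an append-only write log."""
    articles: dict = field(default_factory=dict)
    log: list = field(default_factory=list)

    def write(self, article_id: str, body: str) -> None:
        self.articles[article_id] = body
        self.log.append((article_id, body))


@dataclass
class Replicator:
    """Ships log entries from source to replica, remembering its position."""
    source: ArticleStore
    replica: ArticleStore
    cursor: int = 0

    def step(self, batch_size: int = 2) -> int:
        """Ship up to batch_size pending entries; return how many were sent."""
        batch = self.source.log[self.cursor:self.cursor + batch_size]
        for article_id, body in batch:
            self.replica.write(article_id, body)
        self.cursor += len(batch)
        return len(batch)

    def lag(self) -> int:
        """Number of source writes not yet shipped to the replica."""
        return len(self.source.log) - self.cursor


old_dc, new_dc = ArticleStore(), ArticleStore()
repl = Replicator(old_dc, new_dc)

for i in range(5):               # backlog that already exists at the old site
    old_dc.write(f"a{i}", f"body {i}")
repl.step()                      # background copy starts
old_dc.write("a5", "body 5")     # the site is still live and taking writes
while repl.lag():                # keep shipping until the replica catches up
    repl.step()

print(new_dc.articles == old_dc.articles)
```

The point of the cursor is that new writes arriving mid-copy are not a problem: they land at the end of the log and get shipped on a later step, so the source never has to stop.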
As the new machines were being built out at Bedford, work started on the blog article replication improvements. In the meantime, we still had a service to run. All growing database-driven Internet services have growing pains. All growing database-driven Internet services have scaling issues. That's just a fact of life. So, in the midst of all this, we couldn't stop working on improving the existing Bloglines site. It made for an interesting juggling effort.
Behind the Scenes of the Bloglines Datacenter Move (Part 3):
As it happened, the new datacenter was built out before the custom blog article replication code was completed and tested. This was OK, because we wanted to stress test the new datacenter machines. After configuring the new machines, we started running some test crawls against an older version of our feed database. To differentiate this test crawler from the Redwood City production crawlers, we changed the User Agent. Many people noticed a crawler with the User Agent "Bloglines/3.0-rho", and some speculated that rho were the initials of one of the engineers. Actually, rho in this case is the Greek letter. We didn't want to call it a beta, because it wasn't really, so we went down the Greek alphabet. Rho is greater than beta, you see. Yes, we're easily amused.
The replication code started to stabilize, and we began copying blog articles from the old datacenter to the new one. This happened in fits and starts as we debugged the code. The fact that it happened without us having to take the site down was a great advantage. We also continued to test the Bloglines installation at the new datacenter.
Concurrently, we started working out the datacenter move checklist, enumerating all the items that had to be completed, and at which point. The blog articles were being copied in the background, but all the other databases in the system could only be copied when we could be assured that they wouldn't be updated (i.e., they were operating in read-only mode). With Bloglines, we could "cheat" a little. By turning off the crawlers in Redwood City, we could ensure that many of the databases in the system would not be modified, while still keeping the site alive. We could then start copying these databases, and the total amount of downtime would be reduced further. So our move checklist was divided up into the following sections:
- Tasks that had to be completed before the day of the move
- Tasks to do after the crawlers were turned off
- Tasks to do after the site was taken down
- Verification steps after everything was moved to the new datacenter
- Tasks to do after the site was back up at the new datacenter
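The phased checklist above can be modeled as ordered data, which makes it easy to check that nothing runs before its phase. This is only a sketch: the five phases come from the list, but the individual tasks and which phase each belongs to are hypothetical examples, not our actual checklist.

```python
from enum import IntEnum


class Phase(IntEnum):
    BEFORE_MOVE_DAY = 1
    CRAWLERS_OFF = 2
    SITE_DOWN = 3
    VERIFY = 4
    SITE_BACK_UP = 5


# Hypothetical tasks, deliberately listed out of order.
checklist = [
    (Phase.SITE_DOWN, "copy user database"),
    (Phase.BEFORE_MOVE_DAY, "confirm article replication has caught up"),
    (Phase.VERIFY, "compare row counts between datacenters"),
    (Phase.CRAWLERS_OFF, "copy feed database"),
    (Phase.SITE_BACK_UP, "restart crawlers at the new datacenter"),
    (Phase.CRAWLERS_OFF, "copy link and guid databases"),
]


def run_order(tasks):
    """Return tasks sorted by phase, keeping listed order within a phase."""
    return sorted(tasks, key=lambda task: task[0])  # sorted() is stable


for phase, task in run_order(checklist):
    print(f"{phase.name:16} {task}")
```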
When we were reasonably confident in the blog article replication code and we had worked out a reasonably complete move checklist, we set a date for the datacenter move three weeks hence, the evening of Friday, December 16. Friday evenings are the slowest, traffic-wise. And that would give us an entire weekend to fix any issues that arose during the transfer. Seeing as the migration didn't actually happen until Monday, December 19, it's safe to assume that some issues came up during the intervening time.