May 11, 2011
Amazon’s recent EBS outage has once again shown how important it is to architect your application to tolerate failure. This post is the second in a series focused on giving you tips to have Scalr handle failure tolerance for you.
How to survive a Region outage
We previously saw how Scalr could help you survive an Availability Zone going down, which boiled down to sticking to defaults so Scalr can take care of it for you. We also defined the concepts of Failure Tolerance and of Failure Recovery, the former maintaining availability through failure, the latter maintaining your ability to recover from it.
Surviving a Region outage is a different beast altogether, as you face larger problems.
But first, a little theory…
If a Region is going to be unavailable, you’ll want to make sure you still have access to the three tenets of your infrastructure: your application, your configuration, and your data. Machine Images (images) store both configuration and application, and sometimes data, but you need to be aware that it might have been a while since your last snapshot, and your instances might have changed a lot since. If you are using images to store application, configuration, and data, make sure you snapshot your instances after every application update and configuration change, and snapshot them regularly if they contain data (such as database data and user-contributed data, like blog post images).
Of course, there are drawbacks to this method, the first of which is the tedium of creating images for all the different Regions you want to operate in, and the second of which is the difficulty of keeping them all in sync and identical. Because of this, it is better practice (at scale) to separate the tiers, like you would separate load balancers, caching servers, web servers, and database servers: keep your application in a code repository, your configuration in Chef recipes or Puppet modules, and your data on some persistent store like S3. You can then use these to recreate your infrastructure at a moment’s notice.
Still following? Good! Here comes the juicy part.
Region Failure Recovery
The easiest way to survive Region outages is to keep a database slave in a separate Region (used to continuously replicate data from your main database), and make sure your images are available in both Regions. Let’s say you operate out of us-east, and keep your load balancer, web servers, etc. there. You read this article, and diligently set up a slave in us-west. Now us-east goes down, uh-oh. Well, not so uh-oh since you have your application, configuration, and data available in us-west. Edit the us-west farm that contains the slave, and add all the components that your application requires (or change the max-instances value to something >0 for each role if you did so already). Now update your DNS zone in Scalr to point to the new load balancer, and Scalr will update your A records so traffic goes to your new infrastructure.
You have successfully recovered from failure, and your site is up and running!
Disadvantages of this method? It’s fairly manual, and it can be expensive to keep a spare instance running just for backup.
Doing it on the Cheap
Could you get away with a micro instance as the slave? Depending on the write rate, applying binary logs to actual data can be pretty disk and CPU intensive, and the micro instance might start falling behind (you will see the value of Seconds_Behind_Master increase). If that happens and disaster strikes, you’ll be missing some data that you can only recover manually, and only when the offline Region is available again. Up to you to decide where you stand in the cost vs data-completeness tradeoff.
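To make the "falling behind" check concrete, here is a minimal sketch in Python. It assumes you already have one row of SHOW SLAVE STATUS output as a dictionary (how you fetch it is up to you); the 300-second threshold and the function names are hypothetical, not something Scalr provides.

```python
from typing import Mapping, Optional

# Hypothetical threshold: seconds of lag tolerated before a human should look.
MAX_LAG_SECONDS = 300


def slave_lag_seconds(status: Mapping[str, Optional[str]]) -> Optional[int]:
    """Return the lag from one SHOW SLAVE STATUS row, or None if unknown."""
    raw = status.get("Seconds_Behind_Master")
    # MySQL reports NULL here when replication is not running,
    # so a missing value means "unknown", not "zero lag".
    if raw is None:
        return None
    return int(raw)


def slave_is_healthy(status: Mapping[str, Optional[str]],
                     max_lag: int = MAX_LAG_SECONDS) -> bool:
    """True only if the lag is known and within the tolerated threshold."""
    lag = slave_lag_seconds(status)
    return lag is not None and lag <= max_lag
```

Treating an unknown lag as unhealthy is a deliberate choice in this sketch: a slave that has stopped replicating is exactly the case you want to hear about before disaster strikes.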
Could you get away with a less manual recovery? Unfortunately, automatic region outage detection is very complicated, and automatic assessment of amplitude and duration even harder. This results in a significant chance of false positives which, combined with master-slave replication not being easily reversible, makes it safer to keep a manual process.
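One common way to reduce false positives, sketched below in Python, is to suspect an outage only after several consecutive checks in which every instance in the Region was unreachable. This is an illustration under assumptions, not how Scalr detects outages: the class name, the threshold of three checks, and the list of reachability results are all hypothetical. In keeping with the advice above, it only raises a suspicion for a human to act on and does not trigger any failover.

```python
from collections import deque
from typing import Deque, Iterable


class RegionOutageDetector:
    """Flags a possible Region outage after N consecutive all-down checks."""

    def __init__(self, required_consecutive_failures: int = 3) -> None:
        self.required = required_consecutive_failures
        # Only the most recent N observations are kept.
        self.history: Deque[bool] = deque(maxlen=required_consecutive_failures)

    def record_check(self, instance_reachable: Iterable[bool]) -> None:
        """Record one round of checks, one boolean per instance in the Region."""
        results = list(instance_reachable)
        # A round counts as "all down" only if there were instances to check
        # and none of them responded.
        self.history.append(bool(results) and not any(results))

    def suspect_outage(self) -> bool:
        """True once the last N rounds were all "all down"."""
        return len(self.history) == self.required and all(self.history)


detector = RegionOutageDetector(required_consecutive_failures=3)
for _ in range(3):
    detector.record_check([False, False, False])
print(detector.suspect_outage())
```

A single instance answering in any round resets the suspicion, which helps separate a Region-wide problem from a few unhealthy machines. It still cannot tell you how long the outage will last.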
Region Fault Tolerance
What about Failure Tolerance? If you want to go all-out and not care about Regions going up and down, full-blown master-master replication is a good option. Create two (or more) farms that include MySQL instances (choose big fat ones, like the 32GB ones), and configure replication between their masters. Then let each master replicate separately to its slaves in the usual Scalr manner. Remember to use MySQL’s key offsets (the auto_increment_increment and auto_increment_offset settings) to avoid running into primary key collisions. If you have two masters, you can set the first to only create even primary keys and the second to only create odd keys.
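The even/odd key scheme can be illustrated with a small Python model. It only mimics how keys would be handed out on a fresh, empty table when each master uses a different offset and an increment equal to the number of masters. The helper names are made up for this sketch, and the real behavior is controlled by MySQL's auto_increment_increment and auto_increment_offset settings, not by this code.

```python
from itertools import count, islice
from typing import Iterator, List


def auto_increment_keys(offset: int, increment: int) -> Iterator[int]:
    """Yield offset, offset + increment, offset + 2 * increment, ..."""
    return count(start=offset, step=increment)


def first_keys(offset: int, increment: int, n: int) -> List[int]:
    """Return the first n keys a master with these settings would generate."""
    return list(islice(auto_increment_keys(offset, increment), n))


NUM_MASTERS = 2  # the increment equals the number of masters

master_a = first_keys(offset=2, increment=NUM_MASTERS, n=5)  # even keys
master_b = first_keys(offset=1, increment=NUM_MASTERS, n=5)  # odd keys

print(master_a)
print(master_b)
```

Because the two sequences never overlap, rows inserted on either master can replicate to the other without primary key collisions. With three masters you would use an increment of 3 and offsets 1, 2, and 3.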
Worst case scenario
If a Region becomes unavailable and you forgot to prepare for the eventuality, you can always create a new farm and load data from the last backup made for you. When the Region comes back up you can reconcile the differences.
How Scalr helps
What Scalr is working on to make this easier
Monitoring and alerting. Starting with the next release, we’ll allow you to set up monitors and alerts so you can get notified when Scalr adds or removes capacity for you, but also when bad things happen, like when all instances in a Region are inaccessible (a sign of a Region outage). If we get around to it, we’ll also add some aggregate intelligence, so you can compare your infrastructure to the aggregate (is it me or is it every Scalr user?).
Different datastores. We’re adding MongoDB, restoring Memcache, and continuing to work on Cassandra to give you more options for storing and querying your data, and being able to access it despite outages.
Regular snapshotting of instances. We advise against this, as replacing instances automatically can result in lost data, but creating snapshots without replacement (backups essentially) can be useful. Looking into it.
Easier set up of replication. We’re looking into making it easier to set up slave replication on servers that are not part of the same farm, for that lone slave server in another Region.
Master-Master. This has been asked many times now, but it has a tendency to be brittle and we fear the costs of supporting it.
Farm cloning. We’re adding the ability to clone a farm so you can deploy copies of it in different Regions. These clones will be complete with data and configuration, so Dev/Test is a natural fit too.
Next in this series: How to survive a Cloud outage, and How to survive degraded functionality
Thinking about taking our site to the cloud and scalr really has given me a lot to think about… Thanks guys… And yeah, those outage recommendations are really good… adding them to the instances I created
Just found scalr, looks very cool. I do inter-region failover on AWS via asynchronous binlog shipping through S3 with scheduled replays in another region. The cost depends on how many database changes happen, but so far this is much less than the cost of running a single instance full time. There are some gotchas with asynchronous binlog replay, so logs have to be monitored for errors, and anyone writing the SQL code needs to know not to do things that are not async-replication friendly (like using temp tables).
Keep up the good work, I’ve added scalr to my list of things to evaluate further.