11May/11

Using Scalr to avoid future Amazon problems: Surviving Region outages

Amazon’s recent EBS outage has once again shown how important it is to architect your application to tolerate failure. This post is the second in a series focused on giving you tips to have Scalr handle failure tolerance for you.

How to survive a Region outage

We previously saw how Scalr could help you survive an Availability Zone going down, which boiled down to sticking to defaults so Scalr can take care of it for you. We also defined the concepts of Failure Tolerance and of Failure Recovery, the former maintaining availability through failure, the latter maintaining your ability to recover from it.

Most often, downtime is caused by human errors

Surviving a Region outage is a different beast altogether, as you face larger problems:

  • Network speeds: Regions are far apart, so two things suffer. Latency: requests from your web server to a database in another Region take longer, which results in slower page load times. Throughput: your databases might not replicate data fast enough, so recent writes are not yet present on the other side, which results in your data not being consistent
  • Increased cost of maintaining two running copies of your website, especially at the low end (as your traffic grows and you have enough for each copy, this issue disappears)
  • EBS limitations: volumes are limited to Availability Zones (AZs) and snapshots to Regions, so you can’t use the same inexpensive magic to create volumes in another Region
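To get a feel for the first bullet, here is a minimal back-of-the-envelope sketch. The function names and every number in it (20 queries per page, 1 ms and 80 ms round trips, the write and link rates) are illustrative assumptions, not measurements of any AWS Region.

```python
def added_page_latency_ms(sequential_queries: int, round_trip_ms: float) -> float:
    """Extra page load time when each query waits for one full round trip."""
    return sequential_queries * round_trip_ms


def replication_backlog_mb(write_rate_mb_s: float, link_mb_s: float, seconds: float) -> float:
    """Data written on the master but not yet shipped to the replica after `seconds`."""
    # No backlog builds up while the link is at least as fast as the write rate.
    surplus = max(0.0, write_rate_mb_s - link_mb_s)
    return surplus * seconds


if __name__ == "__main__":
    # Assumed values for illustration only.
    print(added_page_latency_ms(20, 1.0))         # database in the same AZ
    print(added_page_latency_ms(20, 80.0))        # database in another Region
    print(replication_backlog_mb(5.0, 3.0, 600))  # ten minutes of writes outpacing the link
```

The point of the arithmetic: latency hurts in proportion to how many sequential queries a page makes, while throughput only hurts once your write rate exceeds what the link can carry.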

But first, a little theory…

If a Region is going to be unavailable, you’ll want to make sure you still have access to the three tenets of your infrastructure: your application, your configuration, and your data. Machine Images (images) store both configuration and application, and sometimes data, but be aware that it might have been a while since your last snapshot, and your instances might have changed a lot since. If you are using images to store application, configuration, and data, make sure you snapshot your instances after every application update and configuration change, and snapshot them regularly if they contain data (such as database data and user-contributed data, like blog post images).

Of course, there are drawbacks to this method, the first of which is the tedium of creating images for all the different Regions you want to operate in, and the second of which is the difficulty of keeping them all in sync and identical. Because of this, it is better practice (at scale) to separate the tiers, like you would separate load balancers, caching servers, web servers, and database servers: keep your application in a code repository, your configuration in Chef recipes or Puppet modules, and your data on some persistent store like S3. You can then use these to recreate your infrastructure at a moment’s notice.

You'll have to decide how to manage app, config, and data

Still following? Good! Here comes the juicy part.

Region Failure Recovery

The easiest way to survive Region outages is to keep a database slave in a separate Region (used to continuously replicate data from your main database), and make sure your images are available in both Regions. Let’s say you operate out of us-east, and keep your load balancer, web servers, etc. there. You read this article, and diligently set up a slave in us-west. Now us-east goes down, uh-oh. Well, not so uh-oh since you have your application, configuration, and data available in us-west. Edit the us-west farm that contains the slave, and add all the components that your application requires (or, if you already added them, change the max-instances value for each role to something greater than 0). Now update your DNS zone in Scalr to point to the new load balancer, and Scalr will update your A records so traffic goes to your new infrastructure.

You have successfully recovered from failure, and your site is up and running!

Disadvantages of this method? It’s fairly manual, and it can be expensive to keep a spare instance running just for backup.

Doing it on the Cheap

Could you get away with a micro instance as the slave? Depending on the write rate, applying binary logs to actual data can be pretty disk and CPU intensive, and the micro instance might start falling behind (you will see the value of Seconds_Behind_Master keep increasing). If that happens and disaster strikes, you’ll be missing some data that you can only recover manually, and only when the offline Region is available again. It is up to you to decide where you stand in the cost vs. data completeness tradeoff.
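One way to spot a slave that cannot keep up is to sample its lag over time and look at the trend, not just the latest value. The sketch below is an assumption-laden illustration: `is_falling_behind` and the 300-second threshold are made up for this example, and the samples are assumed to be integer Seconds_Behind_Master readings you collected yourself (for instance from SHOW SLAVE STATUS), oldest first.

```python
from typing import Sequence


def is_falling_behind(lag_samples: Sequence[int], max_lag_s: int = 300) -> bool:
    """Return True if the slave looks like it is losing ground on the master.

    lag_samples: successive Seconds_Behind_Master readings, oldest first.
    max_lag_s: hypothetical threshold above which lag is considered too high.
    """
    if not lag_samples:
        raise ValueError("need at least one lag sample")
    # The most recent reading is already beyond what we are willing to lose.
    if lag_samples[-1] > max_lag_s:
        return True
    # Lag that rises on every one of three or more samples suggests
    # the instance is too small for the write rate.
    if len(lag_samples) >= 3:
        return all(later > earlier for earlier, later in zip(lag_samples, lag_samples[1:]))
    return False
```

A slave whose lag bounces around near zero is fine. One whose lag climbs at every reading will be missing more and more data the longer you wait, so that is the signal to move to a bigger instance.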

Could you get away with a less manual recovery? Unfortunately, automatic region outage detection is very complicated, and automatic assessment of amplitude and duration even harder. This results in a significant chance of false positives which, combined with master-slave replication not being easily reversible, makes it safer to keep a manual process.
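To illustrate why a single failed check should not trigger anything, here is a minimal sketch of an alerting rule that only fires when every instance in a Region has been unreachable for several consecutive rounds of checks. The function name, the shape of `history`, and the default of three rounds are assumptions for illustration. This is not how Scalr detects outages, and it only notifies a human rather than failing over.

```python
from typing import Sequence


def should_alert(history: Sequence[Sequence[bool]], min_consecutive: int = 3) -> bool:
    """Decide whether to notify an operator about a possible Region outage.

    history: one entry per check round, oldest first; each entry holds one
             bool per instance, where True means the instance was reachable.
    min_consecutive: hypothetical number of all-down rounds required.
    """
    if min_consecutive < 1:
        raise ValueError("min_consecutive must be at least 1")
    if len(history) < min_consecutive:
        return False  # not enough evidence yet
    recent = history[-min_consecutive:]
    # Every recent round must have at least one probe, and none may have succeeded.
    return all(len(round_) > 0 and not any(round_) for round_ in recent)
```

Even with a rule like this, a network partition between your monitor and the Region looks identical to an outage, which is one more reason to keep the actual switchover manual.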

Region Fault Tolerance

What about Failure Tolerance? If you want to go all-out and not care about Regions going up and down, full-blown master-master replication is a good option. Create two (or more) farms that include MySQL instances (choose big fat ones, like the 32GB ones), and configure each master to replicate from the other. Then let each replicate separately to its slaves in the usual Scalr manner. Remember to use MySQL’s auto-increment settings (auto_increment_increment and auto_increment_offset) to avoid running into primary key collisions. If you have two masters, you can set the first to only create even primary keys and the second to only create odd keys.
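The even/odd trick is easier to see with a small simulation. The sketch below mimics the keys each master would hand out on a fresh table for a given `auto_increment_offset` and `auto_increment_increment`, and renders the matching my.cnf fragment. The helper names are hypothetical and nothing here talks to a real MySQL server.

```python
from itertools import islice
from typing import Iterator


def auto_increment_keys(offset: int, increment: int) -> Iterator[int]:
    """Yield offset, offset + increment, offset + 2 * increment, ..."""
    # MySQL ignores an offset larger than the increment, so reject it here.
    if not 1 <= offset <= increment:
        raise ValueError("expected 1 <= offset <= increment")
    key = offset
    while True:
        yield key
        key += increment


def my_cnf_fragment(offset: int, increment: int) -> str:
    """Render the [mysqld] settings for one master."""
    return (
        "[mysqld]\n"
        f"auto_increment_increment = {increment}\n"
        f"auto_increment_offset = {offset}\n"
    )


if __name__ == "__main__":
    # Two masters: the first generates even keys, the second odd keys.
    print(list(islice(auto_increment_keys(2, 2), 5)))
    print(list(islice(auto_increment_keys(1, 2), 5)))
    print(my_cnf_fragment(2, 2))
```

With more than two masters, set the increment to the number of masters and give each one a distinct offset from 1 up to that number. The sequences never overlap, so inserts on different masters cannot collide on an auto-increment primary key.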

Worst-case scenario

If a Region becomes unavailable and you forgot to prepare for the eventuality, you can always create a new farm and load data from the last backup made for you. When the Region comes back up you can reconcile the differences.

How Scalr helps

  • Scalr updates your DNS zone to make manual switchover painless
  • Scalr automatically backs up your MySQL data
  • Scalr auto-scales capacity to accommodate redirected traffic
  • plus everything from the previous post

What Scalr is working on to make this easier

Monitoring and alerting. Starting with the next release, we’ll allow you to set up monitors and alerts so you can get notified when Scalr adds or removes capacity for you, but also when bad things happen, for example when all instances in a Region are inaccessible (a sign of a Region outage). If we get around to it, we’ll also add some aggregate intelligence, so you can compare your infrastructure to the aggregate (is it me or is it every Scalr user?).

Different datastores. We’re adding MongoDB, restoring Memcache, and continuing to work on Cassandra to give you more options for storing and querying your data, and being able to access it despite outages.

Regular snapshotting of instances. We advise against this, as replacing instances automatically can result in lost data, but creating snapshots without replacement (backups essentially) can be useful. Looking into it.

Easier setup of replication. We’re looking into making it easier to set up slave replication on servers that are not part of the same farm, for that lone slave server in another Region.

Master-Master. This has been asked many times now, but it has a tendency to be brittle and we fear the costs of supporting it.

Farm cloning. We’re adding the ability to clone a farm so you can deploy copies of it in different Regions. These clones will be complete with data and configuration, so Dev/Test is a natural fit too.

Next in this series: How to survive a Cloud outage, and How to survive degraded functionality

2 Responses to Using Scalr to avoid future Amazon problems: Surviving Region outages

  1. Ian says:

    Thinking about taking our site to the cloud and scalr really has given me a lot to think about… Thanks guys… And yeah, those outage recommendations are really good… adding them to the instances I created

  2. Jon Zobrist says:

    Just found scalr, looks very cool. I do inter-region failover on AWS via asynchronous binlog shipping through S3 with scheduled replays in another region. The cost is dependent on how much database changes happen, but so far this is much less than the cost of running a single instance full time. There are some gotcha’s with asynchronous binlog replay, so logs have to be monitored for errors, and anyone writing the SQL code needs to know not to do things that are not async-replication friendly (like using temp tables).

    Keep up the good work, I’ve added scalr to my list of things to evaluate further.
