25 Apr 2011

Using Scalr to avoid future Amazon problems: Surviving AZ outages

Amazon’s recent EBS outage has once again shown how important it is to architect your application to tolerate failure. This post is the first in a series focused on giving you tips to have Scalr handle failure tolerance for you.

Failure recovery & Failure tolerance

There are two concepts I’d like to define: failure recovery and failure tolerance. Failure recovery means a disaster can take your application down, but the application can be brought back up afterwards. Failure tolerance means a disaster may disrupt service, but it does not take the application down.

Failure tolerance is more expensive than failure recovery, since you must have redundant servers running. If your application is important enough to you, you might prefer failure tolerance over failure recovery.

That said, how can you get Scalr to handle failure for you?

How to survive an AZ outage

Provided you follow a few best practices, it’s very easy to have Scalr make your site tolerate failure or recover from it.

  • First, make sure your user-uploaded content (images uploaded for a blog post, PDFs attached to a wiki) is stored on persistent storage like S3 or Cloud Files. Storage on instances is ephemeral, so this is not only a best practice but the only working practice. As an alternative, you can rsync these files between servers, or use software like Gluster.
  • Second, leave Scalr’s placement at the default “AWS-chosen”, or choose “Distribute equally”. With either choice, Scalr will be able to launch instances in another AZ for you should one or more AZs fail. If you set your load balancer and application / web servers this way, you’ll continue serving pages through a failure.
  • The same applies to your database: select “AWS-chosen” or “Distribute equally”. With either, you’ll have MySQL slave servers in AZs other than the one your master is in, so if the AZ that contains your master goes down, we’ll be able to promote one of the running slaves to become the new master.
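
To make the idea behind “Distribute equally” concrete, here is a minimal Python sketch of spreading instances across zones and re-placing them when a zone fails. It is an illustration only, not Scalr’s implementation: the function names, the zone names, and the round-robin and least-loaded rules are all assumptions made for the example.

```python
from collections import Counter
from itertools import cycle, islice


def distribute_equally(instance_count, zones):
    """Assign instances to zones round-robin, so zone counts differ by at most one."""
    if not zones:
        raise ValueError("at least one availability zone is required")
    return list(islice(cycle(zones), instance_count))


def replace_failed(placement, zones, failed_zone):
    """Re-place instances from a failed zone into the least-loaded healthy zones."""
    healthy = [z for z in zones if z != failed_zone]
    if not healthy:
        raise RuntimeError("no healthy availability zone left")
    # Count how many instances each surviving zone already hosts.
    load = Counter(z for z in placement if z != failed_zone)
    result = []
    for zone in placement:
        if zone == failed_zone:
            # Ties are broken by the order of the zone list.
            zone = min(healthy, key=lambda z: (load[z], healthy.index(z)))
            load[zone] += 1
        result.append(zone)
    return result


# Hypothetical zone names, used only for illustration.
ZONES = ["us-east-1a", "us-east-1b", "us-east-1c"]

placement = distribute_equally(5, ZONES)
print(placement)
print(replace_failed(placement, ZONES, "us-east-1a"))
```

The point of the sketch is that no single zone holds all instances of a role, so losing one zone removes only part of the capacity, and the lost part can be relaunched elsewhere.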

Say you have a load balancer, a web server, and a database server (with an EBS volume) in availability zone A. If A goes down, we’ll spin up three similar instances in zone B, create a volume from the latest backup snapshot that we made for you of the EBS volume, and mount it on the new database server so you have your data again. Once A comes back online, you can recover the data written between that backup and the outage.
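
The recovery scenario above leaves a gap: anything written after the latest snapshot and before the outage is missing until zone A returns. The Python sketch below picks the snapshot to restore from and computes that gap. The snapshot records, field names, and timestamps are hypothetical and only serve the example; this is not how Scalr stores or selects snapshots.

```python
from datetime import datetime


def latest_snapshot(snapshots, outage_time):
    """Return the most recent snapshot completed at or before the outage."""
    usable = [s for s in snapshots if s["completed_at"] <= outage_time]
    if not usable:
        raise LookupError("no snapshot available before the outage")
    return max(usable, key=lambda s: s["completed_at"])


def data_gap(snapshot, outage_time):
    """Time window of writes that the restored volume will be missing."""
    return outage_time - snapshot["completed_at"]


# Hypothetical snapshot history and outage time.
snapshots = [
    {"id": "snap-1", "completed_at": datetime(2011, 4, 20, 3, 0)},
    {"id": "snap-2", "completed_at": datetime(2011, 4, 21, 3, 0)},
]
outage = datetime(2011, 4, 21, 8, 0)

chosen = latest_snapshot(snapshots, outage)
print(chosen["id"], "is restored; writes from the last", data_gap(chosen, outage), "are missing")
```

The more often snapshots are taken, the smaller this window gets, which is the trade-off to weigh against snapshot cost.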

How Scalr helps

  • Scalr automatically creates volumes from recent snapshots, and mounts them on your database
  • Scalr automatically promotes slave databases to masters
  • Scalr automatically updates the database endpoints so your application doesn’t read/write data to a dead IP
  • Scalr automatically launches instances in other AZs to handle the increased traffic on the remaining instances
  • Scalr automatically updates the load balancer to stop forwarding traffic to the dead web servers
  • Scalr automatically distributes your instances across AZs
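
As a rough illustration of two of the steps listed above (promoting a slave and dropping dead web servers from the load balancer), here is a small Python sketch. Everything in it is an assumption for the sake of the example: the `DbServer` type, the field names, and in particular the rule of promoting the slave with the least replication lag, which the post does not specify.

```python
from dataclasses import dataclass


@dataclass
class DbServer:
    name: str
    zone: str
    role: str             # "master", "slave", or "failed"
    lag_seconds: int = 0  # replication lag behind the master


def promote_slave(servers, failed_zone):
    """Promote the healthy slave with the least replication lag (assumed rule)."""
    candidates = [s for s in servers if s.role == "slave" and s.zone != failed_zone]
    if not candidates:
        raise RuntimeError("no slave available outside the failed zone")
    new_master = min(candidates, key=lambda s: s.lag_seconds)
    for server in servers:
        if server.role == "master" and server.zone == failed_zone:
            server.role = "failed"
    new_master.role = "master"
    return new_master


def healthy_backends(backends, failed_zone):
    """Keep only the web servers the load balancer should still forward to."""
    return [b for b in backends if b["zone"] != failed_zone]


# Hypothetical farm: one master in zone "a", slaves in zones "b" and "c".
servers = [
    DbServer("db1", "a", "master"),
    DbServer("db2", "b", "slave", lag_seconds=12),
    DbServer("db3", "c", "slave", lag_seconds=3),
]
backends = [{"name": "web1", "zone": "a"}, {"name": "web2", "zone": "b"}]

endpoint = promote_slave(servers, failed_zone="a").name  # application now points here
print(endpoint, [b["name"] for b in healthy_backends(backends, "a")])
```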

What Scalr is working on to make this easier

To make this even easier, we changed the MySQL defaults: a snapshot is now taken automatically every 24 hours, 10 snapshots are kept in rotation (the oldest is discarded once an 11th is taken), and a backup runs every 12 hours. The new Scalarizr agent also lets you run MySQL instances across multiple AZs.
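
The rotation rule can be sketched in a few lines of Python. This reads “rotate them 10 times” as “keep the 10 most recent snapshots and discard the oldest once an 11th arrives”, which is an interpretation rather than a documented behavior; the function name and dates are hypothetical.

```python
from datetime import datetime, timedelta


def rotate_snapshots(existing, new_snapshot, keep=10):
    """Add a snapshot and return (kept, discarded), keeping the newest `keep`."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    ordered = sorted(existing + [new_snapshot])
    return ordered[-keep:], ordered[:-keep]


# Ten daily snapshots already exist; the eleventh pushes out the oldest.
start = datetime(2011, 4, 1)
existing = [start + timedelta(days=i) for i in range(10)]
kept, discarded = rotate_snapshots(existing, start + timedelta(days=10))
print(len(kept), "kept;", len(discarded), "discarded")
```

With a daily snapshot and 10 slots, this keeps roughly ten days of restore points at any time.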

We also renamed “Choose Randomly” to the more descriptive “AWS-chosen”, and “Place in different zones” to “Distribute equally”.

Finally, we changed the default placement for images to be “Distribute equally”.

Next in this series: How to survive a Region outage, How to survive a Cloud outage, and How to survive degraded functionality

4 Responses to Using Scalr to avoid future Amazon problems: Surviving AZ outages

  1. ken says:

    Good stuff. This is the type of investment that the right SaaS mindset requires. No excuses, if you blame Amazon, you aren’t stepping up to win the SaaS game!!

    I blogged related to this today…! http://first-productmarketing.blogspot.com/2011/04/rather-than-piling-on-amazonlets-talk.html on the SaaS Mindset!

    Ken

  2. Nir says:

    Isn’t there a performance cost involved in putting the app server, masters, slaves in different zones?

    If so, how can we estimate the impact on performance before performing this change?

    • Sebastian says:

      In different zones, not really. I haven’t observed increased latency between AZs. If between regions, then yes, you’ll have a performance hit.

  3. Alex Chang says:

    Looking forward to how to survive a region outage ;p
