25 Apr 2011
Amazon’s recent EBS outage has once again shown how important it is to architect your application to tolerate failure. This post is the first in a series focused on giving you tips to have Scalr handle failure tolerance for you.
Failure recovery & Failure tolerance
There are two concepts I’d like to define: failure recovery and failure tolerance. Failure recovery means a disaster can take your application down, but the application can be brought back afterwards. Failure tolerance means a disaster may disrupt service, but the application stays up.
Failure tolerance is more expensive than failure recovery, since you must have redundant servers running. If your application is important enough to you, you might prefer failure tolerance over failure recovery.
That said, how can you get Scalr to handle failure for you?
How to survive an AZ outage
Provided you follow a few best practices, it’s very easy to have Scalr make your site tolerate failure or recover from it.
Say you have a load balancer, a web server, and a database server (with an EBS volume) in availability zone A. If A goes down, we’ll spin up three similar instances in zone B, take the latest backup snapshot that we made for you of the EBS volume, and mount it on the new database server so you have your data again. Once A comes back online, you can then recover the data written between the last snapshot and the outage.
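The recovery steps above can be sketched in a few lines of Python. This is only an illustration of the logic, not Scalr’s actual code: the `Snapshot` class, the `plan_failover` function, the role names, and the snapshot IDs are all hypothetical, and the sketch only builds a plan rather than calling any cloud API.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Snapshot:
    # Hypothetical record of an EBS backup snapshot.
    snapshot_id: str
    taken_at: datetime


def latest_snapshot(snapshots):
    """Return the most recent snapshot, or None if there are none."""
    return max(snapshots, key=lambda s: s.taken_at, default=None)


def plan_failover(roles, failed_zone, zones, snapshots):
    """Plan a relaunch of every role in a healthy zone.

    The database role also gets the newest snapshot to restore from.
    """
    healthy = [z for z in zones if z != failed_zone]
    if not healthy:
        raise RuntimeError("no healthy availability zone left")
    target = healthy[0]
    snap = latest_snapshot(snapshots)
    plan = []
    for role in roles:
        step = {"role": role, "zone": target}
        if role == "database":
            step["restore_from"] = snap.snapshot_id if snap else None
        plan.append(step)
    return plan


# Example: zone A fails, everything moves to zone B.
snapshots = [
    Snapshot("snap-old", datetime(2011, 4, 20)),
    Snapshot("snap-new", datetime(2011, 4, 21)),
]
plan = plan_failover(
    ["load balancer", "web server", "database"], "A", ["A", "B"], snapshots
)
for step in plan:
    print(step)
```

Whatever was written after the newest snapshot is not in the plan; as described above, that data can only be recovered once the failed zone returns.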
How Scalr helps
What Scalr is working on to make this easier
To make this even easier, we changed the MySQL defaults: snapshots are now taken automatically every 24 hours and rotated so that 10 are kept (the oldest is discarded once an 11th is taken), and a backup runs every 12 hours. The new Scalarizr agent also lets you run MySQL instances across multiple AZs.
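As a rough illustration of the rotation rule, here is a minimal Python sketch. It assumes that rotating 10 times means keeping the 10 most recent snapshots and discarding anything older; the function name `rotate_snapshots` and the snapshot IDs are made up for the example and are not part of Scalr.

```python
from datetime import datetime, timedelta


def rotate_snapshots(snapshots, keep=10):
    """Split (snapshot_id, taken_at) pairs into (kept, discarded).

    The `keep` most recent snapshots are kept, newest first;
    everything older is returned as discarded.
    """
    ordered = sorted(snapshots, key=lambda s: s[1], reverse=True)
    return ordered[:keep], ordered[keep:]


# Example: 11 daily snapshots, so exactly one falls out of the rotation.
base = datetime(2011, 4, 1)
snaps = [("snap-%02d" % i, base + timedelta(days=i)) for i in range(11)]
kept, discarded = rotate_snapshots(snaps)
print(len(kept), [snapshot_id for snapshot_id, _ in discarded])
```

With one snapshot every 24 hours, a rotation of 10 covers roughly the last ten days.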
We also renamed “Choose Randomly” to the more descriptive “AWS-chosen”, and “Place in different zones” to “Distribute equally”.
Finally, we changed the default placement for images to be “Distribute equally”.
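To give a feel for what an equal distribution looks like, here is a small Python sketch that assigns instances to zones round-robin. Round-robin is my assumption about how such a placement could work, not a description of Scalr’s implementation, and the function name and instance names are hypothetical.

```python
from itertools import cycle


def distribute_equally(instance_names, zones):
    """Assign each instance to a zone in round-robin order."""
    if not zones:
        raise ValueError("at least one availability zone is required")
    zone_cycle = cycle(zones)
    return {name: next(zone_cycle) for name in instance_names}


# Example: four database instances spread over two zones.
placement = distribute_equally(
    ["db-1", "db-2", "db-3", "db-4"], ["us-east-1a", "us-east-1b"]
)
print(placement)
```

With this kind of placement, losing a single zone takes out at most about half of the instances in a two-zone setup, instead of all of them.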
Next in this series: How to survive a Region outage, How to survive a Cloud outage, and How to survive degraded functionality
Good stuff. This is the type of investment that the right SaaS mindset requires. No excuses: if you blame Amazon, you aren’t stepping up to win the SaaS game!
I blogged about this today, on the SaaS mindset: http://first-productmarketing.blogspot.com/2011/04/rather-than-piling-on-amazonlets-talk.html
Ken
Isn’t there a performance cost involved in putting the app server, masters, and slaves in different zones?
If so, how can we estimate the impact on performance before performing this change?
In different zones, not really. I haven’t observed increased latency between AZs. If between regions, then yes, you’ll have a performance hit.
Looking forward to how to survive a region outage ;p