Netflix went down yesterday due to power problems at Amazon’s Virginia datacenter. This of course caused an outage for some of Amazon’s EC2 customers (like Netflix). This is a perfect example of why redundancy in the cloud is important.
All of my customers (some of them at EC2 Virginia) stayed up and available. How? Redundancy.
My consulting customers are on multiple clouds, and while Amazon may be the primary, data is replicated in near real time (as close to real time as possible), and the switchover happened immediately (and automatically).
None of the mail, web, or VoIP servers experienced any downtime. I had set them up on my backup systems (I have 2), and they’re all connected and aware of each other. In the event the backups can’t sense the primary, DNS servers are reset with the new data (automatically) and Backup System #2 kicks in. While DNS data propagates (which is why I have VERY SHORT expiry times), the clients are automatically rerouted to the secondary.
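To make the idea concrete, here is a minimal Python sketch of that kind of failover logic. It is an illustration under assumptions, not the actual setup described here: the `FailoverMonitor` class, the `check_primary` and `update_dns` hooks, the three-miss threshold, and the addresses are all hypothetical stand-ins.

```python
from typing import Callable


class FailoverMonitor:
    """Switch to a backup after the primary misses several consecutive checks."""

    def __init__(
        self,
        check_primary: Callable[[], bool],   # hypothetical probe: True if the primary answers
        update_dns: Callable[[str], None],   # hypothetical hook: repoint DNS records to an address
        primary_addr: str,
        backup_addr: str,
        failures_needed: int = 3,            # assumed threshold, so one lost probe does not trigger a switch
    ) -> None:
        self.check_primary = check_primary
        self.update_dns = update_dns
        self.primary_addr = primary_addr
        self.backup_addr = backup_addr
        self.failures_needed = failures_needed
        self.active = primary_addr
        self.misses = 0

    def tick(self) -> str:
        """Run one health check and return the address that should serve traffic."""
        if self.check_primary():
            self.misses = 0
            if self.active != self.primary_addr:
                # The primary is reachable again: fail back and repoint DNS.
                self.active = self.primary_addr
                self.update_dns(self.primary_addr)
        else:
            self.misses += 1
            if self.misses >= self.failures_needed and self.active != self.backup_addr:
                # The primary has been silent long enough: promote the backup.
                self.active = self.backup_addr
                self.update_dns(self.backup_addr)
        return self.active
```

In a real deployment `tick()` would run on a timer, and `update_dns` would call whatever API the DNS provider offers. Short expiry times on the records are what keep clients from holding on to the old address for long.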
This is done via a cluster of gateway proxies I have. Completely transparent. While some VoIP calls might have been disconnected, a redial would have resolved the situation. Upon analysis (I happened to be watching the whole thing go down via a series of SMS alerts sent to me automatically by my systems), each client was switched over to the redundant systems in less than a minute. I then received a text message stating that all systems that had their primary at EC2 Virginia were now running on backup 2.
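The rerouting and alerting could look roughly like the Python sketch below. Everything in it is an assumption made for illustration: the `pick_upstream` and `reroute` functions, the per-client priority lists, the upstream names, and the `send_alert` callback (which stands in for an SMS gateway) are hypothetical and not taken from the real proxies.

```python
from typing import Callable, Dict, List, Optional


def pick_upstream(upstreams: List[str], is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first healthy upstream in priority order, or None if all are down."""
    for name in upstreams:
        if is_healthy(name):
            return name
    return None


def reroute(
    priorities: Dict[str, List[str]],    # client -> upstreams, most preferred first
    previous: Dict[str, Optional[str]],  # client -> upstream it was using before this pass
    is_healthy: Callable[[str], bool],   # hypothetical health probe
    send_alert: Callable[[str], None],   # hypothetical alert hook, e.g. an SMS gateway
) -> Dict[str, Optional[str]]:
    """Assign every client to its best healthy upstream and alert once if anything moved."""
    current = {
        client: pick_upstream(upstreams, is_healthy)
        for client, upstreams in priorities.items()
    }
    moved = sorted(c for c in current if current[c] != previous.get(c))
    if moved:
        # A client with no healthy upstream shows up as "-> None" in the alert.
        send_alert("rerouted: " + ", ".join(f"{c} -> {current[c]}" for c in moved))
    return current
```

Only the clients whose primary is unhealthy get moved, so systems hosted elsewhere stay where they are and the alert lists just the ones that switched.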
A few hours later, I received SMS messages showing me that the systems were back on primary (back in Virginia), and of course all DB-based systems were back in sync within a few minutes. (I love fiber.)
This is why it’s so important for architecture to be done right. How many keystrokes did I actually need to type to make this all happen?
Anyone interested in my architecture skills yet?
Oh, by the way... I was enjoying a few episodes of Earth: Final Conflict from my bed during the whole thing. 🙂 [Definitely worth watching... It’s a trip of a show.]
Not even one of my clients called me, because they didn’t notice any downtime. 🙂