Well, it would seem that Sandy did a major job on many services. I don’t think I need to list who was down; it’s a long list. Needless to say, this all had to do with poor planning and testing of Fail-Over / High-Availability solutions.
The irony of the whole thing is this blog and all of my clients stayed up. Why? Strong Architecture, high-availability, and tons of support from GOOD Vendors.
So let’s go through the whole scenario, shall we?
1. Tons of DNS servers. A very important thing to look for. They should be distributed all over the world.
2. Backup Machines in multiple Zones synchronizing from the master. (I run Multi-Master DBs.)
3. Proxies for High-Availability. I like HA-Proxy, but there are plenty of others out there.
4. Great Monitoring. I like Zabbix, but Cacti, Nagios, and any well set-up and tuned monitoring system will do.
5. A good sys-admin. Reality? I do my own boxes and have a backup person to manage in case I’m knocked out (or get hit by a falling piano).
6. Redundant Voice Infrastructure. (Not many people think about this, but it is incredibly important.)
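To make items 3 and 4 a little more concrete, here is a minimal sketch of the core idea behind a health-checked proxy: probe each backend and send traffic to the first one that answers. This is not HA-Proxy's implementation or my production config; the names `tcp_check` and `pick_backend` and the example addresses are hypothetical, made up purely for illustration.

```python
import socket
from typing import Callable, Iterable, Optional, Tuple

Backend = Tuple[str, int]  # (host, port); hypothetical type alias for this sketch


def tcp_check(backend: Backend, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the backend succeeds within the timeout."""
    try:
        with socket.create_connection(backend, timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def pick_backend(
    backends: Iterable[Backend],
    check: Callable[[Backend], bool] = tcp_check,
) -> Optional[Backend]:
    """Return the first backend that passes the health check, or None if all fail."""
    for backend in backends:
        if check(backend):
            return backend
    return None
```

Listing backends in priority order (primary zone first, then the mirrors) means the primary is used whenever it is healthy. The `check` parameter is injectable so the selection logic can be exercised without touching the network.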
So how did it all go down for me? ZERO downtime. Not one service outage. My clients? They all called me in a tizzy (well, not all, but a few..)
The calls went something along the lines of, “The Hurricane’s coming.. The sky is falling..”
I replied, “And your point is? I’m your Sys-Admin and Architect.”
They responded, “Oh my god, what contingencies do you have?”
I responded, “Uhm.. you’re up. We’re highly available. Backups were performed every hour. You’re mirrored in multiple locations in the US (or the world, depending on the customer).”
Ultimately, did my clients go down? Nope.. not one. How many had to fail over? Out of ~30 clients small and large, just 1 had a hiccup: the changeover took 5 minutes (don’t blame me.. blame DNS). Not exactly a major issue. It was a healthcare site that didn’t want to pay for more redundancy, so I just threw them into my redundancy pool anyway. Why? Because (yes.. I’m starting a sentence with “Because”.. shoot me) I had the foresight to not want to deal with the screaming when / if their site went down. They were glad I did it. Guess what? They’re now paying for the redundancy.
Did I charge them for the remote switchover? No. Why? It’s the sign of a good consultant: cover your customer’s back first and never take advantage of a customer.
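About that 5-minute changeover: when fail-over is done by repointing a DNS record, resolvers can keep handing out the old address until the record's TTL expires, so the worst case is roughly detection time plus TTL. Here is a back-of-the-envelope sketch. The check interval, failure threshold, and 300-second TTL are illustrative assumptions, not values from the setup described here.

```python
def worst_case_changeover_seconds(
    check_interval_s: int,
    failed_checks_needed: int,
    dns_ttl_s: int,
) -> int:
    """Rough upper bound on how long clients may keep hitting a dead address."""
    # Time for monitoring to declare the primary down.
    detection = check_interval_s * failed_checks_needed
    # After the record is repointed, cached answers can live for one full TTL.
    return detection + dns_ttl_s


# Assumed numbers: a check every 10 s, 3 failures to trip, TTL of 300 s.
print(worst_case_changeover_seconds(10, 3, 300))  # prints 330
```

With those assumed numbers the TTL dominates, which is why a lower TTL on records you expect to fail over is usually the cheapest way to shrink the window.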
Ultimately, nobody had any issues. All was well. The world is a happy place.
How much time did I actually spend babysitting? Less than an hour. Why? Good reporting and alerting, with automated fail-over and fail-back.
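For the curious, automated fail-over and fail-back can be sketched as a tiny state machine: switch to the standby only after several consecutive failed checks, and switch back only after several consecutive good ones, so one flaky probe doesn't cause flapping. This is a simplified illustration, not my actual tooling. The class name `FailoverController` and the `fall` / `rise` thresholds are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class FailoverController:
    fall: int = 3             # consecutive failed checks before failing over
    rise: int = 2             # consecutive passed checks before failing back
    active: str = "primary"   # which side currently receives traffic
    _fails: int = 0
    _passes: int = 0

    def observe(self, primary_healthy: bool) -> str:
        """Record one health-check result for the primary; return the active side."""
        if primary_healthy:
            self._passes += 1
            self._fails = 0
            if self.active == "standby" and self._passes >= self.rise:
                self.active = "primary"   # fail-back
        else:
            self._fails += 1
            self._passes = 0
            if self.active == "primary" and self._fails >= self.fall:
                self.active = "standby"   # fail-over
        return self.active


controller = FailoverController(fall=2, rise=2)
for result in (False, False, True, True):
    print(controller.observe(result))
# prints: primary, standby, standby, primary (one per line)
```

Hook the transitions up to alerting and the babysitting shrinks to reading notifications rather than flipping switches by hand.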
To my clients: I’d love a bottle of Louis XIII.