So it appears the Internet went down, or so many claimed when they were greeted with 404 errors while attempting to watch “Georgia Hillbilly Massacre 17: The Return of the Banjo Man” on Netflix (since Netflix is selective about what you can stream, they certainly weren’t queuing up the latest and greatest new releases, but that is a totally different rant) or attempting to declare themselves the Mayor of “who gives a rat’s ass where you are right now” on Foursquare.
Last time this happened, some started to claim that it rocked the very foundation of confidence in cloud computing (here), yet they failed to juxtapose Amazon’s operational failures against the universe of enterprise operational failures, security compromises and general administrative stupidity that plagues nearly 99.98% of organizations on Earth (minus the DPRK’s website; there’s really not much more you can do to fudge that one up).
This is not the first time we have experienced, nor the last time we will experience, a catastrophic failure of a third party’s cloud-computing infrastructure, managed service, or Internet-delivered application, but we do need to keep some things in perspective:
1. Most of these providers, Amazon included, have dedicated teams of 24/7 operational experts spanning the various IT domains, and their mission is ‘availability and resiliency of services’ – does your company have such 24/7 resources on staff?
2. Most of these providers, Amazon included, maintain multiple geographic locations worldwide, so one could quickly spin up a cluster of services in region B if region A becomes unavailable, or in region C if both A and B are down – does your company maintain this level of fail-over infrastructure?
3. Most of these providers, Amazon included, provide methods, software and infrastructure to better enable backup and recovery, fault tolerance and load balancing, including separate, isolated storage facilities with massive (and I mean massive) storage capacity – does your company maintain this level of storage capacity?
4. None of these providers, Amazon included, is immune to operational failures, security compromises, or administrative mistakes – they will happen. The question shouldn’t be how we can prevent all incidents from occurring; the question should be how resilient something is to an incident, and based on the failures to date, cloud computing is pretty damn resilient if you use it properly – does your company have a flawless, 100% uptime environment that has never experienced a service interruption due to operational failure, security compromise or administrative stupidity?
The real key is to understand that simply moving an application to a cloud-computing provider doesn’t mean you instantly benefit from all that the ‘cloud’ offers; you must still develop applications and a strategy that take advantage of the elasticity, resiliency, and availability of cloud computing.
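To make the “region B if region A is down, region C if both are down” idea from point 2 a little more concrete, here is a minimal sketch in Python. It is not any provider’s API: the function name `pick_region`, the region labels, and the `is_healthy` callback are all made up for illustration, and a real health check would be whatever your application actually uses to decide a region is serving traffic.

```python
from typing import Callable, Iterable, Optional


def pick_region(regions: Iterable[str],
                is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first region whose health check passes, or None if all fail.

    `regions` is an ordered preference list (hypothetical labels), and
    `is_healthy` is a caller-supplied check; both are assumptions here.
    """
    for region in regions:
        try:
            if is_healthy(region):
                return region
        except Exception:
            # A health check that blows up counts as "region unavailable".
            continue
    return None


# Hypothetical scenario: region A is down, so we fall over to region B.
down = {"region-a"}
chosen = pick_region(["region-a", "region-b", "region-c"],
                     lambda r: r not in down)
print(chosen)
```

The point of the sketch is only that fail-over is a decision your application has to make; the provider gives you the regions, not the logic.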
Now can we get back to watching the world burn…