A client of ours wants to host their video subscription site on Amazon’s cloud services. Last week Amazon had an outage. Egads, this was not supposed to happen. A point of view by Scott Gilbertson:
Amazon Autopsy Reveals Causes of Cloud Death
Amazon is also promising to improve its communication with customers when things go wrong, but as we pointed out earlier, the real problem is not necessarily Amazon. While Amazon’s services unquestionably failed, those sites that had a true distributed system in place (e.g. Netflix, SmugMug, SimpleGeo) were not affected.
In the end it depends how you were using EC2. If you were simply using it as a scalable web hosting service, your site went down. If you were using EC2 as a platform to build your own cloud architecture, then your services did not go down. The latter is a very complex thing to do, and it’s telling that the sites that survived unaffected were all large companies with entire engineering teams dedicated to creating reliable EC2-based systems.
That may be the real lesson of Amazon’s failure — EC2 is no substitute for quality engineers.
Amazon has offered its promised apology. It’s published its post-mortem on the recent outage of its AWS EC2 (Amazon Web Services Elastic Compute Cloud) and RDS (Relational Database Service). It says what went wrong and how it’s planning to avoid such problems in the future.