The Psychology of the AWS Outage
Unless you've been living on another planet, you're certainly aware that over the past couple of hours Amazon's AWS S3 service has experienced a serious outage, which has affected thousands of sites and services around the world. For reasons I will explain in this post, the coverage of this outage has been blown completely out of proportion. So, what's the difference between the perceived risk associated with the AWS outage and its actual risk?
First, the IT operations reliability that most organizations can achieve is far lower than what Amazon is delivering, even taking into account today's incident. Two hours of downtime per year translate to 99.98% availability, and this is without taking into account that AWS has not had similar large-scale incidents in recent years. Amazon is claiming up to 99.99% availability, which means it is not designed to surpass that figure. Those claiming they could do better are deluding themselves and probably also misleading their employers or clients.
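The arithmetic behind these percentages is easy to check. The following is a minimal sketch, assuming a 365-day year and treating a single two-hour incident as the only downtime in that year; the function names are mine, chosen for illustration, and this is not how Amazon measures or reports availability.

```python
# Assumption: a non-leap year of 365 days.
HOURS_PER_YEAR = 365 * 24  # 8760


def availability_percent(downtime_hours: float,
                         period_hours: float = HOURS_PER_YEAR) -> float:
    """Percentage of the period during which the service was up."""
    return 100.0 * (1.0 - downtime_hours / period_hours)


def allowed_downtime_minutes(availability: float,
                             period_hours: float = HOURS_PER_YEAR) -> float:
    """Downtime budget, in minutes, implied by an availability percentage."""
    return (1.0 - availability / 100.0) * period_hours * 60.0


# Two hours of downtime in a year.
print(f"{availability_percent(2):.2f}%")  # prints 99.98%

# Yearly downtime budget implied by a 99.99% figure.
print(f"{allowed_downtime_minutes(99.99):.1f} minutes")  # prints 52.6 minutes
```

In other words, under these assumptions a 99.99% figure leaves room for less than an hour of downtime per year, whereas a two-hour incident in an otherwise clean year still lands at about 99.98%.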
Second, modern computing environments are incredibly complex hardware and software stacks. Keeping a production environment running reliably requires many dedicated, well-paid specialists, ranging from network experts to kernel gurus to C++ wizards. It also requires top-notch vendor relations that can give you access to real engineers rather than help-desk psychotherapists, on a 24-by-7 schedule. Satisfying this requirement is becoming a tall (expensive) order, even for large organizations.
So why is the AWS outage a big deal, whereas when your bank's data center experiences a planned or unplanned night-long downtime this is less of an issue? To understand this, look at how Bruce Schneier explains the difference between the perception of risk and its reality in his book Beyond Fear and in his 2008 essay The Psychology of Security.
- "People exaggerate spectacular but rare risks and downplay common risks." The AWS outage is spectacular and rare, whereas other IT outages are certainly more common; all of us can relate many IT problems we've experienced over the past year.
- "Personified risks are perceived to be greater than anonymous risks." Again, people all over the internet are reporting they can't dim their bedroom lights or access the Strava exercise monitor app. These are a lot more personified problems than a random organization's networking outage.
- "People underestimate risks they willingly take and overestimate risks in situations they can't control." Need I say more? When we manage our own iron we underestimate the risks, but when we hand over our IT infrastructure to a cloud service, which we can't control, we overestimate them.
- "People overestimate risks that are being talked about and remain an object of public scrutiny." Amazon S3 is currently trending on Twitter with 36.2K Tweets. So this certainly fits the bill.
Consequently, if you're wondering why the AWS outage is such big news, the answer is simple: human psychology.