Amazon Web Services S3 Outage – An Own Goal or User Wake-up Call?


The outage of the Amazon Web Services S3 service in their US-EAST-1 region at the end of February not only had a massive impact on their clients but also threw up a Twitter, blog and press storm about what it means for users to place all their data in the S3 cloud model.

These implications were further fuelled by Amazon’s own account of what went wrong, which clearly pointed to gaps in their operational processes and, by their own admission, a lack of understanding of their own infrastructure. That is a staggering admission, and perhaps one that could only be weathered by a company with the power and money of AWS. Their full explanation can be read here:

However you read it, it’s not a great endorsement of using a full public cloud model. It highlights that unless you are prepared to spend the same sort of money on resilience as you would in your own private cloud model, you need to expect some disruption to your business when these outages occur.

Cameron McKenzie of TechTarget called it a “Fukushima moment for cloud computing”, comparing the Amazon Web Services S3 failure with the massive meltdown of the Fukushima nuclear plant in Japan and the subsequent decisions by many countries to immediately announce the decommissioning of their nuclear power programmes regardless of the cost.

McKenzie goes on to say “Before the S3 outage, people invested in Amazon cloud because they were confident in both the technology used and the manner in which it was managed. With the S3 outage, what was once confidence has been replaced with faith - and faith isn't a compelling attribute when it comes to IT risk assessment”. He’s right: many users of S3 will “hope” Amazon puts this issue right, but will be crossing their fingers that there aren’t other parts of the service that have been equally poorly managed. The TechTarget article can be found here:

But before we sling some well-earned mud in Amazon’s direction, let’s also look at why so many clients were affected by a single person typing in a simple, routine instruction to carry out what was billed as basic service maintenance.

Could these users have switched their services over to a more reliable region within the AWS network? Why, of course they could. After all, one of the key reasons these companies convinced their boards and shareholders to use the Amazon Web Services S3 service is that it’s in the cloud, right? We don’t have to worry about infrastructure ourselves – it’s all taken care of. In addition, it will cost less than a private cloud or running it ourselves.

And Amazon, they’re BIG! They also have data centres split into regions across the globe, so we have access to all this infrastructure all over the world that we don’t have to buy and manage, and it’s immediately available and on demand – all we need is a credit card.

So Amazon Web Services S3 - What went wrong?

Ever heard of a silver bullet? Well, many of those who were affected, and who had no option other than to wait for Amazon to correct the problem, thought they were getting one of those.

The real issue for the companies using this service is, surprisingly, the cost. Yes, putting your data into Amazon Web Services S3 probably attracts a lower per-MB charge than a smaller MSP would offer, but unless the company puts the right architecture in place on the S3 platform to be able to switch to another region when an outage occurs, it remains exposed to exactly this kind of failure - and spinning up a second data instance with replication isn’t going to be cheap.
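
To make the “switch to another region” idea concrete, here is a minimal Python sketch of a read path that falls back to a replica region when the primary fails. It is only an illustration under stated assumptions: the reader functions, the `us-west-2` replica and the object key are hypothetical stand-ins rather than real AWS client calls, and the sketch assumes the data has already been replicated to the second region, which is exactly where the extra cost comes from.

```python
from typing import Callable, List, Sequence, Tuple


class AllRegionsFailed(Exception):
    """Raised when no configured region could serve the request."""


def read_with_failover(
    key: str,
    region_readers: Sequence[Tuple[str, Callable[[str], bytes]]],
) -> Tuple[str, bytes]:
    """Try each region in order; return (region, data) from the first that answers."""
    errors: List[str] = []
    for region, reader in region_readers:
        try:
            return region, reader(key)
        except Exception as exc:  # a real client would catch narrower error types
            errors.append(f"{region}: {exc}")
    raise AllRegionsFailed("; ".join(errors))


# Hypothetical stand-ins for per-region storage clients (not real AWS calls).
def primary_reader(key: str) -> bytes:
    raise ConnectionError("primary region unavailable")


def replica_reader(key: str) -> bytes:
    return b"contents of " + key.encode()


region, data = read_with_failover(
    "invoices/2017-02.csv",
    [("us-east-1", primary_reader), ("us-west-2", replica_reader)],
)
print(region, data)
```

The ordering of the list is the design decision here: the primary region is tried first and the replica is only touched when it fails, so the business keeps running during an outage, but only because someone has already paid to keep a second copy of the data in sync.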

I for one can’t believe that users of S3 who bemoan the service and point to outages affecting their core business haven’t assessed the risk of using a service that doesn’t have a strong SLA and whose uptime record isn’t amazing. Either they have consciously ignored the risk in favour of cost, or they are blissfully unaware of it.

McKenzie goes on to say that the future of cloud computing has changed now that users realise a full-scale, daylight-hours crash is always a possibility, and that organisations will start bringing more of their systems back into the local data centre - which is pretty radical if it’s actually happening.

He concludes by saying “The other big move will be for organisations to leverage cloud bursting technologies while making use of the cloud [when] their in-house systems approach capacity. But using the cloud exclusively will become a thing of the past. The Amazon Web Service S3 outage was a Fukushima moment for cloud computing, and it will forever taint the way organisations view the cloud”.

I for one hope not, as cloud represents a major leap forward in the way computing, and the solutions and services it offers, can be consumed.

What is clear is that deciding to move to or use cloud services doesn’t remove the responsibility of companies and their management teams to fully understand how the cloud model they want to adopt will actually work, and to assure themselves that the cost of the service being proposed to them covers an on-demand, always-available service that supports the business and its customers – if that’s what the business needs.

This focus on responsibility is something that boards can’t legitimately continue to dodge. The new General Data Protection Regulation (GDPR) legislation, which if you’re like me is constantly filling up your inbox, comes into force next year and makes it clear that no one is beyond the reach of the courts if data security and residency are not protected.

Why does this matter?

Well, the defence of “sorry M’lord, our data storage system had a big failure and we lost all the records you’re looking for” will not cut the mustard, as they say in my local, nor will the fact that you are using one of the big players, such as the Amazon Web Services S3 service, earn you any credit.

This will further complicate the cloud storage market and will certainly begin to put pressure on companies to make the right commercial decisions about what data they place in public clouds and what they retain in their own controlled private or on-premises systems. It will also make IT managers think about contingency of data access when selecting where a company’s information and systems reside, and may well result in the financial case for moving or not moving to a public cloud being assessed differently once the resilience of a platform is taken into account.

The blame game can always get people out of the line of fire in the heat of the moment, but the longer-term damage of not designing and architecting a system with the safeguards your business requires will always come home to roost in the end.

My advice: if you’re not sure, talk to an expert.
