Cloud Services Are Great - But The Basics Still Apply
Online service providers have forgotten basic systems management practices, with the result that brand loyalties and reputations can be destroyed overnight.
I saw a worryingly familiar post the other day: “Canon says its cloud service can’t restore users’ lost videos or full-size images. Videos and full resolution images permanently lost”.
Once again a major corporation faces significant damage to its reputation because a fairly junior member of staff was allowed to make a very silly mistake.
In this particular case a member of staff thought they were doing a standard task of clearing down data on a behind-the-scenes server. Except it wasn’t a behind-the-scenes server! It was the entire global live service on which customers stored their videos and pictures. To compound the issue the backups were, erm, absent. What about the backup of the backups? They didn’t have any of those either.
Millions of videos and pictures - gone forever and an entire customer-base fuming.
It is a scenario that I have seen time and again - whether it is Air Traffic Control systems going offline and grounding all planes in the UK and France, millions of customers locked out of their online banking accounts, Google’s Nest thermostats all stopping and leaving customers shivering, or a “routine refresh” causing 60% of the Starbucks in the US and Canada to close for the day. The cause is invariably put down to “technical issues or a software glitch”.
But let’s be clear, these mistakes are avoidable. These are not “cyber attacks”, where malicious activity overwhelms the systems; these are outages caused by the IT department itself.
The cause is not “technical issues or a software glitch”; the cause is that the companies decided not to invest in well-established risk-mitigation processes required to prevent or ameliorate mistakes.
The average IT organisation is permanently stuck in a dilemma - bring new functionality online as quickly as possible to support profitability while ensuring that what is already working is not broken in the process.
IT is also quite tribal, particularly when new technologies emerge. Along with the new technologies come new mindsets. Amongst the current crop is an idea called Continuous Integration and Continuous Delivery (CICD). If you Google CICD you get the following definition: “The adoption of CICD has changed how developers and testers ship software. CICD represents a culture and process around constantly integrating new code.”
What that means in essence is that the IT department always has a series of updates coming down the track that it has to deploy to the live environments.
When this is done well, it is great … small changes being made continually in a smooth delivery process. The problem comes when something unexpected happens. Either the software was not designed to be added to in neat small packages and actually has to be updated en masse, with unpredictable results (as is often the case with monolithic corporate systems that have been amended to face the web), or the impact of the change is not properly assessed, or a human makes a perfectly natural human mistake at 3am at the end of an eight-hour shift.
Sooner or later some poor sap is going to press <enter> and the world is going to get a lot less rosy.
But these are not new problems; they have been around for 50+ years. So why do IT departments keep making the same mistakes?
The answer is that they don’t.
The people who were running the departments 50, 30, or even 10 years ago have moved on, and with them have gone those fuddy-duddy old processes they created that slowed down deployments. To IT professionals who were around 10 or 20 years ago, the extent to which "classic" risk mitigation and system management disciplines have disappeared is shocking. One said of their department, “We have far more functionality but far less reliability”.
One of these disciplines is called Change Enablement, aka Change Management. While even the phrase is anathema to some disciples of CICD, the reason it is needed is that occasionally things don’t go the way you expected.
When the unexpected happens it is reasonable to ask the IT department to put things back the way they were, and that’s where reality butts heads with some of the new mantras. One such is that “we don’t roll-back changes, we fix-forward”. This translates to “we want all the users to suffer whatever problems the change caused until we figure out a way of fixing it”.
So why can’t they put everything back the way it was while they work out a fix?
One of the reasons is cost - you need to have an identical environment for the developers to work on while the live system is rolled back, but in my experience that’s not the biggest issue. The real issue is that because there is no paper-trail of changes, no-one knows precisely what the live environment actually is, other than the accumulated result of the last few years of changes.
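To make the paper-trail idea concrete, here is a minimal sketch in Python of an append-only change log. It is an illustration only, not a description of any particular tool: the names `ChangeRecord` and `ChangeLog`, and the example settings, are hypothetical. The point is simply that if every change is recorded, the live state can be rebuilt, or rebuilt as it was before the last change, by replaying the records.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ChangeRecord:
    """One entry in the paper trail: who changed which setting, and how."""
    author: str
    setting: str
    old_value: Optional[str]
    new_value: Optional[str]  # None means the setting was removed
    applied_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


class ChangeLog:
    """Hypothetical append-only log of changes to a live environment."""

    def __init__(self) -> None:
        self._records: List[ChangeRecord] = []
        self._state: Dict[str, str] = {}

    def apply(self, author: str, setting: str,
              new_value: Optional[str]) -> ChangeRecord:
        """Record the change first-class, then update the live state."""
        record = ChangeRecord(author, setting,
                              self._state.get(setting), new_value)
        if new_value is None:
            self._state.pop(setting, None)
        else:
            self._state[setting] = new_value
        self._records.append(record)
        return record

    def current_state(self) -> Dict[str, str]:
        return dict(self._state)

    def replay(self, upto: Optional[int] = None) -> Dict[str, str]:
        """Rebuild the state from the first `upto` records (all if None)."""
        state: Dict[str, str] = {}
        for record in self._records[:upto]:
            if record.new_value is None:
                state.pop(record.setting, None)
            else:
                state[record.setting] = record.new_value
        return state


log = ChangeLog()
log.apply("alice", "max_upload_mb", "50")
log.apply("bob", "max_upload_mb", "100")
print(log.replay(upto=1))  # the environment as it was before bob's change
```

With a log like this, "what is live right now?" and "what was live before the last change?" both have an answer that does not depend on anyone's memory.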
OK, that’s a little harsh on a lot of well-run IT departments, but in others it is the widely-known dirty little secret.
There are good arguments against imposing a formal Change Enablement process. It slows down the CICD train, not hugely, but it does. Some mistakes will always get through, and even just taking a backup before a change is made can be seen as an unnecessary delay. You know what though? That would be better than “poof, we overwrote all the data in the cloud and can’t get it back”.
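As a rough illustration of how small "take a backup before the change" can be, here is a sketch in Python that works on a local directory. Everything in it is an assumption made for the example: the function name `apply_with_backup`, the `change` and `verify` callbacks, and the use of a plain directory copy as the backup. A real service would snapshot its actual storage and would normally keep the backup for a while rather than discard it straight away.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def apply_with_backup(data_dir: Path,
                      change: Callable[[Path], None],
                      verify: Callable[[Path], bool]) -> bool:
    """Snapshot data_dir, run the change, and roll back if it goes wrong.

    Returns True if the change was kept, False if it was rolled back.
    """
    backup_root = Path(tempfile.mkdtemp(prefix="pre-change-"))
    snapshot = backup_root / "snapshot"
    shutil.copytree(data_dir, snapshot)  # the backup comes first

    try:
        change(data_dir)
        ok = bool(verify(data_dir))  # did the change do what we expected?
    except Exception:
        ok = False

    if not ok:
        # Roll back: throw away the damaged directory, restore the snapshot.
        shutil.rmtree(data_dir, ignore_errors=True)
        shutil.copytree(snapshot, data_dir)

    # Sketch only: a real process would retain the snapshot for some time.
    shutil.rmtree(backup_root, ignore_errors=True)
    return ok
```

The `verify` step matters as much as the copy: a "clear-down" that runs without raising an error but empties the wrong place would count as a success unless something checks the result.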
Many non-IT business managers are unaware of these dangers as ‘The Cloud’ has been touted as a magical place where everything is backed up automatically, can be recovered at the drop of a hat, and miraculously expands and contracts as capacity demands.
All of that is true, as long as your IT department tells it to do so and takes advantage of its capabilities. If they don’t, the Cloud is just as dumb as any other computer and if you tell it to delete everything without taking a backup copy, it will happily oblige.
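One way to stop the computer happily obliging is to make the destructive operation itself check for a backup. The Python sketch below is purely illustrative: `require_recent_backup`, `clear_storage`, `NoRecentBackupError` and the 24-hour limit are all hypothetical names and values, and the "storage" is just an in-memory dictionary standing in for a real service.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional


class NoRecentBackupError(RuntimeError):
    """Raised when a destructive operation is asked for without a fresh backup."""


def require_recent_backup(last_backup_at: Optional[datetime],
                          now: datetime,
                          max_age: timedelta = timedelta(hours=24)) -> None:
    """Refuse to continue unless a backup exists and is recent enough."""
    if last_backup_at is None:
        raise NoRecentBackupError("no backup has ever been taken")
    age = now - last_backup_at
    if age > max_age:
        raise NoRecentBackupError(
            f"latest backup is {age} old; the limit is {max_age}"
        )


def clear_storage(store: Dict[str, bytes],
                  last_backup_at: Optional[datetime],
                  now: datetime) -> None:
    """Delete everything in `store`, but only after the backup check passes."""
    require_recent_backup(last_backup_at, now)
    store.clear()


now = datetime.now(timezone.utc)
photos = {"holiday.jpg": b"..."}
try:
    clear_storage(photos, last_backup_at=None, now=now)
except NoRecentBackupError as err:
    print(f"refused: {err}")  # the photos are still there
```

The check does not make the delete safe by itself, but it turns "nobody took a backup" from a discovery made afterwards into a refusal made beforehand.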
This is why I remain fundamentally opposed to the “update without rollback” approach to managing SaaS/Cloud environments, and why I’ve introduced robust (but hopefully light) Change Enablement processes anywhere I have been in charge. Often to the annoyance of the CICD community.
Imagine if this was companies’ inventory data rather than end-users’ photographs!
Editor's note: Nick Goss has run online environments supporting 50+ million end-users and been named a Premier 100 IT Leader by Computerworld Magazine.