“Where did CrowdStrike go wrong?” is, if anything, a bit of an over-determined question.
We can work backwards. If an update is pushed out to every computer on the network at the same time, by the time a problem is discovered, it is too late to limit the consequences. The alternative (a phased rollout) would be to push the update out to users in small groups, typically accelerating over time. If you start by updating 50 systems at a time and then immediately lose contact with each one, you might expect to catch the problem before updating the next 50 million.
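For the technically minded, the idea fits in a few lines of Python. This is only a sketch of the general technique, not a description of how CrowdStrike or anyone else ships updates: the names `phased_rollout`, `push_update` and `is_healthy`, the tenfold batch growth and the 2% failure threshold are all assumptions made up for illustration.

```python
from typing import Callable, List, Tuple


def phased_rollout(
    hosts: List[str],
    push_update: Callable[[str], None],   # hypothetical: sends the update to one host
    is_healthy: Callable[[str], bool],    # hypothetical: does the host still respond?
    first_batch: int = 50,                # size of the first group (assumed)
    growth: int = 10,                     # each group is this many times larger (assumed)
    max_failure_rate: float = 0.02,       # halt above this share of failures (assumed)
) -> Tuple[List[str], bool]:
    """Push an update in growing batches and stop when a batch looks unhealthy.

    Returns the hosts that received the update and whether the rollout was halted.
    """
    updated: List[str] = []
    batch_size = first_batch
    position = 0
    while position < len(hosts):
        batch = hosts[position:position + batch_size]
        for host in batch:
            push_update(host)
        updated.extend(batch)

        # Check the batch before touching anyone else.
        failures = sum(1 for host in batch if not is_healthy(host))
        if failures / len(batch) > max_failure_rate:
            return updated, True  # halted: the damage stops at this batch

        position += len(batch)
        batch_size *= growth  # accelerate once a batch has proved safe
    return updated, False


if __name__ == "__main__":
    fleet = [f"host-{i}" for i in range(1000)]
    # Simulate an update that knocks every machine offline.
    touched, halted = phased_rollout(fleet, lambda host: None, lambda host: False)
    print(len(touched), halted)
```

With an update that breaks every machine it touches, the loop stops after the first batch of 50 rather than reaching the whole fleet. The cost is time: each health check delays everyone who has not yet been updated.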
If you’re not going to do a phased rollout, before you push the update out to users, you should test it. Typically, the scope of pre-release testing is a contested area: there are countless possible configurations of hardware, software, and user requirements, and any testing regimen must limit itself to the ones that matter and hope that nothing slips through the cracks. Fortunately, when an update bricks 100% of the computers it’s installed on, rendering them inoperable until an onerous fix is manually applied, it’s pretty easy to conclude that it wasn’t tested enough.
If you are not going to do a phased rollout, and you are not going to test your update before you push it, then you need to make sure that it is not broken.
It was broken
In CrowdStrike’s defense, you can understand why some of this happened the way it did. The company offers a service called “endpoint protection,” but if you’ve been in the Windows ecosystem for a few years, it might be easier to think of it as an antivirus. It’s geared toward the corporate market, rather than consumers, and in addition to protecting against common malware, it attempts to prevent individual computers used by businesses from becoming footholds on the corporate network.
This doesn’t just affect computers used by large companies that need to provide a keyboard and mouse for every employee, but also any other company that has huge fleets of cheap and flexible machines. If you left home on Friday, you’ll have seen what this means: advertising displays, point-of-sale machines and self-service kiosks were all affected.
The comparison is important because in CrowdStrike’s field, speed is of the essence. The worst-case scenario – at least until last week – is a ransomware worm like WannaCry or NotPetya: malware that not only causes critical damage to affected machines but can also spread within and across corporate networks automatically. That’s why the first line of defense is working at full speed. Instead of waiting for a weekly or even monthly software update to be released, the company pushes out files daily to cover the latest threats to the systems it protects.
At the margins, even a phased rollout could cause real damage: WannaCry locked down computers across much of the NHS in the hours it was allowed to spread unchecked, before British security researcher Marcus Hutchins accidentally stopped it while trying to figure out what made it tick. In that scenario, a phased rollout could cost lives. A delay in testing could cost more.
The updates also aren’t supposed to be able to cause these kinds of problems. Rather than being new code that runs on every machine, they’re more like dictionary updates: they tell the already installed CrowdStrike software what new threats to look out for and how to recognize them.
At the most flexible end, you can think of it as, well, this article. You’re almost certainly reading these words through some kind of application, be it a web browser, a mail client, or the Guardian app. (If you’ve managed to get a deal whereby someone prints this out and hands it to you with your morning coffee, congratulations.) I haven’t done a phased rollout or full testing of the article, because it shouldn’t do anything.
Unfortunately, the update that was released on Friday did do something. The high-level technical details remain hazy, and until CrowdStrike deigns to publish a full breakdown of what it did, we’ll be left with what we’ve been told. The update, which was meant to teach the system how to detect a particular type of cyberattack that had already been observed on the network, instead “triggered a logic error that resulted in an operating system crash.”
I’ve been covering this sort of thing for over a decade, and I’m guessing the “logic error” will turn out to be one of two things: either something in one of the most complex systems humanity has ever built entered a barely comprehensible failure state, and an almost inconceivable combination of bad luck led to something catastrophic happening; or someone did something incredibly stupid.
Sometimes there are no lessons
There have been plenty of takes in the last few days:
- This is the inevitable harm caused by the concentration of power in a few companies in the technology sector.
- This is the inevitable harm of the EU barring Microsoft from limiting the power of antivirus companies to rewrite the core level of Windows.
- This is the inevitable downside of cybersecurity regulations that are more concerned with compliance than actual security.
- It wasn’t a security issue at all, as no one was attacked. It was just a bug.
None of them quite holds up. CrowdStrike, for all the disruption, does not have a huge concentration of power; it is one of the largest companies in its sector, but it is installed on only 1% of all PCs. And although Microsoft has tried insisting that the failures occurred solely because of regulation, the alternative, where third-party security companies are unable to operate on Windows, appears to be a world where the first big flaw actually affects 100% of PCs, because Microsoft has set itself up as the only line of defense.
Cybersecurity regulations actually reward companies for installing CrowdStrike, turning a complex certification process into a simple box-ticking process, but that’s probably a good thing, too. “Buy what guarantees security” is the only reasonable requirement for the vast majority of companies, and CrowdStrike did the job — except on that unfortunate occasion.
But, unfortunately or not, this was clearly a security issue. There are three objectives in the golden triangle of information security: confidentiality (are secrets secret?), integrity (is the data correct?) and availability (are the systems usable?). CrowdStrike failed to preserve availability, and that meant it failed to protect the security of its customers’ information.
In the end, the only lesson I feel comfortable drawing is that these kinds of things are going to happen more often. We’ve successfully addressed so many failure states in society that the ones that still affect us are going to be increasingly surprising, severe, and unprepared for. Similar to how drivers become overconfident with cruise control and find themselves unable to take control in the split second before an accident, we’ve managed to make catastrophic IT outages rare enough that recovering from them is a marathon task.
Hurrah?
The broader TechScape
- ‘A river of total rubbish’: Guardian Australia’s Josh Taylor unleashed Facebook and Instagram’s algorithms on a clean slate. They served up sexism and misogyny.
- Is the world’s largest search engine broken? Tom Faber asks if Google is losing its edge.
- Is this the end of the Craig Wright saga? Publishing a complete court ruling on your Twitter feed disclaiming the last decade of your career feels definitive.
- Parents have even more to worry about with AI, as the technology is outpacing efforts to detect child predators.
- And Roblox is once again the center of attention over its own failures surrounding child sexual abuse, compounded by the company’s privacy stances, critics say.