Shortly before 1 a.m. local time on Friday, a systems administrator at a West Coast company that handles funeral and mortuary services suddenly woke up to find his computer screen was on. When he checked his company phone, it was filled with messages about what his colleagues were calling a network problem. Their entire infrastructure was down, threatening to disrupt funerals and burials.
It soon became clear that the massive disruption had been caused by the CrowdStrike outage. The security firm accidentally caused chaos around the world on Friday and over the weekend after distributing a faulty software update to its Falcon monitoring platform, affecting airlines, hospitals, and other businesses both small and large.
The administrator, who asked to remain anonymous because he is not authorized to speak publicly about the outage, sprang into action. He ended up working nearly 20 hours a day, driving from morgue to morgue and rebooting dozens of computers in person to resolve the problem. The situation was urgent, the administrator explains, because the computers needed to be brought back online so there would be no interruptions in scheduling funeral services and in the morgue’s communication with hospitals.
“With an issue as severe as we saw with the CrowdStrike service outage, it made sense to make sure our business was up and running so these families could access services and be with their loved ones,” the systems administrator said. “People are grieving.”
CrowdStrike’s flawed update bricked some 8.5 million Windows computers worldwide, sending them into the dreaded Blue Screen of Death (BSOD) spiral. “The trust we built up in dribs and drabs over years was lost in a matter of hours, and it was a gut punch,” Shawn Henry, chief security officer at CrowdStrike, wrote on LinkedIn on Monday morning. “But this pales in comparison to the pain we have caused our customers and partners. We have let down the people we were committed to protecting.”
Cloud platform outages and other software problems (including malicious cyberattacks) have led to major IT outages and global disruptions in the past. But last week’s incident was particularly notable for two reasons. First, it was due to a bug in software meant to help and defend networks, not harm them. And second, fixing the problem required direct access to each affected machine; a person had to manually boot each computer into Windows Safe Mode and apply the fix.
IT work is often unglamorous and thankless, but the CrowdStrike disaster has been a next-level ordeal. Some IT professionals had to coordinate with remote or multi-location employees across borders, guiding them through manually rebooting devices. A junior systems administrator based in Indonesia working for a fashion brand had to figure out how to overcome language barriers to do so. “It was overwhelming,” he says.
“We don’t get noticed unless something bad happens,” a systems administrator at a healthcare organization in Maryland told WIRED.
She woke up shortly before 1:00 a.m. EDT. Screens at the organization’s physical facility had gone blue and were unresponsive. Her team spent several early morning hours getting servers back up and running, then had to manually fix more than 5,000 devices across the organization. The outage blocked phone calls to the hospital and disrupted the system that dispenses medications; everything had to be written out by hand and run to the pharmacy on foot.