Human factors of security and cybersecurity professor at University of Waterloo reflects on CrowdStrike global tech outage

Wednesday, July 24, 2024

Written by Dr. Kami Vaniea and Regina Ashna Singh


The Blue Screen of Death, a looming nightmare for many, came to fruition last week during a global tech outage set off by one of the largest cybersecurity companies in the world.

According to , "the reason for the outage is a single software update originating from cybersecurity firm CrowdStrike. The faulty update has caused some computers running Windows to experience the Blue Screen of Death. In other words, instead of booting up as normal, affected computers are crashing. The update did not impact computers running Mac or Linux."

However, the University of Waterloo, thankfully, was unaffected. Kami Vaniea, professor in the Department of Electrical and Computer Engineering, says the shutdown "hit the airlines, it hit the railways... it hit large companies such as hospitals and other big groups that care about security and therefore invested in their safety by hiring a security company, namely CrowdStrike."

estimates that "CrowdStrike's update affected 8.5 million Windows devices, or less than one percent of all Windows machines."

What led to the outage?

Vaniea, a member of the Cybersecurity and Privacy Institute (CPI), said she believes that, at its core, the CrowdStrike outage happened because of two related issues that were not handled well: testing and update management.

Normally when popular software needs to be changed, the vendor creates an update, tests it, and releases it for other users to install. In large organizations, like airlines, a system administrator will first check new updates to make sure they will work as expected. Often, they will also perform a "staged deployment" which means they will update the least critical computers first, ensure everything is stable and then update the most critical devices.
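The staged deployment described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `staged_deployment`, the `install` and `is_healthy` callbacks, and the machine names are all hypothetical, not any vendor's actual tooling.

```python
from typing import Callable, Iterable


def staged_deployment(
    stages: Iterable[list[str]],
    install: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> list[str]:
    """Update machines one stage at a time, least critical stage first.

    If any machine in a stage is unhealthy after the update, stop before
    touching the next (more critical) stage. Returns the machines updated.
    """
    updated: list[str] = []
    for stage in stages:
        for machine in stage:
            install(machine)
            updated.append(machine)
        # Check the whole stage before moving on to more critical machines.
        if not all(is_healthy(machine) for machine in stage):
            break
    return updated


# Hypothetical fleet, ordered from least to most critical.
stages = [["test-lab-pc"], ["office-pc-1", "office-pc-2"], ["check-in-kiosk"]]
crashed = {"office-pc-2"}  # pretend the update breaks this machine

updated = staged_deployment(
    stages,
    install=lambda machine: None,  # stand-in for a real installer
    is_healthy=lambda machine: machine not in crashed,
)
```

In this sketch the rollout halts after the office stage fails its health check, so the most critical machine (the check-in kiosk) never receives the faulty update.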


Security tools, such as anti-virus, often bypass these tests when updating lists of "signatures", which are basically lists of 'bad-thing' patterns. Tests of such files are done by the security company, but typically not by client organizations, because signatures are updated multiple times a day and are just lists, not computer code. It is also vital to block identified 'bad things' quickly; waiting for tests can be dangerous.
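To make the "just lists, not computer code" point concrete, here is a minimal sketch of signature matching. The signature values and the function name `matches_signature` are invented for illustration; real security products use far more sophisticated matching.

```python
def matches_signature(data: bytes, signatures: list[bytes]) -> bool:
    """Return True if any known 'bad-thing' pattern appears in the data."""
    return any(signature in data for signature in signatures)


# Hypothetical signature list as shipped in the morning.
signatures = [b"EVIL_PAYLOAD_V1"]
sample = b"header...NEW_THREAT_2024...footer"
before_update = matches_signature(sample, signatures)

# A signature update only appends data; the scanning code is unchanged.
signatures.append(b"NEW_THREAT_2024")
after_update = matches_signature(sample, signatures)
```

The scanner's logic stays the same across the update; only the list of patterns grows, which is why such updates are usually treated as low risk and pushed out quickly.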

Prior to this particular outage, CrowdStrike issued an update that had an error no one knew about. Their internal testing should have caught the error, but it did not, and Professor Vaniea says, "We don't know why yet..." The error only happens under very special circumstances that, in theory, should never occur. But on Friday, July 19, 2024, CrowdStrike published a configuration file that caused those special circumstances. The file was automatically downloaded by all computers running their software, bypassing all testing normally done by system administrators. The combination of the old error no one had noticed and the new configuration file caused a problem in how Windows boots up.
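How can an old, unnoticed error stay harmless until a new data file arrives? The toy example below shows the general pattern. It is not CrowdStrike's code or the actual defect; the rule format and the function `load_rules` are assumptions made purely for illustration.

```python
def load_rules(lines: list[str]) -> list[dict[str, str]]:
    """Parse rule lines of the form 'name,pattern,action'."""
    rules = []
    for line in lines:
        fields = line.split(",")
        # Latent bug: the code assumes every line has three fields and
        # never checks. Every file shipped so far happened to comply.
        rules.append(
            {"name": fields[0], "pattern": fields[1], "action": fields[2]}
        )
    return rules


# Older configuration files always had three fields, so the bug stayed hidden.
old_config = ["rule-1,evil.exe,quarantine"]
old_rules = load_rules(old_config)

# A new file with a missing field creates the "special circumstances".
new_config = ["rule-2,other.exe"]
try:
    load_rules(new_config)
    crashed = False
except IndexError:
    crashed = True
```

In an ordinary program a failure like this produces an error message. In software that runs deep inside the operating system during startup, a comparable failure can stop the whole computer from booting.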

How do people and companies impacted move forward?

Typically, issues caused by errors are fixed quickly and the public only experiences a downtime of a few hours. To illustrate, Facebook experienced a mass outage back in October 2021, but their team was able to rectify the problem in less than seven hours as reported by . However, the CrowdStrike issue is taking days to fix because system administrators must manually remove the problematic configuration file from every single affected device. The good news is that most employees with a computer science-based education are able to carry out this fix, and the majority of the companies impacted by the outage will have a dedicated team and sufficient resources to roll out the repair plan. Even so, visiting and fixing every computer takes time.

What are the implications of this global tech outage?

There are several implications that surface as a result of the CrowdStrike crisis:

Human Computer Interaction (HCI)

Professor Vaniea's research focuses specifically on exploring the reasons why both "normal" people and system administrators are reluctant to install updates. Below are just a couple of the numerous examples:

  1. Risk - for system administrators, for instance, the potential for downtime or bugs makes them hesitant to install an update.
  2. Prevention of loss - for "normal" people, sometimes the potential to lose a specific program that is beneficial to their livelihood impacts their decisions to install an update. For example, imagine you were about to give a presentation and an "Update PowerPoint" dialog popped up. You might rationally decide to delay updating until after your presentation.

Lawsuits can also cause apps to be forcibly updated. For example, in 2012, the app - which helps kids who struggle to speak due to issues like autism and cerebral palsy - was pulled from the Apple App Store due to a lawsuit. To prevent the app from vanishing, some parents disabled all communication with the Apple App Store.

[Figure: flowchart giving an overview of the stages of updating and the issues respondents experience at each stage]

Automatic updates are key

Professor Vaniea says "automatically updating is much safer" and advises organizations and users to get on board with this strategy if it is not implemented already. After an update is released, attackers look at the updated code, learn what it is fixing, and then write attack code to target that specific thing. Those who update quickly will be protected, but those who update slowly risk attack. For example, in 2017 Equifax suffered a data breach. An update that could have fixed the problem was available for two months before the breach. Vaniea says, "If Equifax had installed it, they would not have lost the data."

Lack of diversity, power grids, and cyberattacks

The CrowdStrike tech outage is partially a testament to CrowdStrike's market reach. "All these organizations are either CrowdStrike clients or someone who depends on a CrowdStrike client," states Vaniea. She goes on to say, "While great for CrowdStrike, having so many organizations depend on one company creates a lack of diversity. If a vulnerability or flaw is found in that one dependency, then everyone depending on it is impacted."

If one looks beyond large tech companies to infrastructure like power and water grids, it is natural to worry that something similar might happen. While possible, Vaniea points out that most of these systems are fairly old, predating the modern tendency to network everything. That means they are all running different software and while that software is likely not 100% secure, an attack that works on one part of the grid is unlikely to work on other parts. In other words, these grids have a diversity of systems, so one attack is unlikely to impact the whole grid.

Vaniea says, "The modern approach of running the same software on many systems makes it easier to keep them updated, fix identified vulnerabilities, and provide maintenance. But it also means that if an attacker finds a vulnerability, then every computer of that type is vulnerable."
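The trade-off between uniform and diverse systems can be illustrated with a rough calculation. The fleets, product names, and the function `affected_fraction` below are all made up for the example.

```python
def affected_fraction(fleet: dict[str, str], flawed_product: str) -> float:
    """Share of machines running the product that has the flaw."""
    if not fleet:
        return 0.0
    hits = sum(1 for product in fleet.values() if product == flawed_product)
    return hits / len(fleet)


# Hypothetical fleets: machine name -> software it depends on.
uniform_fleet = {"pc-1": "VendorA", "pc-2": "VendorA",
                 "pc-3": "VendorA", "pc-4": "VendorA"}
diverse_fleet = {"pc-1": "VendorA", "pc-2": "VendorB",
                 "pc-3": "VendorC", "pc-4": "VendorD"}

uniform_impact = affected_fraction(uniform_fleet, "VendorA")
diverse_impact = affected_fraction(diverse_fleet, "VendorA")
```

With identical software everywhere, a single flaw reaches every machine in this sketch; with four different products, the same flaw reaches only a quarter of them, at the cost of four times as many products to maintain.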

Further research and understanding of human behaviour with computers, enforcing automatic updates, and increasing diversity will help prevent a CrowdStrike-like disaster in the future.