Introduction
25 years ago, I said (really, I did!) that automatic software updates pose a greater risk than malware (ok, at that time we really only had viruses).
Many incidents since then have proven this right, but none more so than the CrowdStrike Falcon Blue Screen of Death (BSOD) incident on July 19, 2024.
Since, as usual, the company won't release any detailed information on what really happened, we'll have to rely on other sources. I found Dave Plummer's account on YouTube to be very good and trustworthy.
What really happened?
In summary, after looking at crash dumps and based on his own knowledge of how the Windows kernel works, Dave Plummer explains that what happened was probably the following.
CrowdStrike needs to check not only file signatures, but the behavior of software on the system in general. To do this they have created a device driver that doesn't actually interact with any hardware, but has obtained a WHQL release signature. This means that it's very likely reliable, and certifiably from a known source. They also flagged it as a boot-start driver, meaning it's treated as required to boot the system. This is of course to make sure it really does get loaded, which is great - as long as it doesn't crash the system. Which, unfortunately, it did.
So how did this signed, certified driver crash the system? In short, by CrowdStrike hacking the protocol and Microsoft allowing it to happen. We don't know exactly what they do, but the challenge they have is that they feel the need to frequently update what behavior to watch for, and in order to do so, they essentially need to be able to update program logic frequently. They could do so by building a new driver with the new logic, and getting it WHQL signed. Two problems for them here. First, it takes time to get a driver certified. Second, updating a driver is not done on-the-fly, so a reboot is required.
The "solution"? To provide the driver with instructions that define the logic to execute, in other words, one form or another of P-code - or even machine code! Thus, they can keep the same driver, but update the logic by conceptually having the driver "call out" to external logic. This is what CrowdStrike misleadingly calls a "content update". I would call it a code update.
Essentially, by allowing the driver to read and perform instructions based on external "content", regardless of what you call it, they effectively bypass the whole point of WHQL certification of kernel mode drivers.
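To make the point concrete, here is a minimal sketch in Python. It is purely hypothetical: we don't know CrowdStrike's actual format or logic, and the rule layout, the `run_rules` function and the example event are all my own inventions for illustration. The "engine" stands in for the signed driver and never changes, yet what it does is decided entirely by the "content" it is handed.

```python
# Hypothetical illustration only: NOT CrowdStrike's format or logic.
# A fixed "engine" (standing in for the signed driver) whose behavior is
# decided entirely by rules supplied from outside it.

def run_rules(rules, event):
    """Evaluate externally supplied rules against one event.

    Each rule is a tuple (field, op, value, verdict). The engine code stays
    the same; swapping the rules changes what it does.
    """
    ops = {
        "eq": lambda a, b: a == b,
        "contains": lambda a, b: b in a,
    }
    for field, op, value, verdict in rules:
        if ops[op](event.get(field, ""), value):
            return verdict
    return "allow"


event = {"process": "powershell.exe", "cmdline": "-enc AAAA"}

# Two "content updates" for the very same engine.
rules_v1 = [("process", "eq", "mimikatz.exe", "block")]
rules_v2 = rules_v1 + [("cmdline", "contains", "-enc", "block")]

print(run_rules(rules_v1, event))  # allow
print(run_rules(rules_v2, event))  # block
```

Same code, different outcome. Whatever was certified, it wasn't the behavior you get after the next "content update".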
In the end though, it appears that it's just a trivial, embarrassing bug in the driver that caused the crash itself, in turn triggered by some equally trivial, embarrassing process error in CrowdStrike's deployment of "content updates".
The "content update" that CrowdStrike sent out was full of zeroes. Nothing else. Obviously not the intended content. And this simple data caused the driver to crash, in turn causing the system to crash since it doesn't have much choice in this situation.
As the driver is also marked as a boot-start driver, it'll always get loaded on reboot, even if it crashed last time. This is what makes it so time-consuming to fix: the driver can't just be flagged as faulty.
This is just the mark of plain, really really bad software. This software runs in kernel mode with full privileges and can do anything. And giving it a bunch of zeroes as input crashes it. Simply. Not. Good. Enough.
Imagine what a creative hacker could do with that by inserting something more malicious than zeroes. How does full system control at kernel mode sound? This is of course speculation, but it's highly likely to be possible with the current version of the driver.
What everyone seems to be missing...
In all the aftermath and all the comments, there are, in my view, two glaring omissions in the analyses.
The big failure
How is it possible that someone sends out an update affecting the behavior of kernel mode code simultaneously to millions and millions of systems around the whole globe!?
I've participated in many rollouts, and never would I allow a big-bang rollout like this. CrowdStrike should be charged with negligence for having this type of process. It's just plain irresponsible.
The only reasonable way to do global rollouts, especially for kernel mode code, is to stagger them. Start with 10 systems. What happens? Do they respond properly? Then 100, then 1,000, etc. And since it's a global rollout, take time zones into consideration! Instead, we saw how the problems rolled around the globe, starting in Australia, with reports of downed systems as the working day started there.
I understand they want to get updates out at speed, but this is ridiculous. In the end, this caused way more damage than any possible threat they could have stopped by this procedure.
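The staggered approach is not hard to sketch. The Python below is a minimal illustration under my own assumptions, not a description of any real deployment system: `staged_rollout`, `deploy` and `is_healthy` are hypothetical names, and a real process would also wait between waves long enough for the systems to reboot and report back.

```python
def staged_rollout(hosts, deploy, is_healthy, first_wave=10, growth=10):
    """Deploy in ever larger waves, halting after the first unhealthy wave.

    Returns (hosts_updated, completed). `deploy` and `is_healthy` are
    caller-supplied callables and are placeholders in this sketch.
    """
    updated = []
    start, wave_size = 0, first_wave
    while start < len(hosts):
        wave = hosts[start:start + wave_size]
        for host in wave:
            deploy(host)
        updated.extend(wave)
        # A real rollout would pause here until the wave has reported back.
        if not all(is_healthy(host) for host in wave):
            return updated, False  # halt: the rest of the fleet is untouched
        start += len(wave)
        wave_size *= growth  # 10, 100, 1000, ...
    return updated, True


fleet = [f"host-{i}" for i in range(5000)]

# A bad update: every host that receives it fails its health check.
hit, completed = staged_rollout(fleet, lambda h: None, lambda h: False)
print(len(hit), completed)  # 10 False
```

With a bad update, 10 systems go down instead of the whole fleet. The order of `hosts` is where time zones would come in, for instance by making sure the early waves don't all land in a region where the working day is just starting.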
The smaller failure
The smaller failure, but one still not mentioned by analysts, is the fact that this kernel mode driver that accepts external input apparently does no input validation worth the name. Perhaps the content is digitally signed (I don't know, but no one has mentioned it either), but even if it is, this type of software must assume that it can't trust external content. According to Dave Plummer, the "content update" in question was all zeroes, so apparently it at least carried no embedded digital signature.
If the content is not signed, all an attacker would have to do is deposit a file in %WINDIR%\System32\drivers\CrowdStrike with a name such as C-00000291*.sys containing zeroes, and the system becomes unbootable without manual intervention!
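For what it's worth, the bare minimum of validation I have in mind looks something like the Python sketch below. The file layout is entirely made up (a 4-byte marker, a payload length and a CRC32); we don't know the real format, and `MAGIC`, `pack_content` and `load_content` are hypothetical names. Note also that a checksum only catches accidents such as a file full of zeroes. Stopping a deliberate attacker would additionally require verifying a proper digital signature.

```python
import struct
import zlib

MAGIC = b"CONT"                  # hypothetical 4-byte marker
HEADER = struct.Struct("<4sII")  # magic, payload length, CRC32 of payload


def pack_content(payload: bytes) -> bytes:
    """Build a well-formed blob (what the publishing side would do)."""
    return HEADER.pack(MAGIC, len(payload), zlib.crc32(payload)) + payload


def load_content(blob: bytes) -> bytes:
    """Return the payload, or raise ValueError if the blob is malformed."""
    if len(blob) < HEADER.size:
        raise ValueError("too short for a header")
    magic, length, crc = HEADER.unpack_from(blob)
    if magic != MAGIC:
        raise ValueError("bad magic")
    payload = blob[HEADER.size:]
    if length == 0 or length != len(payload):
        raise ValueError("length mismatch")
    if zlib.crc32(payload) != crc:
        raise ValueError("checksum mismatch")
    return payload


print(load_content(pack_content(b"rule data")))  # b'rule data'

try:
    load_content(bytes(4096))    # a file containing nothing but zeroes
except ValueError as err:
    print("rejected:", err)      # rejected: bad magic
```

The point is what happens on failure: the bad file is rejected and the previous, known-good content stays in use. Nothing crashes.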
Once again, this is just not good enough, and should be cause for some lawsuits in the coming months. A manufacturer of critical security software executing in kernel mode should not be allowed to sell code this bad without financial consequences.