Introduction
25 years ago, I said (really, I did!) that automatic software updates pose a greater risk than malware (ok, at that time we really only had viruses).
Many incidents since then have proven this right, but none more so than the CrowdStrike Falcon Blue Screen of Death (BSOD) incident on July 19, 2024.
Since, as usual, the company won't release any detailed information on what really happened, we'll have to rely on other sources. I found Dave Plummer's account on YouTube very good and trustworthy.
What really happened?
In summary, after looking at crash dumps and based on his own knowledge of how the Windows kernel works, Dave Plummer explains that what happened was probably the following.
CrowdStrike needs to check not only file signatures, but the behavior of software on the system in general. To do this they have created a device driver that doesn't actually interact with any hardware, but has obtained a WHQL release signature. This means that it's very likely reliable, and certifiably from a known source. They also flagged it as a boot-start driver, meaning it's really needed to boot the system. This is of course to make sure it really does get loaded, which is great - as long as it doesn't crash the system. Which, unfortunately, it did.
So how did this signed, certified driver crash the system? In short, by CrowdStrike hacking the protocol and Microsoft allowing it to happen. We don't know exactly what they do, but the challenge they have is that they feel the need to frequently update what behavior to watch for, and in order to do so, they essentially need to be able to update program logic frequently. They could do so by building a new driver with the new logic, and getting it WHQL signed. That poses two problems for them. First, it takes time to get a driver certified. Second, updating a driver is not done on-the-fly, so a reboot is required.
The "solution"? To provide the driver with instructions that define the logic to execute, in other words, one form or another of P-code - or even machine code! Thus, they can keep the same driver, but update the logic by conceptually having the driver "call out" to external logic. This is what CrowdStrike misleadingly calls a "content update". I would call it a code update.
Essentially, by allowing the driver to read and perform instructions based on external "content", regardless of what you call it, they effectively bypass the whole point of WHQL certification of kernel mode drivers.
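To illustrate why I call it a code update, here is a toy sketch in Python. It is purely hypothetical: the opcodes, the rule format and the function name are invented for the illustration and have nothing to do with CrowdStrike's actual, unpublished format. The point is only that the "driver" function never changes, yet its behavior does.

```python
from typing import List, Tuple

# Toy illustration only. The opcodes and the content format are invented here.
Rule = Tuple[str, str]  # (opcode, argument)


def run_content(content: List[Rule], event: str) -> str:
    """Fixed 'driver' logic: interpret externally supplied rules against an event."""
    for opcode, arg in content:
        if opcode == "BLOCK_IF_CONTAINS" and arg in event:
            return "block"
        if opcode == "ALLOW_IF_CONTAINS" and arg in event:
            return "allow"
    return "allow"  # default when no rule matches


# Two "content updates" for the same, unchanged interpreter.
content_v1 = [("BLOCK_IF_CONTAINS", "suspicious")]
content_v2 = [("ALLOW_IF_CONTAINS", "suspicious")]

print(run_content(content_v1, "suspicious_tool.exe"))
print(run_content(content_v2, "suspicious_tool.exe"))
```

The same certified interpreter gives opposite verdicts depending on the content it is fed, which is why certifying only the interpreter says little about what the system will actually do.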
In the end though, it appears that it's just a trivial, embarrassing bug in the driver that caused the crash itself, in turn triggered by some equally trivial, embarrassing process error during CrowdStrike's deployment of "content updates".
The "content update" that CrowdStrike sent out was full of zeroes. Nothing else. Obviously not the intended content. And this simple data caused the driver to crash, in turn causing the system to crash since it doesn't have much choice in this situation.
As the driver is also marked as a boot-start driver, it'll always get loaded on reboot, even if it crashed last time. This is what makes it so time-consuming to fix: the driver can't just be flagged as faulty.
This is simply the mark of really, really bad software. This software runs in kernel mode with full privileges and can do anything. And giving it a bunch of zeroes as input crashes it. Just. Not. Simply. Good. Enough.
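What would "good enough" minimally look like? Here is a sketch in Python, assuming a made-up content format with a magic marker, a length field and a CRC32. The real file format isn't public, so `MAGIC`, `HEADER`, `pack_content` and `parse_content` are all hypothetical names. The idea is simply that malformed input gets rejected with an error instead of being interpreted.

```python
import struct
import zlib

# Hypothetical layout: 4-byte magic, payload length, CRC32 of payload (little-endian).
MAGIC = b"CNT1"
HEADER = struct.Struct("<4sII")


class InvalidContent(ValueError):
    """Raised when a content blob fails basic sanity checks."""


def pack_content(payload: bytes) -> bytes:
    """Build a blob in the hypothetical format (what the vendor side would do)."""
    return HEADER.pack(MAGIC, len(payload), zlib.crc32(payload)) + payload


def parse_content(blob: bytes) -> bytes:
    """Return the payload, or raise InvalidContent instead of trusting the data."""
    if len(blob) < HEADER.size:
        raise InvalidContent("too short")
    magic, length, crc = HEADER.unpack_from(blob)
    if magic != MAGIC:
        raise InvalidContent("bad magic")
    payload = blob[HEADER.size:]
    if length == 0 or length != len(payload):
        raise InvalidContent("length mismatch")
    if zlib.crc32(payload) != crc:
        raise InvalidContent("checksum mismatch")
    return payload


try:
    parse_content(bytes(4096))  # a file full of zeroes
except InvalidContent as err:
    print("rejected:", err)
```

A checksum like this only guards against accidents such as a zero-filled file. It does nothing against someone who deliberately crafts a file.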
Imagine what a creative hacker could do with that, inserting something more malicious than zeroes? How does full system control at kernel mode sound? This is of course speculation, but it's highly likely that it's possible with the current version of the driver.
What everyone seems to be missing...
In all the aftermath and all the comments, there are, in my view, two glaring omissions in the analyses.
The big failure
How is it possible that someone sends out an update affecting the behavior of kernel mode code simultaneously to millions and millions of systems around the whole globe!?
I've participated in many rollouts, and never would I allow a big-bang rollout like this. CrowdStrike should be charged with negligence for having this type of process. It's just plain irresponsible.
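The alternative to a big bang is a staged rollout. Here is a rough sketch in Python of what I mean, not a description of any vendor's actual process. The ring sizes (1%, 10%, 50%, 100%), the failure threshold and the `deploy_and_check` callback are all assumptions made up for the example.

```python
from typing import Callable, List, Sequence, Tuple


def staged_rollout(
    hosts: Sequence[str],
    deploy_and_check: Callable[[str], bool],
    ring_percentages: Sequence[int] = (1, 10, 50, 100),  # cumulative, assumed values
    max_failure_rate: float = 0.01,  # assumed threshold
) -> Tuple[List[str], bool]:
    """Deploy ring by ring; halt as soon as one ring's failure rate is too high.

    deploy_and_check(host) is a hypothetical callback that updates one host
    and returns True if the host is still healthy afterwards.
    Returns (hosts that received the update, whether the rollout completed).
    """
    updated: List[str] = []
    start = 0
    for percent in ring_percentages:
        # Each ring extends the cumulative share of hosts, by at least one host.
        end = min(len(hosts), max(start + 1, len(hosts) * percent // 100))
        ring = hosts[start:end]
        if not ring:
            continue
        failures = sum(1 for host in ring if not deploy_and_check(host))
        updated.extend(ring)
        if failures / len(ring) > max_failure_rate:
            return updated, False  # stop: the remaining hosts are never touched
        start = end
    return updated, True


hosts = [f"host{i}" for i in range(1000)]
touched, completed = staged_rollout(hosts, lambda host: False)  # every update breaks the host
print(len(touched), completed)
```

With an update that breaks every host it lands on, the damage stops at the first ring instead of reaching the whole fleet.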
The smaller failure
If it's not - all an attacker would have to do is to deposit a file in %WINDIR%\System32\drivers\CrowdStrike with a name such as C-00000291*.sys containing zeros - and the system becomes unbootable without manual intervention!
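The scenario above depends on the driver accepting whatever file it finds in that directory. Here is a sketch of the opposite behavior in Python, assuming content files carried an authentication tag that is checked before anything is parsed. I use an HMAC from the standard library purely as a stand-in. A real design would use an asymmetric signature so that no secret lives on the endpoint, and `seal`, `open_sealed` and the key are hypothetical.

```python
import hashlib
import hmac
from typing import Optional

TAG_SIZE = hashlib.sha256().digest_size  # 32 bytes


def seal(payload: bytes, key: bytes) -> bytes:
    """Vendor side: prepend an authentication tag to the payload."""
    return hmac.new(key, payload, hashlib.sha256).digest() + payload


def open_sealed(blob: bytes, key: bytes) -> Optional[bytes]:
    """Driver side: return the payload only if the tag verifies, else None."""
    if len(blob) <= TAG_SIZE:
        return None
    tag, payload = blob[:TAG_SIZE], blob[TAG_SIZE:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    # compare_digest avoids leaking how many leading bytes matched.
    return payload if hmac.compare_digest(tag, expected) else None


key = b"demo-key-for-illustration-only"
print(open_sealed(seal(b"rules", key), key))
print(open_sealed(bytes(4096), key))  # a dropped file full of zeros
```

A file that someone merely drops into the directory carries no valid tag, so it would be ignored rather than interpreted.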