Did Microsoft massively screw up by not guarding against this particular failure mode? Oh, absolutely, everyone agrees on that.
I’m sorry, this is wrong, and that everyone thinks so is also wrong—some people got this right.
Normal Windows kernel drivers are sandboxed to some extent. If a driver segfaults, it will be disabled on the next boot and the user informed; if that fails for some reason, you can tell the computer to boot into ‘safe mode’, and if that fails, there is recovery mode. None of these options involves the manual, tedious, error-prone recovery procedure that the Crowdstrike bug requires.
This is because the Crowdstrike driver wants to protect you from malware in other drivers (i.e. malicious kernel modules). (ETA: And it wants to inspect and potentially block syscalls from user-space applications too.) So it runs in early-loading mode, before any other drivers, and it has special elevated privileges normal drivers do not get, in order to be able to inspect other drivers loaded later.
(ETA: I’ve seen claims that Crowdstrike also adds code to UEFI, and maybe elsewhere, to prevent anyone from disabling it and to re-enable it if someone does, which is another reason the normal Windows failsafes for crashing drivers wouldn’t work. UEFI is, roughly, the system code that runs before your OS and bootloader.)
The fact that you cannot boot Windows if Crowdstrike does not approve it is not a bug in either Windows or Crowdstrike. It’s by design, and it’s an advertised feature! It’s supposed to make sure your boot process is safe, and if that fails, you’re not supposed to be able to boot!
Crowdstrike are to be blamed for releasing this bug at all, and for releasing it without testing, without a rolling release, without a better fallback/recovery procedure, etc. Crowdstrike users who run mission-critical stuff like emergency response are also to blame for using Crowdstrike; as many people have pointed out, the decision to use it is usually pure box-ticking security theater on the part of the individuals making that choice. And Crowdstrike is absolutely not at the quality level of the core Windows kernel, and this is not a big surprise to experts.
But Crowdstrike—the product, as advertised—cannot be built without the ability to render your computer unbootable, at least not without a radically different (micro)kernel design that no mainstream OS kernel in the world actually has. (A more reasonable approach these days would be to use a hypervisor, which can be much smaller and simpler and so easier to vet for bugs, as the recovery mechanism.)
Addendum 2: this particular quoted comment is very wrong, and I expect this is indicative of the quality of the quoted discussion, i.e. these people do not know what they are talking about.
Luke Parrish: Microsoft designed their OS to run driver files without even a checksum and you say they aren’t responsible? They literally tried to execute a string of zeroes!
Luke Parrish: CrowdStrike is absolutely to blame, but so is Microsoft. Microsoft’s software, Windows, is failing to do extremely basic basic checks on driver files before trying to load them and give them full root access to see and do everything on your computer.
The reports I have seen (of attempted reverse-engineering of the Crowdstrike driver’s segfault) say it did not attempt to execute the zeroes from the file as code; the crash likely happened while trying to parse the file, not while executing its contents. Context: the original workaround for the problem was to delete a file which contains only zeroes (at least on some machines; reports are inconsistent), but there’s no direct reason to think the driver tries to execute this file as code.
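(To make the distinction concrete: a parser that trusts sizes and offsets read from a content file will fault on an all-zero or truncated input long before anything in it could be ‘executed’. The channel-file format is proprietary and undocumented, so everything in the sketch below, the struct layout, the magic value, and the entry size, is invented purely for illustration; it only shows the kind of validation such a parser needs.)

```c
/* Purely illustrative: the real channel-file format is proprietary and
 * undocumented, so this layout, magic value, and entry size are invented.
 * The point is only that a parser must validate sizes and offsets before
 * dereferencing them; an all-zero file fails these checks immediately. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct channel_header {
    uint32_t magic;         /* constant identifying the format (hypothetical) */
    uint32_t version;
    uint32_t entry_count;   /* number of fixed-size entries that follow */
    uint32_t entry_offset;  /* byte offset of the first entry */
};

#define CHANNEL_MAGIC 0xC0FFEE01u  /* made-up value */
#define ENTRY_SIZE    16u          /* made-up entry size */

static bool channel_file_looks_valid(const uint8_t *buf, size_t len)
{
    const struct channel_header *h = (const struct channel_header *)buf;

    if (buf == NULL || len < sizeof *h)
        return false;   /* missing or truncated file */
    if (h->magic != CHANNEL_MAGIC)
        return false;   /* an all-zero file is rejected right here */
    if (h->entry_offset > len)
        return false;   /* offset must stay inside the buffer */
    if (h->entry_count > (len - h->entry_offset) / ENTRY_SIZE)
        return false;   /* declared entries must fit in the remaining bytes */
    return true;
}
```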
And: Windows does not run drivers “without a checksum”! Drivers have to be signed by Microsoft, and drivers with early-loading permissions have to be super-duper-signed in a way you probably can’t get just by paying them a few thousand dollars.
But it’s impossible to truly review or test a compiled binary, for which you have no source code or even debug symbols, and which is deliberately obfuscated in many ways (as people have been reporting when they looked at this driver crash) because it’s trying to defend itself against reverse-engineers designing attacks. And of course it’s impossible to formally prove that a program is correct. And of course it’s written in a memory-unsafe language, i.e. C++, because every single OS kernel and its drivers are written in such a language.
Furthermore, the Crowdstrike product relies on very quickly pushing out updates to (everyone else’s) production to counter new threats / vulnerabilities being exploited. Microsoft can’t test anything that quickly. Whether Crowdstrike can test anything that quickly, and whether you should allow live updates to be pushed to your production system, is a different question.
Anyway, who’s supposed to pay Microsoft for extensive testing of Crowdstrike’s driver? They’re paid to sign off on the fact that Crowdstrike are who they say they are, and at best that they’re not a deliberately malicious actor (as far as we know they aren’t). Third party Windows drivers have bugs and security vulnerabilities all the time, just like most software.
Finally, Crowdstrike to an extent competes with Microsoft’s own security products (i.e. Microsoft Defender and whatever the relevant enterprise-product branding is); we can’t expect Microsoft to invest too much in finding bugs in Crowdstrike!
And of course it’s impossible to formally prove that a program is correct.
That’s not true. This industry has a lot of experience formally proving the correctness of mission-critical software, even in the previous century: the Aegis missile cruiser fire control system is one example, and software for the Ariane rockets is another, although the Ariane V maiden flight disaster is one illustration that even a formal proof does not fully guarantee safety. Nevertheless, having a formal proof tends to dramatically increase the level of safety.
But this kind of formal proof is very expensive; it increases the cost of software by at least an order of magnitude, and probably more than that. That’s why Zvi’s suggestion
Davidad: use LLMs to reimplement all kernel-mode software with formal verification.
How about we use human software engineers to do the rebuild, instead?
does not seem realistic. One does need heavy AI assistance to bring the cost of making formally verified software down to affordable levels.
It’s impossible to prove that an arbitrary program, which someone else gave you, is correct. That’s halting-problem equivalent, or Rice’s theorem, etc.
Yes, we can prove various properties of programs we carefully write to be provable, but the context here is that a black-box executable Crowdstrike submits to Microsoft cannot be proven reliable by Microsoft.
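For reference, the result behind that claim is Rice’s theorem: writing $\varphi_e$ for the partial function computed by program $e$, every non-trivial semantic property of programs has an undecidable index set,

$$S_{\mathcal{C}} = \{\, e : \varphi_e \in \mathcal{C} \,\}, \qquad \emptyset \subsetneq S_{\mathcal{C}} \subsetneq \mathbb{N} \;\Longrightarrow\; S_{\mathcal{C}} \text{ is undecidable,}$$

and “correct with respect to a given specification” is such a property for any specification that some programs satisfy and others do not, so no general procedure can decide it for an arbitrary binary handed to you.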
There are definitely improvements we can make. Counting just the ones already made in (parts of) some other operating systems, we could:
Rewrite in a memory-safe language like Rust
Move more stuff to userspace. Drivers for e.g. USB devices can and should be written in userspace, using something like libusb (see the sketch after this list). This goes for every device that doesn’t need performance-critical code or device-side DMA management, which still leaves a bunch of things, but it’s a start.
Sandbox more kinds of drivers in a recoverable way, so they can do the things they need to efficiently access hardware, but are still prevented from accessing the rest of the kernel and userspace, and can ‘crash safe’. For example, Windows can recover from crashes in graphics drivers specifically—which is an amazing accomplishment! Linux eBPF can’t access stuff it shouldn’t.
Expose more kernel features via APIs, so people don’t have to write drivers to do things that aren’t literally driving a piece of hardware. Then even if Crowdstrike has super-duper-permissions, a crash in Crowdstrike itself doesn’t bring down the rest of the system; it would have to do that intentionally.
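As a concrete illustration of the userspace-driver item above, here is a minimal libusb sketch; the vendor/product IDs and the endpoint number are placeholders for whatever a real device would use. The point is the failure domain: a bug here takes down one unprivileged process, not the kernel.

```c
/* Minimal userspace "driver" sketch using libusb: a bug here crashes this
 * process, not the kernel. The VID/PID and endpoint are placeholders. */
#include <stdio.h>
#include <libusb-1.0/libusb.h>

#define VENDOR_ID  0x1234   /* placeholder */
#define PRODUCT_ID 0x5678   /* placeholder */

int main(void)
{
    libusb_context *ctx = NULL;
    if (libusb_init(&ctx) != 0) {
        fprintf(stderr, "libusb_init failed\n");
        return 1;
    }

    libusb_device_handle *dev =
        libusb_open_device_with_vid_pid(ctx, VENDOR_ID, PRODUCT_ID);
    if (!dev) {
        fprintf(stderr, "device not found or no permission\n");
        libusb_exit(ctx);
        return 1;
    }

    if (libusb_claim_interface(dev, 0) == 0) {
        unsigned char buf[64];
        int transferred = 0;
        /* Read up to 64 bytes from IN endpoint 0x81, 1 second timeout. */
        int rc = libusb_bulk_transfer(dev, 0x81, buf, sizeof buf,
                                      &transferred, 1000);
        if (rc == 0)
            printf("read %d bytes\n", transferred);
        libusb_release_interface(dev, 0);
    }

    libusb_close(dev);
    libusb_exit(ctx);
    return 0;
}
```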
Of course any such changes both cost a lot and take years or decades to become ubiquitous. Windows in particular has an incredible backwards compatibility story, which in practice means backwards compatibility with every past bug they ever had. But this is a really valuable feature for many users who have old apps and, yes, drivers that rely on those bugs!
It’s impossible to prove that an arbitrary program, which someone else gave you, is correct.
That is, of course, true. The chances that an arbitrary program is correct are very close to zero, and one can’t prove a false statement. So one should not even try. An arbitrary program someone gave you is almost certainly incorrect.
The standard methodology for formally correct software is joint development of a program and of a proof of its correctness. One starts from a specification, and refines it into a proof and a program in parallel.
One can’t write a program and then try to prove its correctness as an afterthought. The goal of having formally verified software needs to be present from the start, and then there are methods for creating this kind of software jointly with a proof of its correctness (but these methods are currently very labor-intensive).
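A toy illustration of that workflow, at nothing like the scale of an Aegis or kernel verification effort: in a proof assistant such as Lean, the specification, the program, and the proof are written together and checked mechanically. The clamp function and the property below are invented purely as an example.

```lean
-- Toy example of program + proof developed together (Lean 4).
-- Spec (informal): clamp lo hi x returns a value that never exceeds hi,
-- provided lo ≤ hi.

def clamp (lo hi x : Nat) : Nat :=
  if x < lo then lo else if x > hi then hi else x

theorem clamp_le (lo hi x : Nat) (h : lo ≤ hi) : clamp lo hi x ≤ hi := by
  unfold clamp
  split
  · exact h                -- x < lo: result is lo, and lo ≤ hi by assumption
  · split
    · exact Nat.le_refl hi -- x > hi: result is hi
    · omega                -- otherwise: result is x, and ¬(x > hi) gives x ≤ hi
```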
(And yes, perhaps the Windows environment is too messy to deal with formally. Although one would think that fire control for fleet missile defense would be fairly messy as well, yet people claimed that they created verified Ada code for that back in the 1990s (or perhaps the late 1980s, I am not sure). The numbers they quoted during a mid-1990s talk were 500 thousand lines of Ada and 50 million dollars (it would be far more expensive today).)
It would specifically be impossible to prove the Crowdstrike driver safe because, by necessity, it regularly loads new data provided by Crowdstrike threat intelligence, and changes its behavior based on those updates.
Even if you could get the CS driver to refuse to load new updates without proving certain attributes of those updates, you would also need some kind of assurance of the characteristics of every other necessary part of the Windows OS, in every future update.
No, let’s keep in mind the Aegis fire control for missile defense example.
This is a highly variable situation: the “enemy action” can come in many forms, from multiple directions at once; the weather can change rapidly; the fleet to defend might have a variety of compositions and spatial distributions; and so on. So one deals with a lot of variable and unpredictable factors. Yet they were able to formally establish some properties of that software, presumably to the satisfaction of their DoD customers.
It does not mean that they have a foolproof system, but the reliability is likely much better because of that effort at formal verification of the software.
With Windows, who knows; perhaps it is even more complex than that. But formal methods are often able to account for a wide range of external situations and data. For a number of reasons, they nevertheless don’t provide a full guarantee (there is a trap of thinking “formally verified ⇒ absolutely safe”; it’s important not to get caught in that trap; “formally verified” just means “much more reliable in practice”).
I was trying to address the general point of whether provably correct software is possible (obviously yes, since it is actually practiced occasionally for some mission-critical systems). I don’t know if it makes sense in the context of Windows kernels. From what people have recently been saying about Windows, Microsoft claims that the European regulator forced them not to refuse CrowdStrike-like updates (so much for thinking about “what could or should be done in a sane world”).
Sure, that’s fair enough. I was thinking in the context of “formal verification that would have prevented this outage.”
Addendum: Crowdstrike also has MacOS and Linux products, and those are a useful comparison in the matter of whether we should be blaming Microsoft.
On MacOS they don’t have a kernel module (called a kext on MacOS), for two reasons: first, kexts are now disabled by default (I think you have to go to recovery mode to turn them on?), and second, the kernel provides APIs to accomplish most things without having to write a kext. So Crowdstrike doesn’t need to (hypothetically) guard against malicious kexts, because those are not nearly as much of a threat as malicious or plain buggy kernel drivers are on Windows.
One reason why this works well is that MacOS only supports a small first-party set of hardware, so Apple doesn’t need to allow a bunch of third-party vendor drivers like Windows does. Microsoft can’t forbid third-party kernel drivers: there are probably tens of thousands of legitimate ones that can’t be replaced easily or at all, even if someone were available to port old code to hypothetical new userland APIs. (Although Microsoft could provide much better userland APIs for new code; e.g. WinUSB seems to be very limited.)
(Note: I am not a Mac user and this part is not based on personal expertise.)
On Linux, Crowdstrike uses eBPF, which is a (relatively novel) domain-specific language for writing code that will execute inside the Linux kernel at runtime. eBPF is sandboxed in the kernel, and while an eBPF program can (I think) crash the kernel, it cannot e.g. access arbitrary memory. And so you can’t use eBPF to guard against malicious Linux kernel modules.
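For flavor, here is roughly what a minimal eBPF program looks like under libbpf conventions; it is a generic example that counts execve calls, and bears no resemblance to Falcon’s actual sensor. The in-kernel verifier checks every pointer access and branch before the program is allowed to load, which is what the sandboxing above refers to.

```c
/* Minimal sketch of a sandboxed eBPF program (libbpf style), counting
 * execve calls; it only illustrates that eBPF code is verified and
 * confined by the kernel before it runs. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, __u64);
} exec_count SEC(".maps");

SEC("tracepoint/syscalls/sys_enter_execve")
int count_execve(void *ctx)
{
    __u32 key = 0;
    __u64 *val = bpf_map_lookup_elem(&exec_count, &key);

    /* The in-kernel verifier rejects this program unless the NULL check
     * is present: arbitrary memory access simply will not load. */
    if (val)
        __sync_fetch_and_add(val, 1);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```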
The eBPF approach is indeed superior, but it’s hard to blame Microsoft for not having in place an innovation that nobody had ten years ago, and that hasn’t exactly replaced most preexisting drivers even on Linux; removing support for custom drivers entirely would probably stop Windows from working on most of the hardware out there.
Then again, most Linux systems aren’t running a hardened configuration, and getting userspace root access is game over anyway: the attacker can just install a new kernel for the next boot, if nothing else. To a first approximation, Linux systems are secure by configuration, not by architecture.
ETA: I’m seeing posts [0] that say Crowdstrike updates broke Linux installations, multiple times over the years. I don’t know how they did that specifically, but it doesn’t require a kernel module to make a machine crash or become unbootable… But I have not checked those particular reports.
Either way, no-one is up in arms saying to blame Linux for having a vulnerable kernel design!
[0] https://news.ycombinator.com/item?id=41005936, and others.
Thanks!
Crowdstrike is absolutely not at the quality level of the core Windows kernel, and this is not a big surprise to experts.
I wonder if there are players in that space who are on that quality level.
Or would the only reliable solution be for Microsoft itself to provide this functionality as one of their own optional services? (Microsoft should hopefully be able to maintain the quality level on par with their own core Windows kernel.)