In a comment here, Eliezer observed that:

OpenBSD treats every crash as a security problem, because the system is not supposed to crash and therefore any crash proves that our beliefs about the system are false and therefore our beliefs about its security may also be false because its behavior is not known.
And my reply to this grew into something that I think is important enough to make as a top-level shortform post.
It’s worth noticing that this is not a universal property of high-paranoia software development, but an unfortunate consequence of using the C programming language and of systems programming. In most programming languages and most application domains, crashes only rarely point to security problems. OpenBSD is this paranoid, and needs to be this paranoid, because its architecture is fundamentally unsound (albeit unsound in a way that all the other operating systems born in the same era are also unsound). This presents a number of analogies that may be useful for thinking about future AI architectural choices.
C has a handful of operations (use-after-free, buffer overflow, and a few multithreading-related things) which expand false beliefs in one area of the system into major problems in seemingly-unrelated areas. The core mechanic of this is that, once you’ve corrupted a pointer or an array index, this generates opportunities to corrupt other things. Any memory-corruption attack surface you search through winds up yielding more opportunities to corrupt memory, in a supercritical way, eventually yielding total control over the process and all its communication channels. If the process is an operating system kernel, there’s nothing left to do; if it’s, say, the renderer process of a web browser, then the attacker gets to leverage its communication channels to attack other processes, like the GPU driver and the compositor. This has the same sub-or-supercriticality dynamic.
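To make the mechanic concrete, here is a minimal C sketch (hypothetical struct and function names, and undefined behavior by construction) of how a single out-of-bounds write can corrupt an adjacent function pointer. The exact outcome depends on compiler, layout, and mitigations, but it illustrates how one false belief about a buffer’s size becomes control over a seemingly-unrelated code path.

```c
#include <stdio.h>
#include <string.h>

static void expected_handler(void)   { puts("expected code path"); }
static void unexpected_handler(void) { puts("attacker-chosen code path"); }

/* A fixed-size buffer sitting directly next to a function pointer. */
struct request {
    char name[8];
    void (*on_done)(void);
};

int main(void) {
    struct request req = { "ok", expected_handler };

    /* Build a 16-byte "input": 8 bytes of filler, then the address of a
     * function of the attacker's choosing. A real attacker would smuggle
     * such an address in via the program's normal input channels. */
    unsigned char payload[16];
    void (*target)(void) = unexpected_handler;
    memset(payload, 'A', sizeof payload);
    memcpy(payload + 8, &target, sizeof target);

    /* The bug: copying 16 bytes into an 8-byte buffer. This is undefined
     * behavior; on a typical unhardened 64-bit build it silently
     * overwrites req.on_done, and the "impossible" call below happens. */
    memcpy(req.name, payload, sizeof payload);

    req.on_done();
    return 0;
}
```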
Some security strategies try to ensure there are no entry points into the domain where there might be supercritically-expanding access: memory-safe languages, linters, code reviews. Call these entry-point strategies. Others try to drive down the criticality ratio: address space layout randomization, W^X, guard pages, stack guards, sandboxing. Call these mitigation strategies. In an AI-safety analogy, the entry-point strategies include things like decision theory, formal verification, and philosophical deconfusion; the mitigation strategies include things like neural-net transparency and ALBA.
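As an illustration of one mitigation strategy, here is a small POSIX C sketch of the W^X discipline: memory is mapped writable, filled, and only then flipped to executable, and a mapping that is writable and executable at the same time is never requested. On OpenBSD such a combined request is normally refused outright, and even the flip may require the binary to be specially marked; the page contents and size here are placeholders, not a real code generator.

```c
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 4096;  /* one page, as a placeholder size */

    /* Step 1: map the region writable but not executable, and fill it. */
    unsigned char *page = mmap(NULL, len, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANON, -1, 0);
    if (page == MAP_FAILED) { perror("mmap"); return 1; }
    memset(page, 0xC3, len);  /* placeholder contents, not real code */

    /* Step 2: before the region is ever used as code, drop write access
     * and add execute access. An attacker who later gains a write
     * primitive can no longer scribble over something the CPU will run. */
    if (mprotect(page, len, PROT_READ | PROT_EXEC) != 0) {
        perror("mprotect");  /* a strict W^X kernel may refuse even this */
        return 1;
    }

    /* The anti-pattern W^X exists to forbid: writable and executable at
     * once. On OpenBSD this mapping normally fails for ordinary binaries. */
    void *rwx = mmap(NULL, len, PROT_READ | PROT_WRITE | PROT_EXEC,
                     MAP_PRIVATE | MAP_ANON, -1, 0);
    printf("W|X mapping was %s\n", rwx == MAP_FAILED ? "refused" : "allowed");
    return 0;
}
```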
Computer security is still, in an important sense, a failure: reasonably determined and competent attackers usually succeed. But by the metric “market price of a working exploit chain”, things do actually seem to be getting better, and both categories of strategies seem to have helped: compared to a decade ago, it’s both more difficult to find a potentially-exploitable bug, and also more difficult to turn a potentially-exploitable bug into a working exploit.
Unfortunately, while there are a number of ideas that seem like mitigation strategies for AI safety, it’s not clear if there are any metrics nearly as good as “market price of an exploit chain”. Still, we can come up with some candidates—not candidates we can precisely define or measure, currently, but candidates we can think in terms of, and maybe think about measuring in the future, like: how much optimization pressure can be applied to concepts before perverse instantiations are found? How much control does an inner optimizer need to start with, in order to take over an outer optimization process? I don’t know how to increase these, but it seems like a potentially promising research direction.
It’s worth noticing that this is not a universal property of high-paranoia software development, but an unfortunate consequence of using the C programming language and of systems programming. In most programming languages and most application domains, crashes only rarely point to security problems.
I disagree. While C is indeed terribly unsafe, it is always the case that a safety-critical system exhibiting behaviour you thought impossible is a serious safety risk—because it means that your understanding of the system is wrong, and that includes the safety properties.