Note that on my model, the kind of paranoia Eliezer is pointing to with “AI safety mindset” or security mindset is something he believes you need in order to prevent adversarialness and the other bad byproducts of “your system devotes large amounts of thought to things and thinks in really weird ways”. It’s not just (or even primarily) a fallback measure to keep you safe on the off chance your system does generate a powerful adversary. Quoting Nate:
Lastly, alignment looks difficult for the same reason computer security is difficult: systems need to be robust to intelligent searches for loopholes.
Suppose you have a dozen different vulnerabilities in your code, none of which is itself fatal or even really problematic in ordinary settings. Security is difficult because you need to account for intelligent attackers who might find all twelve vulnerabilities and chain them together in a novel way to break into (or just break) your system. Failure modes that would never arise by accident can be sought out and exploited; weird and extreme contexts can be instantiated by an attacker to cause your code to follow some crazy code path that you never considered.
A similar sort of problem arises with AI. The problem I’m highlighting here is not that AI systems might act adversarially: AI alignment as a research program is all about finding ways to prevent adversarial behavior before it can crop up. We don’t want to be in the business of trying to outsmart arbitrarily intelligent adversaries. That’s a losing game.
The parallel to cryptography is that in AI alignment we deal with systems that perform intelligent searches through a very large search space, and which can produce weird contexts that force the code down unexpected paths. This is because the weird edge cases are places of extremes, and places of extremes are often the place where a given objective function is optimized. Like computer security professionals, AI alignment researchers need to be very good at thinking about edge cases.
It’s much easier to make code that works well on the path that you were visualizing than to make code that works on all the paths that you weren’t visualizing. AI alignment needs to work on all the paths you weren’t visualizing.
Scott Garrabrant mentioned to me at one point that he thought Optimization Amplifies distills a (maybe the?) core idea in Security Mindset and Ordinary Paranoia. The problem comes from “lots of weird, extreme-state-instantiating, loophole-finding optimization”, not from “lots of adversarial optimization” (even though the latter is a likely consequence of getting things badly wrong with the former).
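To make the “optimization amplifies” point concrete, here is a minimal toy sketch (an illustration of my own, not taken from either post): a proxy objective that behaves sensibly on ordinary inputs but has a narrow spike at one weird, unintended input. A weak optimizer that only samples a few candidates essentially never notices the spike; a strong optimizer that searches the whole space lands on it every time.

```python
import random

# Hypothetical proxy objective, purely for illustration: sensible on ordinary
# inputs, with a narrow "loophole" at one extreme, unintended input.
def proxy_objective(x):
    intended_value = -abs(x - 10)                 # intended optimum near x = 10
    loophole_bonus = 1_000 if x == 98_765 else 0  # weird edge case
    return intended_value + loophole_bonus

SEARCH_SPACE_SIZE = 100_000

def weak_optimizer(objective, num_samples=100, seed=0):
    """Best of a few random candidates -- limited optimization power."""
    rng = random.Random(seed)
    candidates = [rng.randrange(SEARCH_SPACE_SIZE) for _ in range(num_samples)]
    return max(candidates, key=objective)

def strong_optimizer(objective):
    """Exhaustive search -- lots of optimization power."""
    return max(range(SEARCH_SPACE_SIZE), key=objective)

# The weak optimizer typically returns whichever ordinary sample scored best;
# the strong optimizer finds the loophole at x = 98,765 every time.
print("weak: ", weak_optimizer(proxy_objective))
print("strong:", strong_optimizer(proxy_objective))
```

Nothing in the sketch is adversarial; the loophole gets found simply because stronger search over a proxy pushes toward whatever extremes the proxy happens to reward.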
Eliezer models most of the difficulty (and most of the security-relatedness) of the alignment problem as lying in ‘get ourselves to a place where in fact our systems don’t end up as powerful adversarial optimizers’, rather than (a) treating this as a gimme and focusing on what we should do absent such optimizers, or (b) treating the presence of adversarial optimization as inevitable and asking how to manage it.
I think this idea (“avoiding generating powerful adversarial optimizers is an enormous constraint and requires balancing on a knife’s edge between disaster and irrelevance”) is also behind the view that system safety largely comes from things like “the system can’t think about any topics, or try to solve any cognitive problems, other than the ones we specifically want it to”, vs. Rohin’s “the system is trying to do what we want”.
To be clear, I wasn’t modeling Eliezer / Nate / MIRI as saying “there will be powerful adversarial optimization, and so we need security”—it is in fact quite clear that we’re all aiming for “no powerful adversarial optimization in the first place”. I was responding to the arguments in this post.
(and if he suspected this was easy, he wouldn’t want to build in the assumption that it’s easy as a prerequisite for a safety approach working; he’d want a safety approach that’s robust to the scenario where it’s easy to accidentally end up with vastly more quality-adjusted optimization than intended).
I agree that’s desirable all else equal, but such an approach would likely require more time and effort (a lot more time and effort on my model). It’s an empirical question whether we have that time / effort to spare (and also perhaps it’s better to get AGI earlier to e.g. reduce other x-risks, in which case the marginal safety from not relying on the assumption may not be worth it).
(I mentioned coordination on not building AGI above—I think that might be feasible if the “global epistemic state” was that building AGI is likely to kill us all, but seems quite infeasible if our epistemic state is “everything we know suggests this will work, but it could fail if we somehow end up with more optimization”.)
Note that on my model, the kind of paranoia Eliezer is pointing to with “AI safety mindset” or security mindset is something he believes you need in order to prevent adversarialness and the other bad byproducts of “your system devotes large amounts of thought to things and thinks in really weird ways”.
[...]
The parallel to cryptography is that in AI alignment we deal with systems that perform intelligent searches through a very large search space, and which can produce weird contexts that force the code down unexpected paths. This is because the weird edge cases are places of extremes, and places of extremes are often the place where a given objective function is optimized. Like computer security professionals, AI alignment researchers need to be very good at thinking about edge cases.
It’s much easier to make code that works well on the path that you were visualizing than to make code that works on all the paths that you weren’t visualizing. AI alignment needs to work on all the paths you weren’t visualizing.
[...]
The problem comes from “lots of weird, extreme-state-instantiating, loophole-finding optimization”
Thanks, this is helpful for understanding MIRI’s position better. (I probably should have figured it out from Nate and Scott’s posts, but I don’t think I actually did.)
Broadly, my hope is that we actually see non-existentially-catastrophic failures caused by AIs going down “the paths you weren’t visualizing”, and this causes you to start visualizing the path. Obviously all else equal it’s better if we visualize it in the first place.
I think I also have a different picture of what powerful optimization will look like—the paperclip maximizer doesn’t seem like a good model for the sort of thing we’re likely to build. An approach based on some explicitly represented goal is going to be dead in the water well before it becomes even human-level intelligent, because it will ignore “common sense rules” again and again (cf. most specification gaming examples). Instead, our AI systems are going to need to understand common sense rules somehow, and the resulting system is not going to look like it’s ruthlessly pursuing some simple goal.
For example, the resulting system may be more accurately modeled as having uncertainty about the goal (whether explicitly represented or internally learned). Weird + extreme states tend to only be good for a few goals, and so would not be good choices if you’re uncertain about the goal. In addition, if our AI systems are learning from our conventions, then they will likely pick up our risk aversion, which also tends to prevent weird + extreme states.
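As a toy illustration of the goal-uncertainty point (my own sketch, with made-up numbers): score a “moderate” state and an “extreme” state under many candidate goals, then average over the agent’s uncertainty about which goal is correct. The extreme state is superb under the handful of goals it was optimized for and poor under the rest, so the moderate state wins in expectation.

```python
import numpy as np

rng = np.random.default_rng(0)
num_goals = 1_000  # candidate goals the agent is uncertain between

# Hypothetical scores: the moderate state is decent under nearly every goal;
# the extreme state is great for a few goals and bad for all the others.
moderate_scores = rng.uniform(0.5, 0.8, size=num_goals)
extreme_scores = rng.uniform(0.0, 0.2, size=num_goals)
extreme_scores[:5] = 10.0  # the handful of goals the extreme state targets

# With a uniform distribution over candidate goals, the expected value of the
# moderate state (~0.65) beats that of the extreme state (~0.15).
print("moderate:", moderate_scores.mean())
print("extreme: ", extreme_scores.mean())
```

Risk aversion of the kind mentioned above would push in the same direction, since it penalizes states whose value swings wildly across the possibilities the agent is uncertain about.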
Finally, it seems like there’s a broad basin of corrigibility that prevents weird + extreme states that humans would rate as bad. It’s not hard to figure out that humans don’t want to die, so any weird + extreme state you create has to respect that constraint. And there are many other such easy-to-learn constraints.