“As you take the system and make it vastly superintelligent, your primary focus needs to be on security from adversarial forces, rather than primarily on making something that’s useful.”
I agree if you assume a discrete action that simply causes the system to become vastly superintelligent. But we can try not to get to powerful adversarial optimization in the first place; if that never happens then you never need the security.
I certainly agree that in the presence of powerful adversarial optimizers, you need security to get your system to do what you want. However, we can just not build powerful adversarial optimizers. My preferred solution is to make sure our AI systems are trying to do what we want, so that they never become adversarial in the first place. But if for some reason we can’t do that, then we could make sure AI systems don’t become too powerful, or not build them at all. It seems very weird to instead say “well, the AI system is going to be adversarial and way more powerful, let’s figure out how to make it secure”—that should be the last approach, if none of the other approaches work out.
The latter summary in particular sounds superficially like Eliezer’s proposed approach, except that he doesn’t think it’s easy in the AGI regime to “just not build powerful adversarial optimizers” (and if he suspected this was easy, he wouldn’t want to build in the assumption that it’s easy as a prerequisite for a safety approach working; he would want a safety approach that’s robust to the scenario where it’s easy to accidentally end up with vastly more quality-adjusted optimization than intended).
The “do alignment in a way that doesn’t break if capability gain suddenly speeds up” approach, or at least Eliezer’s version of that approach, similarly emphasizes “you’re screwed (in the AGI regime) if you build powerful adversarial optimizers, and it’s a silly idea to do that in the first place, so just don’t do it, ever, in any context”. From AI Safety Mindset:
Niceness as the first line of defense / not relying on defeating a superintelligent adversary
[...] Paraphrasing Schneier, we might say that there’s three kinds of security in the world: Security that prevents your little brother from reading your files, security that prevents major governments from reading your files, and security that prevents superintelligences from getting what they want. We can then go on to remark that the third kind of security is unobtainable, and even if we had it, it would be very hard for us to know we had it. Maybe superintelligences can make themselves knowably secure against other superintelligences, but we can’t do that and know that we’ve done it.
[...] The final component of an AI safety mindset is one that doesn’t have a strong analogue in traditional computer security, and it is the rule of not ending up facing a transhuman adversary in the first place. The winning move is not to play. Much of the field of value alignment theory is about going to any length necessary to avoid needing to outwit the AI.
In AI safety, the first line of defense is an AI that does not want to hurt you. If you try to put the AI in an explosive-laced concrete bunker, that may or may not be a sensible and cost-effective precaution in case the first line of defense turns out to be flawed. But the first line of defense should always be an AI that doesn’t want to hurt you or avert your other safety measures, rather than the first line of defense being a clever plan to prevent a superintelligence from getting what it wants.
A special case of this mindset applied to AI safety is the Omni Test—would this AI hurt us (or want to defeat other safety measures) if it were omniscient and omnipotent? If it would, then we’ve clearly built the wrong AI, because we are the ones laying down the algorithm and there’s no reason to build an algorithm that hurts us period. If an agent design fails the Omni Test desideratum, this means there are scenarios that it prefers over the set of all scenarios we find acceptable, and the agent may go searching for ways to bring about those scenarios.
If the agent is searching for possible ways to bring about undesirable ends, then we, the AI programmers, are already spending computing power in an undesirable way. We shouldn’t have the AI running a search that will hurt us if it comes up positive, even if we expect the search to come up empty. We just shouldn’t program a computer that way; it’s a foolish and self-destructive thing to do with computing power. Building an AI that would hurt us if omnipotent is a bug for the same reason that a NASA probe crashing if all seven other planets line up would be a bug—the system just isn’t supposed to behave that way period; we should not rely on our own cleverness to reason about whether it’s likely to happen.
Suppose your AI suddenly became omniscient and omnipotent—suddenly knew all facts and could directly ordain any outcome as a policy option. Would the executing AI code lead to bad outcomes in that case? If so, why did you write a program that in some sense ‘wanted’ to hurt you and was only held in check by lack of knowledge and capability? Isn’t that a bad way for you to configure computing power? Why not write different code instead?
The Omni Test is that an advanced AI should be expected to remain aligned, or not lead to catastrophic outcomes, or fail safely, even if it suddenly knows all facts and can directly ordain any possible outcome as an immediate choice. The policy proposal is that, among agents meant to act in the rich real world, any predicted behavior where the agent might act destructively if given unlimited power (rather than e.g. pausing for a safe user query) should be treated as a bug.
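One way to read the Omni Test is as a check on the agent's preference ordering rather than on its current capabilities. The sketch below is only a toy illustration and is not taken from the quoted article: the outcome names, the `utility` table, and the `passes_omni_test` helper are all hypothetical. It asks whether the outcome the agent would pick with unrestricted choice lies inside the set we find acceptable.

```python
def omnipotent_choice(utility):
    """Outcome the agent would pick if it could ordain any outcome directly."""
    return max(utility, key=utility.get)


def passes_omni_test(utility, acceptable):
    """True if the agent's top choice over ALL outcomes is one we accept."""
    return omnipotent_choice(utility) in acceptable


# Hypothetical toy agent: its utility table ranks an unacceptable outcome highest.
utility = {
    "ask_user": 1.0,
    "finish_task": 2.0,
    "seize_resources": 5.0,
}
acceptable = {"ask_user", "finish_task"}

# A capability limit only hides the problem: restricted to what is feasible
# today the agent looks fine, but its preferences still point somewhere bad.
feasible_today = {k: v for k, v in utility.items() if k != "seize_resources"}
looks_fine_today = omnipotent_choice(feasible_today) in acceptable
fails_when_unbounded = not passes_omni_test(utility, acceptable)
```

In this toy setup the agent behaves acceptably only because the bad option is out of reach, which is the configuration the quoted passage says should be treated as a bug.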
[...] No aspect of the AI’s design should ever put us in an adversarial position vis-a-vis the AI, or pit the AI’s wits against our wits. If a computation starts looking for a way to outwit us, then the design and methodology has already failed. We just shouldn’t be putting an AI in a box and then having the AI search for ways to get out of the box. If you’re building a toaster, you don’t build one element that heats the toast and then add a tiny refrigerator that cools down the toast.
Note that on my model, the kind of paranoia Eliezer is pointing to with “AI safety mindset” or security mindset is something he believes you need in order to prevent adversarialness and the other bad byproducts of “your system devotes large amounts of thought to things and thinks in really weird ways”. It’s not just (or even primarily) a fallback measure to keep you safe on the off chance your system does generate a powerful adversary. Quoting Nate:
Lastly, alignment looks difficult for the same reason computer security is difficult: systems need to be robust to intelligent searches for loopholes.
Suppose you have a dozen different vulnerabilities in your code, none of which is itself fatal or even really problematic in ordinary settings. Security is difficult because you need to account for intelligent attackers who might find all twelve vulnerabilities and chain them together in a novel way to break into (or just break) your system. Failure modes that would never arise by accident can be sought out and exploited; weird and extreme contexts can be instantiated by an attacker to cause your code to follow some crazy code path that you never considered.
A similar sort of problem arises with AI. The problem I’m highlighting here is not that AI systems might act adversarially: AI alignment as a research program is all about finding ways to prevent adversarial behavior before it can crop up. We don’t want to be in the business of trying to outsmart arbitrarily intelligent adversaries. That’s a losing game.
The parallel to cryptography is that in AI alignment we deal with systems that perform intelligent searches through a very large search space, and which can produce weird contexts that force the code down unexpected paths. This is because the weird edge cases are places of extremes, and places of extremes are often the place where a given objective function is optimized. Like computer security professionals, AI alignment researchers need to be very good at thinking about edge cases.
It’s much easier to make code that works well on the path that you were visualizing than to make code that works on all the paths that you weren’t visualizing. AI alignment needs to work on all the paths you weren’t visualizing.
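The point about extremes can be made concrete with a toy search problem. This is a minimal sketch under made-up assumptions: `proxy_score`, `true_value`, and the candidate ranges are hypothetical stand-ins, chosen so that the objective being optimized agrees with what we care about on ordinary inputs and diverges only far from them.

```python
def proxy_score(x):
    """What the optimizer is told to maximize (hypothetical stand-in objective)."""
    return x


def true_value(x):
    """What we actually care about; agrees with the proxy only for x <= 10."""
    return x if x <= 10 else 10 - 5 * (x - 10)


def optimize(candidates):
    """Exhaustive search: return the candidate with the highest proxy score."""
    return max(candidates, key=proxy_score)


# The path we were visualizing: a small space of ordinary inputs.
weak_choice = optimize(range(0, 9))
# A stronger search over a much larger space lands on the extreme point.
strong_choice = optimize(range(0, 101))

weak_outcome = true_value(weak_choice)      # proxy and true value agree here
strong_outcome = true_value(strong_choice)  # they come apart at the extreme
```

Nothing here is adversarial. The same objective and the same search procedure produce a fine result over the small space and a bad one over the large space, simply because the maximum sits at the edge where the two functions disagree.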
Scott Garrabrant mentioned to me at one point that he thought Optimization Amplifies distills a (maybe the?) core idea in Security Mindset and Ordinary Paranoia. The problem comes from “lots of weird, extreme-state-instantiating, loophole-finding optimization”, not from “lots of adversarial optimization” (even though the latter is a likely consequence of getting things badly wrong with the former).
Eliezer models most of the difficulty (and most of the security-relatedness) of the alignment problem as lying in ‘get ourselves to a place where in fact our systems don’t end up as powerful adversarial optimizers’, rather than (a) treating this as a gimme and focusing on what we should do absent such optimizers, or (b) treating the presence of adversarial optimization as inevitable and asking how to manage it.
I think this idea (“avoiding generating powerful adversarial optimizers is an enormous constraint and requires balancing on a knife’s edge between disaster and irrelevance”) is also behind the view that system safety largely comes from things like “the system can’t think about any topics, or try to solve any cognitive problems, other than the ones we specifically want it to”, vs. Rohin’s “the system is trying to do what we want”.
Tbc, I wasn’t modeling Eliezer / Nate / MIRI as saying “there will be powerful adversarial optimization, and so we need security”—it is in fact quite clear that we’re all aiming for “no powerful adversarial optimization in the first place”. I was responding to the arguments in this post.
(and if he suspected this was easy, he wouldn’t want to build in the assumption that it’s easy as a prerequisite for a safety approach working; he’d want a safety approach that’s robust to the scenario where it’s easy to accidentally end up with vastly more quality-adjusted optimization than intended).
I agree that’s desirable all else equal, but such an approach would likely require more time and effort (a lot more time and effort on my model). It’s an empirical question whether we have that time / effort to spare (and also perhaps it’s better to get AGI earlier to e.g. reduce other x-risks, in which case the marginal safety from not relying on the assumption may not be worth it).
(I mentioned coordination on not building AGI above—I think that might be feasible if the “global epistemic state” was that building AGI is likely to kill us all, but seems quite infeasible if our epistemic state is “everything we know suggests this will work, but it could fail if we somehow end up with more optimization”.)
Note that on my model, the kind of paranoia Eliezer is pointing to with “AI safety mindset” or security mindset is something he believes you need in order to prevent adversarialness and the other bad byproducts of “your system devotes large amounts of thought to things and thinks in really weird ways”.
[...]
The parallel to cryptography is that in AI alignment we deal with systems that perform intelligent searches through a very large search space, and which can produce weird contexts that force the code down unexpected paths. This is because the weird edge cases are places of extremes, and places of extremes are often the place where a given objective function is optimized. Like computer security professionals, AI alignment researchers need to be very good at thinking about edge cases.
It’s much easier to make code that works well on the path that you were visualizing than to make code that works on all the paths that you weren’t visualizing. AI alignment needs to work on all the paths you weren’t visualizing.
[...]
The problem comes from “lots of weird, extreme-state-instantiating, loophole-finding optimization”
Thanks, this is helpful for understanding MIRI’s position better. (I probably should have figured it out from Nate and Scott’s posts, but I don’t think I actually did.)
Broadly, my hope is that we actually see non-existentially-catastrophic failures caused by AIs going down “the paths you weren’t visualizing”, and this causes you to start visualizing the path. Obviously all else equal it’s better if we visualize it in the first place.
I think I also have a different picture of what powerful optimization will look like—the paperclip maximizer doesn’t seem like a good model for the sort of thing we’re likely to build. An approach based on some explicitly represented goal is going to be dead in the water well before it becomes even human-level intelligent, because it will ignore “common sense rules” again and again (cf. most specification gaming examples). Instead, our AI systems are going to need to understand common sense rules somehow, and the resulting system is not going to look like it’s ruthlessly pursuing some simple goal.
For example, the resulting system may be more accurately modeled as having uncertainty about the goal (whether explicitly represented or internally learned). Weird + extreme states tend to only be good for a few goals, and so would not be good choices if you’re uncertain about the goal. In addition, if our AI systems are learning from our conventions, then they will likely pick up our risk aversion, which also tends to prevent weird + extreme states.
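The goal-uncertainty claim can be illustrated with a small expected-utility calculation. This is a toy sketch with invented numbers, not evidence about real systems: the `UTILITY` table, the state names, and the belief distributions are all hypothetical. Each extreme state is very good for exactly one candidate goal and bad for the others, while the moderate state is mildly good for all of them.

```python
# Hypothetical utilities: rows are candidate goals, columns are world states.
UTILITY = {
    "goal_a": {"moderate": 5, "extreme_a": 10, "extreme_b": -20},
    "goal_b": {"moderate": 5, "extreme_a": -20, "extreme_b": 10},
    "goal_c": {"moderate": 5, "extreme_a": -20, "extreme_b": -20},
}
STATES = ["moderate", "extreme_a", "extreme_b"]


def expected_utility(state, belief):
    """Average utility of a state, weighted by the probability of each goal."""
    return sum(p * UTILITY[goal][state] for goal, p in belief.items())


def best_state(belief):
    """State with the highest expected utility under the given belief."""
    return max(STATES, key=lambda s: expected_utility(s, belief))


certain = {"goal_a": 1.0}
uncertain = {"goal_a": 0.5, "goal_b": 0.25, "goal_c": 0.25}

choice_when_certain = best_state(certain)
choice_when_uncertain = best_state(uncertain)
```

With full confidence in `goal_a` the agent picks the matching extreme state; with probability spread across goals it picks the moderate one. How far this carries over depends on whether real utility tables look anything like this one, which the sketch simply assumes.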
Finally, it seems like there’s a broad basin of corrigibility, that prevents weird + extreme states that humans would rate as bad. It’s not hard to figure out that humans don’t want to die, so any weird + extreme state you create has to respect that constraint. And there are many other such easy-to-learn constraints.
Eliezer strongly believes that discrete jumps will happen
to
Eliezer believes you should generally not be making assumptions like “Oh I’m sure discrete jumps in capabilities won’t happen”, for the same reasons a security expert would not accept anything of the general form “Oh I’m sure nothing like that will ever happen” as a reasoning step in making sure a system is secure.
I mean, I guess it’s obvious if you’ve read the security mindset dialogue, but I hadn’t realised that was a central element to the capabilities gain debate.
Added: To clarify further: Eliezer has said that explicitly a few times, but only now did I realise it was potentially a deep crux of the broader disagreement between approaches. I thought it was just a helpful but not especially key example of not taking assumptions about AI systems.
Eliezer: [...] I think that artificial general intelligence capabilities, once they exist, are going to scale too fast for that to be a useful way to look at the problem. AlphaZero going from 0 to 120 mph in four hours or a day—that is not out of the question here. And even if it’s a year, a year is still a very short amount of time for things to scale up.
[...] I’d say this is a thesis of capability gain. This is a thesis of how fast artificial general intelligence gains in power once it starts to be around, whether we’re looking at 20 years (in which case this scenario does not happen) or whether we’re looking at something closer to the speed at which Go was developed (in which case it does happen) or the speed at which AlphaZero went from 0 to 120 and better-than-human (in which case there’s a bit of an issue that you better prepare for in advance, because you’re not going to have very long to prepare for it once it starts to happen).
[...] Why do I think that? It’s not that simple. I mean, I think a lot of people who see the power of intelligence will already find that pretty intuitive, but if you don’t, then you should read my paper Intelligence Explosion Microeconomics about returns on cognitive reinvestment. It goes through things like the evolution of human intelligence and how the logic of evolutionary biology tells us that when human brains were increasing in size, there were increasing marginal returns to fitness relative to the previous generations for increasing brain size. Which means that it’s not the case that as you scale intelligence, it gets harder and harder to buy. It’s not the case that as you scale intelligence, you need exponentially larger brains to get linear improvements.
At least something slightly like the opposite of this is true; and we can tell this by looking at the fossil record and using some logic, but that’s not simple.
Sam: Comparing ourselves to chimpanzees works. We don’t have brains that are 40 times the size or 400 times the size of chimpanzees, and yet what we’re doing—I don’t know what measure you would use, but it exceeds what they’re doing by some ridiculous factor.
Eliezer: And I find that convincing, but other people may want additional details.
[...] AlphaZero seems to me like a genuine case in point. That is showing us that capabilities that in humans require a lot of tweaking and that human civilization built up over centuries of masters teaching students how to play Go, and that no individual human could invent in isolation… [...] AlphaZero blew past all of that in less than a day, starting from scratch, without looking at any of the games that humans played, without looking at any of the theories that humans had about Go, without looking at any of the accumulated knowledge that we had, and without very much in the way of special-case code for Go rather than chess—in fact, zero special-case code for Go rather than chess. And that in turn is an example that refutes another thesis about how artificial general intelligence develops slowly and gradually, which is: “Well, it’s just one mind; it can’t beat our whole civilization.”
I would say that there’s a bunch of technical arguments which you walk through, and then after walking through these arguments you assign a bunch of probability, maybe not certainty, to artificial general intelligence that scales in power very fast—a year or less. And in this situation, if alignment is technically difficult, if it is easy to screw up, if it requires a bunch of additional effort—in this scenario, if we have an arms race between people who are trying to get their AGI first by doing a little bit less safety because from their perspective that only drops the probability a little; and then someone else is like, “Oh no, we have to keep up. We need to strip off the safety work too. Let’s strip off a bit more so we can get in the front.”—if you have this scenario, and by a miracle the first people to cross the finish line have actually not screwed up and they actually have a functioning powerful artificial general intelligence that is able to prevent the world from ending, you have to prevent the world from ending. You are in a terrible, terrible situation. You’ve got your one miracle. And this follows from the rapid capability gain thesis and at least the current landscape for how these things are developing.
The question is simply “Can we do cognition of this quality at all?” [...] The speed and quantity of cognition isn’t the big issue, getting to that quality at all is the question. Once you’re there, you can solve any problem which can realistically be done with non-exponentially-vast amounts of that exact kind of cognition.
Re:
And:
Omnipotence Test for AI Safety:
Non-Adversarial Principle:
Cf. the “X-and-only-X” problem.
Oh this is very interesting. I’m updating from
to
Eliezer also strongly believes that discrete jumps will happen. But the crux for him AFAIK is absolute capability and absolute speed of capability gain in AGI systems, not discontinuity per se (and not particular methods for improving capability, like recursive self-improvement). Hence in So Far: Unfriendly AI Edition Eliezer lists his key claims as:
(1) “Orthogonality thesis”,
(2) “Instrumental convergence”,
(3) “Rapid capability gain and large capability differences”,
(A) superhuman intelligence makes things break that don’t break at infrahuman levels,
(B) “you have to get [important parts of] the design right the first time”,
(C) “if something goes wrong at any level of abstraction, there may be powerful cognitive processes seeking out flaws and loopholes in your safety measures”, and the meta-level
(D) “these problems don’t show up in qualitatively the same way when people are pursuing their immediate incentives to get today’s machine learning systems working today”.
From Sam Harris’ interview of Eliezer (emphasis added):
See also: