There’s a few barriers which this runs into, but I’m going to talk about one particular barrier which seems especially intractable for the entire class of overseer-based strategies.
Suppose we take analogies like “second species” at face value for a minute, and consider species driven to extinction by humans (as an analogy for the extinction risk posed to humans by AI). How and why do humans drive species to extinction? In some cases the species is hunted to extinction, either because it’s a threat or because it’s economically profitable to hunt. But I would guess that in 99+% of cases, the humans drive a species to extinction because the humans are doing something that changes the species’ environment a lot, without specifically trying to keep the species alive. (Think DDT, deforestation, that sort of thing.)
Assuming this metaphor carries over to AI, what kind of extinction risk will AI pose?
Well, the extinction risk will not come from AI actively trying to kill the humans. The AI will just be doing some big thing which happens to involve changing the environment a lot (like making replicators or a lot of computronium or even just designing a fusion power generator), and then humans die as a side-effect. Collateral damage happens by default when something changes the environment in big ways.
What does this mean for oversight? Well, it means that there wouldn’t necessarily be any point at which the AI is actually thinking about killing humans or whatever. It just doesn’t think much about the humans at all, and then the humans get wrecked by side effects. In order for an overseer to raise an alarm, the overseer would have to figure out itself that the AI’s plans will kill the humans, i.e. the overseer would have to itself predict the consequences of a presumably-very-complicated plan.
Now, for killing humans specifically, you could maybe patch over this by e.g. explicitly asking the AI to think about whether its plans will kill lots of humans. But at this point your “non-general patch detector” should be going off; there are countless other ways the AIs plans could destroy massive amounts of human-value, and we are not in fact going to think to ask all the right questions ahead of time. And oversight will not be able to tell us which questions we didn’t think to ask, because the AI isn’t bothering to think those questions either (just like it didn’t bother to think about whether humans would die from its plans). And then we get into qualitatively different strategies-which-won’t-work like “ask the AI which questions we should ask”, whose failure modes are another topic.
The main thing I want to address with this research strategy is language models using reasoning that we would not approve of, which could run through convergent instrumental goals like self-preservation, goal-preservation, power-seeking, etc. It doesn’t seem to me that the failure mode you’ve described depends on the AI doing reasoning of which we wouldn’t approve. Even if this research direction were wildly successful, there would be many other failure modes for AI; I’m just trying to address this particularly pernicious one.
It’s possible that the transparency provided by authentic, externalized reasoning could also be useful for reducing other dangers associated with powerful AI, but that’s not the main thrust of the research direction I’m presenting here.
Thanks, this is a helpful comment—can you elaborate on why “ask the AI which questions we should ask” would fail (or point to relevant discussion)? I’m thinking that we would use many different models (not just the model doing the reasoning), including smaller ones, and trained or prompted in different ways, to catch generated text that would cause harmful side effects. We could have all of these models use externalized reasoning as well, to help aid in the supervision/oversight. This obviously doesn’t eliminate all of the risk, since all of the models can coordinate to cause catastrophic side effects; that said, I’d guess that coordination is much harder when you’re using many different models in different ways, and these models can only communicate via externalized reasoning
“Ask the AI which questions we should ask” is strategically isomorphic to “ask the AI what we should do”. It’s mainly a values question, and only works insofar as we’ve already thoroughly solved the alignment problem for the system at hand. Like, getting a good answer to “ask the AI what we should do” (or “ask the AI which questions we should ask”) is basically a framing of the whole alignment problem.
Do you already have cached understanding for why “ask the AI what we should do” doesn’t work? If not, and assuming you have already brainstormed on your own for at least an hour, List of Lethalities has multiple. (Indeed, IIRC the large majority of the failure modes on List of Lethalities apply.)
Yes, I have brainstormed about this question for quite a while, and I have read the List of Lethalities. I was mainly wondering if you had a specific concern with the proposal I outlined and/or additional concerns beyond that List.
The most relevant points from that list in this context I think are (34) and (35), that the agents competing against each other can coordinate in a way that doesn’t involve or benefit humanity. My main uncertainty in this context is how the different agents and supervisors could learn to coordinate, when they can only communicate via human-readable text. They could:
Use steganography to communicate. Here, I’m unsure on how they’d agree on a protocol for communication if the agent and supervisor don’t have shared weights. I’m sure there’s some loophole I’m missing due to lack of imagination
Use the text channel itself to communicate to other agents, risking getting caught by human overseers, but relying on the fact that there’s too much communication for us to directly monitor. Here, I can see this risk being mitigated by us using smaller models that are less likely to be power-seekers in overseeing the text communication between model for signs of coordination. The smaller models will be less capable but could e.g. be amplified with decomposition/debate
Curious if/why you think these strategies are likely to fail!
I don’t think those are the most relevant points (although I can see why they’re the most salient). I’d say sections B.2-B.3 are the real dealbreaker for most proposals. In particular:
18: There’s no reliable Cartesian-sensory ground truth about how good the AI’s answers are. There just isn’t a reliable training signal.
20: Human raters make systematic errors—regular, compactly describable, predictable errors. So we can’t use human ratings to figure out which outputs are good without selecting for outputs which exploit human biases.
27: Optimizing against an interpreted thought optimizes against interpretability. So whatever oversight we’re using, we can’t optimize against it; if the oversight shows the AI doing something bad, we can’t just give negative reward to those interpretable-bad-thoughts.
29: Human beings cannot inspect an AGI’s output to determine whether the consequences will be good. This is similar to 18 & 20; it means we don’t have a ground truth training signal for “good answers”. (My comment at the top of this thread essentially argues that this same problem also applies to the AI’s internal thoughts, not just its outputs.)
32, which talks about text-based oversight in particular.
The bottom line of all of these is: even setting aside inner alignment failures, we just do not have an outer training signal for reliably good answers.
Now, the usual way to get around that problem is to have the system mimic humans. Then, the argument goes, the system’s answers will be about as good as the humans would have figured out anyway, maybe given a bit more time. (Not a lot more time, because extrapolating too far into the future is itself failure-prone, since we push further and further out of the training distribution as we ask for hypothetical text from further and further into the future.) This is a fine plan, but inherently limited in terms of how much it buys us. It mostly works in worlds where we were already close to solving the problem ourselves, and the further we were from solving it, the less likely that the human-mimicking AI will close the gap. E.g. if I already have most of the pieces, then GPT-6 is much more likely to generate a workable solution when prompted to give “John Wentworth’s Solution To The Alignment Problem” from the year 2035, whereas if I don’t already have most of the pieces then that is much less likely to work. And that’s true even if I will eventually figure out the pieces; the problem is that the pieces I haven’t figured out yet aren’t in the training distribution. So mostly, the way to increase the chances of that strategy working is by getting closer to solving the problem ourselves.
I definitely agree with you that it’s insufficient to stamp out thoughts about actively harming humans. We also need the AI to positively value human life, safety, and freedom. But your “non-general patch detector” argument seems weak to me. We can provide lots of different examples of cases where the AI ought to be thinking about human welfare, do adversarial training on it, etc., and it seems plausible to me that eventually it would just generalize to caring about humans overall, in any situation. I don’t see why this is an especially hard generalization problem.
See List of Lethalities, numbers 21 and 22 (also the rest of section B.2, but especially those two). Unlike Eliezer, I do think there’s a nontrivial chance that your proposal here would work (it’s basically invoking Alignment by Default), but I think it’s a pretty small chance (like, ~10%), and Eliezer’s proposed failure modes are probably basically what actually happens at a high level.
I think Lethality 21 is largely true, and it’s a big reason why I’m concerned about the alignment problem in general. I’m not invoking Alignment by Default here because I think we do need to push hard on the actual cognitive processes happening in the agent, not just its actions/output like in prosaic ML. Externalized reasoning gives you a pretty good way of doing that.
I do think Lethality 22 is probably just false. Human values latch on to natural abstractions (!) and once you have the right ontology I don’t think they’re actually that complex. Language models are probably the most powerful tool we have for learning the human prior / ontology.
The good news for us is that this is a much more solvable problem, because under the assumption that the AI weakly cares about less powerful beings like us, then it’s asymptotically nice: With more technology, you merely need to emulate animals and us in a simulated reality via mind uploading, and presto, the problem is solved. In other words, capabilities helps us a lot more than in the case where the AI really does want to kill all humans.
There’s a few barriers which this runs into, but I’m going to talk about one particular barrier which seems especially intractable for the entire class of overseer-based strategies.
Suppose we take analogies like “second species” at face value for a minute, and consider species driven to extinction by humans (as an analogy for the extinction risk posed to humans by AI). How and why do humans drive species to extinction? In some cases the species is hunted to extinction, either because it’s a threat or because it’s economically profitable to hunt. But I would guess that in 99+% of cases, the humans drive a species to extinction because the humans are doing something that changes the species’ environment a lot, without specifically trying to keep the species alive. (Think DDT, deforestation, that sort of thing.)
Assuming this metaphor carries over to AI, what kind of extinction risk will AI pose?
Well, the extinction risk will not come from AI actively trying to kill the humans. The AI will just be doing some big thing which happens to involve changing the environment a lot (like making replicators or a lot of computronium or even just designing a fusion power generator), and then humans die as a side-effect. Collateral damage happens by default when something changes the environment in big ways.
What does this mean for oversight? Well, it means that there wouldn’t necessarily be any point at which the AI is actually thinking about killing humans or whatever. It just doesn’t think much about the humans at all, and then the humans get wrecked by side effects. In order for an overseer to raise an alarm, the overseer would have to figure out itself that the AI’s plans will kill the humans, i.e. the overseer would have to itself predict the consequences of a presumably-very-complicated plan.
Now, for killing humans specifically, you could maybe patch over this by e.g. explicitly asking the AI to think about whether its plans will kill lots of humans. But at this point your “non-general patch detector” should be going off; there are countless other ways the AIs plans could destroy massive amounts of human-value, and we are not in fact going to think to ask all the right questions ahead of time. And oversight will not be able to tell us which questions we didn’t think to ask, because the AI isn’t bothering to think those questions either (just like it didn’t bother to think about whether humans would die from its plans). And then we get into qualitatively different strategies-which-won’t-work like “ask the AI which questions we should ask”, whose failure modes are another topic.
The main thing I want to address with this research strategy is language models using reasoning that we would not approve of, which could run through convergent instrumental goals like self-preservation, goal-preservation, power-seeking, etc. It doesn’t seem to me that the failure mode you’ve described depends on the AI doing reasoning of which we wouldn’t approve. Even if this research direction were wildly successful, there would be many other failure modes for AI; I’m just trying to address this particularly pernicious one.
It’s possible that the transparency provided by authentic, externalized reasoning could also be useful for reducing other dangers associated with powerful AI, but that’s not the main thrust of the research direction I’m presenting here.
Thanks, this is a helpful comment—can you elaborate on why “ask the AI which questions we should ask” would fail (or point to relevant discussion)? I’m thinking that we would use many different models (not just the model doing the reasoning), including smaller ones, and trained or prompted in different ways, to catch generated text that would cause harmful side effects. We could have all of these models use externalized reasoning as well, to help aid in the supervision/oversight. This obviously doesn’t eliminate all of the risk, since all of the models can coordinate to cause catastrophic side effects; that said, I’d guess that coordination is much harder when you’re using many different models in different ways, and these models can only communicate via externalized reasoning
“Ask the AI which questions we should ask” is strategically isomorphic to “ask the AI what we should do”. It’s mainly a values question, and only works insofar as we’ve already thoroughly solved the alignment problem for the system at hand. Like, getting a good answer to “ask the AI what we should do” (or “ask the AI which questions we should ask”) is basically a framing of the whole alignment problem.
Do you already have cached understanding for why “ask the AI what we should do” doesn’t work? If not, and assuming you have already brainstormed on your own for at least an hour, List of Lethalities has multiple. (Indeed, IIRC the large majority of the failure modes on List of Lethalities apply.)
Yes, I have brainstormed about this question for quite a while, and I have read the List of Lethalities. I was mainly wondering if you had a specific concern with the proposal I outlined and/or additional concerns beyond that List.
The most relevant points from that list in this context I think are (34) and (35), that the agents competing against each other can coordinate in a way that doesn’t involve or benefit humanity. My main uncertainty in this context is how the different agents and supervisors could learn to coordinate, when they can only communicate via human-readable text. They could:
Use steganography to communicate. Here, I’m unsure on how they’d agree on a protocol for communication if the agent and supervisor don’t have shared weights. I’m sure there’s some loophole I’m missing due to lack of imagination
Use the text channel itself to communicate to other agents, risking getting caught by human overseers, but relying on the fact that there’s too much communication for us to directly monitor. Here, I can see this risk being mitigated by us using smaller models that are less likely to be power-seekers in overseeing the text communication between model for signs of coordination. The smaller models will be less capable but could e.g. be amplified with decomposition/debate
Curious if/why you think these strategies are likely to fail!
I don’t think those are the most relevant points (although I can see why they’re the most salient). I’d say sections B.2-B.3 are the real dealbreaker for most proposals. In particular:
18: There’s no reliable Cartesian-sensory ground truth about how good the AI’s answers are. There just isn’t a reliable training signal.
20: Human raters make systematic errors—regular, compactly describable, predictable errors. So we can’t use human ratings to figure out which outputs are good without selecting for outputs which exploit human biases.
27: Optimizing against an interpreted thought optimizes against interpretability. So whatever oversight we’re using, we can’t optimize against it; if the oversight shows the AI doing something bad, we can’t just give negative reward to those interpretable-bad-thoughts.
29: Human beings cannot inspect an AGI’s output to determine whether the consequences will be good. This is similar to 18 & 20; it means we don’t have a ground truth training signal for “good answers”. (My comment at the top of this thread essentially argues that this same problem also applies to the AI’s internal thoughts, not just its outputs.)
32, which talks about text-based oversight in particular.
The bottom line of all of these is: even setting aside inner alignment failures, we just do not have an outer training signal for reliably good answers.
Now, the usual way to get around that problem is to have the system mimic humans. Then, the argument goes, the system’s answers will be about as good as the humans would have figured out anyway, maybe given a bit more time. (Not a lot more time, because extrapolating too far into the future is itself failure-prone, since we push further and further out of the training distribution as we ask for hypothetical text from further and further into the future.) This is a fine plan, but inherently limited in terms of how much it buys us. It mostly works in worlds where we were already close to solving the problem ourselves, and the further we were from solving it, the less likely that the human-mimicking AI will close the gap. E.g. if I already have most of the pieces, then GPT-6 is much more likely to generate a workable solution when prompted to give “John Wentworth’s Solution To The Alignment Problem” from the year 2035, whereas if I don’t already have most of the pieces then that is much less likely to work. And that’s true even if I will eventually figure out the pieces; the problem is that the pieces I haven’t figured out yet aren’t in the training distribution. So mostly, the way to increase the chances of that strategy working is by getting closer to solving the problem ourselves.
I definitely agree with you that it’s insufficient to stamp out thoughts about actively harming humans. We also need the AI to positively value human life, safety, and freedom. But your “non-general patch detector” argument seems weak to me. We can provide lots of different examples of cases where the AI ought to be thinking about human welfare, do adversarial training on it, etc., and it seems plausible to me that eventually it would just generalize to caring about humans overall, in any situation. I don’t see why this is an especially hard generalization problem.
See List of Lethalities, numbers 21 and 22 (also the rest of section B.2, but especially those two). Unlike Eliezer, I do think there’s a nontrivial chance that your proposal here would work (it’s basically invoking Alignment by Default), but I think it’s a pretty small chance (like, ~10%), and Eliezer’s proposed failure modes are probably basically what actually happens at a high level.
I think Lethality 21 is largely true, and it’s a big reason why I’m concerned about the alignment problem in general. I’m not invoking Alignment by Default here because I think we do need to push hard on the actual cognitive processes happening in the agent, not just its actions/output like in prosaic ML. Externalized reasoning gives you a pretty good way of doing that.
I do think Lethality 22 is probably just false. Human values latch on to natural abstractions (!) and once you have the right ontology I don’t think they’re actually that complex. Language models are probably the most powerful tool we have for learning the human prior / ontology.
The good news for us is that this is a much more solvable problem, because under the assumption that the AI weakly cares about less powerful beings like us, then it’s asymptotically nice: With more technology, you merely need to emulate animals and us in a simulated reality via mind uploading, and presto, the problem is solved. In other words, capabilities helps us a lot more than in the case where the AI really does want to kill all humans.