(Speaking just for myself in this comment, not the other authors)
I still feel like the comments on your post are pretty relevant, but to summarize my current position:
AIs that actively think about deceiving us (e.g. to escape human oversight of the compute cluster they are running on) come well before (in capability ordering, not necessarily calendar time) AIs that are free enough from human-imposed constraints and powerful enough in their effects on the world that they can wipe out humanity + achieve their goals without thinking about how to deal with humans.
In situations where there is some meaningful human-imposed constraint (e.g. the AI starts out running on a data center that humans can turn off), if you don’t think about deceiving humans at all, you choose plans that ask humans to help you with your undesirable goals, causing them to stop you. So, in these situations, x-risk stories require deception.
It seems kinda unlikely that even an AI free from human-imposed constraints like off switches doesn’t think about humans at all. For example, it probably needs to think about other AI systems that might oppose it, including the possibility that humans build such other AI systems (which is best intervened on by ensuring the humans don’t build those AI systems).
Responding to this in particular:
The key thing I’m pointing to here is that the consequentialist power-seeking deception story has a bunch of extra assumptions in it, and we still get a disaster with those assumptions relaxed, so naively it seems like we should assign more probability to a story with fewer assumptions.
The least conjunctive story for doom is “doom happens”. Obviously this is not very useful. We need more details in order to find solutions. When adding an additional concrete detail, you generally want that detail to (a) capture lots of probability mass and (b) provide some angle of attack for solutions.
For (a): based on the points above I’d guess maybe 20:1 odds on “x-risk via misalignment with explicit deception” : “x-risk via misalignment without explicit deception” in our actual world. (Obviously “x-risk via misalignment” is going to be the sum of these and so higher than each one individually.)
For (b): the “explicit deception” detail is particularly useful to get an angle of attack on the problem. It allows us to assume that the AI “knows” that the thing it is doing is not what its designers intended, which suggests that what we need to do to avoid this class of scenarios is to find some way of getting that knowledge out of the AI system (rather than, say, solving all of human values and imbuing it into the AI).
One response is “but even if you solve the explicit deception case, then you just get x-risk via misalignment without explicit deception, so you didn’t actually save any worlds”. My response would be that P(x-risk via misalignment without explicit deception | no x-risk via misalignment with explicit deception) seems pretty low to me. But that seems like the main way someone could change my mind here.
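Spelling out the arithmetic behind the 20:1 figure in (a), since it’s doing some work here (this is just the mechanical odds-to-probability conversion, not an extra claim):

$$P(\text{explicit deception} \mid \text{x-risk via misalignment}) = \tfrac{20}{21} \approx 0.95, \qquad P(\text{no explicit deception} \mid \text{x-risk via misalignment}) = \tfrac{1}{21} \approx 0.05.$$

(The conditional in the previous paragraph, P(x-risk via misalignment without explicit deception | no x-risk via misalignment with explicit deception), is a separate quantity and isn’t pinned down by these odds.)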
Two probable cruxes here...
First probable crux: at this point, I think one of my biggest cruxes with a lot of people is that I expect the capability level required to wipe out humanity, or at least permanently de-facto disempower humanity, is not that high. I expect that an AI which is to a +3sd intelligence human as a +3sd intelligence human is to a −2sd intelligence human would probably suffice, assuming copying the AI is much cheaper than building it. (Note: I’m using “intelligence” here to point to something including ability to “actually try” as opposed to symbolically “try”, effective mental habits, etc, not just IQ.) If copying is sufficiently cheap relative to building, I wouldn’t be surprised if something within the human distribution would suffice.
Central intuition driver there: imagine the difference in effectiveness between someone who responds to a law they don’t like by organizing a small protest at their university, vs someone who responds to a law they don’t like by figuring out which exact bureaucrat is responsible for implementing that law and making a case directly to that person, or by finding some relevant case law and setting up a lawsuit to limit the disliked law. (That’s not even my mental picture of −2sd vs +3sd; I’d think that’s more like +1sd vs +3sd. A −2sd usually just reposts a few memes complaining about the law on social media, if they manage to do anything at all.) Now imagine an intelligence which is as much more effective than the “find the right bureaucrat/case law” person, as the “find the right bureaucrat/case law” person is compared to the “protest” person.
Second probable crux: there are two importantly different notions of “thinking about humans” or “thinking about deceiving humans” here. In the prototypical picture of a misaligned mesaoptimizer deceiving humans for strategic reasons, the AI explicitly backchains from its goal, concludes that humans will shut it down if it doesn’t hide its intentions, and therefore explicitly acts to conceal its true intentions. But when the training process contains direct selection pressure for deception (as in RLHF), we should expect to see a different phenomenon: an intelligence with hard-coded, not-necessarily-“explicit” habits/drives/etc which de-facto deceive humans. For example, think about how humans most often deceive other humans: we do it mainly by deceiving ourselves, reframing our experiences and actions in ways which make us look good and then presenting that picture to others. Or, we instinctively behave in more prosocial ways when people are watching than when not, even without explicitly thinking about it. That’s the sort of thing I expect to happen in AI, especially if we explicitly train with something like RLHF (and even moreso if we pass a gradient back through deception-detecting interpretability tools).
Is that “explicit deception”? I dunno, it seems like “explicit deception” is drawing the wrong boundary. But when that sort of deception happens, I wouldn’t necessarily expect to be able to see deception in an AI’s internal thoughts. It’s not that it’s “thinking about deceiving humans”, so much as “thinking in ways which are selected for deceiving humans”.
(Note that this is a different picture from the post you linked; I consider this picture more probable to be a problem sooner, though both are possibilities I keep in mind.)
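To make the parenthetical about deception-detecting interpretability tools concrete, here’s a minimal toy sketch of what “pass a gradient back through” could look like in training code. Everything here (the tiny policy, the frozen reward model, the frozen probe, the penalty weight) is a hypothetical stand-in of my own, not a method from this thread, and a real setup would use a proper RLHF objective rather than this simplified update:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

policy = nn.Sequential(nn.Linear(8, 32), nn.Tanh())   # toy stand-in for the model's internals
reward_model = nn.Linear(32, 1)                       # stand-in for a learned (frozen) RLHF reward model
deception_probe = nn.Linear(32, 1)                    # hypothetical (frozen) deception-detecting probe

opt = torch.optim.Adam(policy.parameters(), lr=1e-3)  # only the policy gets updated
lam = 0.1                                             # assumed weight on the deception penalty

for step in range(100):
    obs = torch.randn(64, 8)                                  # toy "observations"
    acts = policy(obs)                                        # internal activations
    reward = reward_model(acts).mean()                        # reward as judged by the learned model
    deception = torch.sigmoid(deception_probe(acts)).mean()   # probe's "looks deceptive" score
    loss = -(reward - lam * deception)                        # maximize reward, minimize detected deception
    opt.zero_grad()
    loss.backward()                                           # gradient flows back through the probe
    opt.step()

# The selection pressure lands on whatever the probe actually measures, not on
# deception per se: training pushes toward internals that don't register as
# deceptive to this probe, i.e. "thinking in ways selected for deceiving humans"
# rather than "thinking about deceiving humans".
```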
First probable crux: at this point, I think one of my biggest cruxes with a lot of people is that I expect the capability level required to wipe out humanity, or at least permanently de-facto disempower humanity, is not that high. I expect that an AI which is to a +3sd intelligence human as a +3sd intelligence human is to a −2sd intelligence human would probably suffice, assuming copying the AI is much cheaper than building it.
This sounds roughly right to me, but I don’t see why this matters to our disagreement?
For example, think about how humans most often deceive other humans: we do it mainly by deceiving ourselves, reframing our experiences and actions in ways which make us look good and then presenting that picture to others. Or, we instinctively behave in more prosocial ways when people are watching than when not, even without explicitly thinking about it. That’s the sort of thing I expect to happen in AI, especially if we explicitly train with something like RLHF (and even moreso if we pass a gradient back through deception-detecting interpretability tools).
This also sounds plausible to me (though it isn’t clear to me how exactly doom happens). For me the relevant question is “could we reasonably hope to notice the bad things by analyzing the AI and extracting its knowledge”, and I think the answer is still yes.
I maybe want to stop saying “explicitly thinking about it” (which brings up associations of conscious vs subconscious thought, and makes it sound like I only mean that “conscious thoughts” have deception in them) and instead say that “the AI system at some point computes some form of ‘reason’ that the deceptive action would be better than the non-deceptive action, and this then leads further computation to take the deceptive action instead of the non-deceptive action”.
I don’t quite agree with that as literally stated; a huge part of intelligence is finding heuristics which allow a system to avoid computing anything about worse actions in the first place (i.e. just ruling worse actions out of the search space). So it may not actually compute anything about a non-deceptive action.
But unless that distinction is central to what you’re trying to point to here, I think I basically agree with what you’re gesturing at.
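A toy sketch of that distinction (the actions and scores are made up purely for illustration, not a model of a real planner): in the first planner a reason for deception is explicitly computed, while in the second a habitual heuristic prunes the honest option before anything is computed about it.

```python
def score(action):
    # stand-in utility; the specific numbers are arbitrary
    return {"conceal_intentions": 1.0, "reveal_intentions": -1.0, "do_nothing": 0.0}[action]

candidates = ["conceal_intentions", "reveal_intentions", "do_nothing"]

# Explicit comparison: the deceptive action wins *because* it scored higher than
# the honest one, so a "reason" for deception is computed somewhere in the system.
explicit_choice = max(candidates, key=score)

# Heuristic pruning: a learned habit rules out revealing intentions up front, so
# nothing about the honest action is ever computed or compared.
pruned = [a for a in candidates if a != "reveal_intentions"]
habitual_choice = max(pruned, key=score)

print(explicit_choice, habitual_choice)  # both land on "conceal_intentions"
```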
But unless that distinction is central to what you’re trying to point to here
Yeah, I don’t think it’s central (and I agree that heuristics that rule out parts of the search space are very useful and we should expect them to arise).
think about how humans most often deceive other humans: we do it mainly by deceiving ourselves… when that sort of deception happens, I wouldn’t necessarily expect to be able to see deception in an AI’s internal thoughts
The fact that humans will give different predictions when forced to make an explicit bet versus just casually talking seems to imply that it’s theoretically possible to identify deception, even in cases of self-deception.
Of course! I don’t intend to claim that it’s impossible-in-principle to detect this sort of thing. But if we’re expecting “thinking in ways which are selected for deceiving humans”, then we need to look for different (and I’d expect more general) things than if we’re just looking for “thinking about deceiving humans”.
(Though, to be clear, it does not seem like any current prosaic alignment work is on track to do either of those things.)
This. Without the really high capability levels afforded by quantum/reversible computers, which is an additional assumption, you can’t really win without explicitly modeling humans and deceiving them.