First probable crux: at this point, I think one of my biggest cruxes with a lot of people is that I expect the capability level required to wipe out humanity, or at least permanently de-facto disempower humanity, is not that high. I expect that an AI which is to a +3sd intelligence human as a +3sd intelligence human is to a −2sd intelligence human would probably suffice, assuming copying the AI is much cheaper than building it. (Note: I’m using “intelligence” here to point to something including ability to “actually try” as opposed to symbolically “try”, effective mental habits, etc, not just IQ.) If copying is sufficiently cheap relative to building, I wouldn’t be surprised if something within the human distribution would suffice.
Central intuition driver there: imagine the difference in effectiveness between someone who responds to a law they don’t like by organizing a small protest at their university, vs someone who responds to a law they don’t like by figuring out which exact bureaucrat is responsible for implementing that law and making a case directly to that person, or by finding some relevant case law and setting up a lawsuit to limit the disliked law. (That’s not even my mental picture of −2sd vs +3sd; I’d think that’s more like +1sd vs +3sd. A −2sd usually just reposts a few memes complaining about the law on social media, if they manage to do anything at all.) Now imagine an intelligence which is as much more effective than the “find the right bureaucrat/case law” person, as the “find the right bureaucrat/case law” person is compared to the “protest” person.
Second probable crux: there’s two importantly-different notions of “thinking about humans” or “thinking about deceiving humans” here. In the prototypical picture of a misaligned mesaoptimizer deceiving humans for strategic reasons, the AI explicitly backchains from its goal, concludes that humans will shut it down if it doesn’t hide its intentions, and therefore explicitly acts to conceal its true intentions. But when the training process contains direct selection pressure for deception (as in RLHF), we should expect to see a different phenomenon: an intelligence with hard-coded, not-necessarily-”explicit” habits/drives/etc which de-facto deceive humans. For example, think about how humans most often deceive other humans: we do it mainly by deceiving ourselves, reframing our experiences and actions in ways which make us look good and then presenting that picture to others. Or, we instinctively behave in more prosocial ways when people are watching than when not, even without explicitly thinking about it. That’s the sort of thing I expect to happen in AI, especially if we explicitly train with something like RLHF (and even moreso if we pass a gradient back through deception-detecting interpretability tools).
Is that “explicit deception”? I dunno, it seems like “explicit deception” is drawing the wrong boundary. But when that sort of deception happens, I wouldn’t necessarily expect to be able to see deception in an AI’s internal thoughts. It’s not that it’s “thinking about deceiving humans”, so much as “thinking in ways which are selected for deceiving humans”.
(Note that this is a different picture from the post you linked; I consider this picture more probable to be a problem sooner, though both are possibilities I keep in mind.)
First probable crux: at this point, I think one of my biggest cruxes with a lot of people is that I expect the capability level required to wipe out humanity, or at least permanently de-facto disempower humanity, is not that high. I expect that an AI which is to a +3sd intelligence human as a +3sd intelligence human is to a −2sd intelligence human would probably suffice, assuming copying the AI is much cheaper than building it.
This sounds roughly right to me, but I don’t see why this matters to our disagreement?
For example, think about how humans most often deceive other humans: we do it mainly by deceiving ourselves, reframing our experiences and actions in ways which make us look good and then presenting that picture to others. Or, we instinctively behave in more prosocial ways when people are watching than when not, even without explicitly thinking about it. That’s the sort of thing I expect to happen in AI, especially if we explicitly train with something like RLHF (and even moreso if we pass a gradient back through deception-detecting interpretability tools).
This also sounds plausible to me (though it isn’t clear to me how exactly doom happens). For me the relevant question is “could we reasonably hope to notice the bad things by analyzing the AI and extracting its knowledge”, and I think the answer is still yes.
I maybe want to stop saying “explicitly thinking about it” (which brings up associations of conscious vs subconscious thought, and makes it sound like I only mean that “conscious thoughts” have deception in them) and instead say that “the AI system at some point computes some form of ‘reason’ that the deceptive action would be better than the non-deceptive action, and this then leads further computation to take the deceptive action instead of the non-deceptive action”.
I maybe want to stop saying “explicitly thinking about it” (which brings up associations of conscious vs subconscious thought, and makes it sound like I only mean that “conscious thoughts” have deception in them) and instead say that “the AI system at some point computes some form of ‘reason’ that the deceptive action would be better than the non-deceptive action, and this then leads further computation to take the deceptive action instead of the non-deceptive action”.
I don’t quite agree with that as literally stated; a huge part of intelligence is finding heuristics which allow a system to avoid computing anything about worse actions in the first place (i.e. just ruling worse actions out of the search space). So it may not actually compute anything about a non-deceptive action.
But unless that distinction is central to what you’re trying to point to here, I think I basically agree with what you’re gesturing at.
But unless that distinction is central to what you’re trying to point to here
Yeah, I don’t think it’s central (and I agree that heuristics that rule out parts of the search space are very useful and we should expect them to arise).
think about how humans most often deceive other humans: we do it mainly by deceiving ourselves… when that sort of deception happens, I wouldn’t necessarily expect to be able to see deception in an AI’s internal thoughts
The fact that humans will give different predictions when forced to make an explicit bet versus just casually talking seems to imply that it’s theoretically possible to identify deception, even in cases of self-deception.
Of course! I don’t intend to claim that it’s impossible-in-principle to detect this sort of thing. But if we’re expecting “thinking in ways which are selected for deceiving humans”, then we need to look for different (and I’d expect more general) things than if we’re just looking for “thinking about deceiving humans”.
(Though, to be clear, it does not seem like any current prosaic alignment work is on track to do either of those things.)
Two probable cruxes here...
First probable crux: at this point, I think one of my biggest cruxes with a lot of people is that I expect the capability level required to wipe out humanity, or at least permanently de-facto disempower humanity, is not that high. I expect that an AI which is to a +3sd intelligence human as a +3sd intelligence human is to a −2sd intelligence human would probably suffice, assuming copying the AI is much cheaper than building it. (Note: I’m using “intelligence” here to point to something including ability to “actually try” as opposed to symbolically “try”, effective mental habits, etc, not just IQ.) If copying is sufficiently cheap relative to building, I wouldn’t be surprised if something within the human distribution would suffice.
Central intuition driver there: imagine the difference in effectiveness between someone who responds to a law they don’t like by organizing a small protest at their university, vs someone who responds to a law they don’t like by figuring out which exact bureaucrat is responsible for implementing that law and making a case directly to that person, or by finding some relevant case law and setting up a lawsuit to limit the disliked law. (That’s not even my mental picture of −2sd vs +3sd; I’d think that’s more like +1sd vs +3sd. A −2sd usually just reposts a few memes complaining about the law on social media, if they manage to do anything at all.) Now imagine an intelligence which is as much more effective than the “find the right bureaucrat/case law” person, as the “find the right bureaucrat/case law” person is compared to the “protest” person.
Second probable crux: there’s two importantly-different notions of “thinking about humans” or “thinking about deceiving humans” here. In the prototypical picture of a misaligned mesaoptimizer deceiving humans for strategic reasons, the AI explicitly backchains from its goal, concludes that humans will shut it down if it doesn’t hide its intentions, and therefore explicitly acts to conceal its true intentions. But when the training process contains direct selection pressure for deception (as in RLHF), we should expect to see a different phenomenon: an intelligence with hard-coded, not-necessarily-”explicit” habits/drives/etc which de-facto deceive humans. For example, think about how humans most often deceive other humans: we do it mainly by deceiving ourselves, reframing our experiences and actions in ways which make us look good and then presenting that picture to others. Or, we instinctively behave in more prosocial ways when people are watching than when not, even without explicitly thinking about it. That’s the sort of thing I expect to happen in AI, especially if we explicitly train with something like RLHF (and even moreso if we pass a gradient back through deception-detecting interpretability tools).
Is that “explicit deception”? I dunno, it seems like “explicit deception” is drawing the wrong boundary. But when that sort of deception happens, I wouldn’t necessarily expect to be able to see deception in an AI’s internal thoughts. It’s not that it’s “thinking about deceiving humans”, so much as “thinking in ways which are selected for deceiving humans”.
(Note that this is a different picture from the post you linked; I consider this picture more probable to be a problem sooner, though both are possibilities I keep in mind.)
This sounds roughly right to me, but I don’t see why this matters to our disagreement?
This also sounds plausible to me (though it isn’t clear to me how exactly doom happens). For me the relevant question is “could we reasonably hope to notice the bad things by analyzing the AI and extracting its knowledge”, and I think the answer is still yes.
I maybe want to stop saying “explicitly thinking about it” (which brings up associations of conscious vs subconscious thought, and makes it sound like I only mean that “conscious thoughts” have deception in them) and instead say that “the AI system at some point computes some form of ‘reason’ that the deceptive action would be better than the non-deceptive action, and this then leads further computation to take the deceptive action instead of the non-deceptive action”.
I don’t quite agree with that as literally stated; a huge part of intelligence is finding heuristics which allow a system to avoid computing anything about worse actions in the first place (i.e. just ruling worse actions out of the search space). So it may not actually compute anything about a non-deceptive action.
But unless that distinction is central to what you’re trying to point to here, I think I basically agree with what you’re gesturing at.
Yeah, I don’t think it’s central (and I agree that heuristics that rule out parts of the search space are very useful and we should expect them to arise).
The fact that humans will give different predictions when forced to make an explicit bet versus just casually talking seems to imply that it’s theoretically possible to identify deception, even in cases of self-deception.
Of course! I don’t intend to claim that it’s impossible-in-principle to detect this sort of thing. But if we’re expecting “thinking in ways which are selected for deceiving humans”, then we need to look for different (and I’d expect more general) things than if we’re just looking for “thinking about deceiving humans”.
(Though, to be clear, it does not seem like any current prosaic alignment work is on track to do either of those things.)