Evaluation assistance, as mentioned in the post on AI-assisted human feedback, could help people avoid being fooled (e.g. in debate, where the opponent can point out how you’re being fooled). It’s still an open question how well that will work in practice and how quickly it will Goodhart (these techniques should fail on some things, as discussed in the ELK report), but it seems possible that models will be helpful enough on alignment before they Goodhart.
What are people’s timelines for deceptive alignment failures arising in models, relative to AI-based alignment research being useful?
Today’s language models are on track to become quite useful, without showing signs of deceptive misalignment or its eyebrow-raising prerequisites (e.g., awareness of the training procedure), afaik. So my current best guess is that we’ll be able to get useful alignment work from superhuman sub-deception agents for 5-10+ years or so. I’m very curious if others disagree here, though.
I personally have pretty broad error bars; I think it’s plausible enough that AI won’t help with automating alignment that it’s still valuable for us to work on alignment, and plausible enough that AI will help with automating alignment that it significantly increases our chances of survival and is worth preparing to make use of. I also tend to think that current progress in language modeling suggests that models will reach the point of being extremely helpful with alignment way before they become super scary.
Eliezer has consistently expressed confidence that AI systems smart enough to help with alignment will also be smart enough that they’ll inevitably be trying to kill you. I don’t think he’s really explained this view, and I’ve never found it particularly compelling. I think a lot of folks around LW have absorbed a similar view; I’m not totally sure how much it comes from Eliezer but I’d guess that’s a lot of it.
I think part of Eliezer’s view on this comes from a view of intelligence and recursive self-improvement that implies that explosive recursive self-improvement begins before high object-level competence on other research tasks. I think this view is most likely mistaken, but my guess is that it’s tied up with Eliezer’s views about how to build AGI closely enough that Eliezer won’t want to defend his position here.
(My position is the very naive one, that recursive self-improvement will become critical at roughly the same time that AI systems are better than humans at contributing to further AI progress, which has roughly a 50-50 shot of happening before alignment progress.)
Beyond that, Eliezer has not said very much about where these intuitions are coming from. What he has said does not seem (to me) to have fared particularly well over the last few years. For example:
Similar remarks apply to interpreting and answering “What will be its effect on _?” It turns out that getting an AI to understand human language is a very hard problem, and it may very well be that even though talking doesn’t feel like having a utility function, our brains are using consequential reasoning to do it. Certainly, when I write language, that feels like I’m being deliberate. It’s also worth noting that “What is the effect on X?” really means “What are the effects I care about on X?” and that there’s a large understanding-the-human’s-utility-function problem here.
In fact it does not seem hard to get AI systems to understand the relevant parts of human language (relative to being able to easily kill all humans or to inevitably be trying to kill all humans). And it does not seem hard to get an AI to predict which things you will judge to be relevant, well enough that this is a very bad way of explaining why Holden’s proposal would fail.
Of course getting an AI to tell you what it’s really thinking may be hard (and indeed I think it’s hard enough that I think there’s a significant probability that we will all die because we failed to solve it). And I think Eliezer even has a fair model of why it’s hard (or at least I’ve often defended him based on a more charitable reading of his overall views).
But my point is that to the extent Eliezer has explained why he thinks AI won’t be helpful until it’s too late, so far it doesn’t seem like adjacent intuitions have stood the test of time well.
Today’s language models are on track to become quite useful, without showing signs of deceptive misalignment...
facepalm
It won’t show signs of deceptive alignment. The entire point of “deception” is not showing signs. Unless it’s just really incompetent at deception, there won’t be signs; the lack of signs is not significant evidence of a lack of deception.
There may be other reasons to think our models are not yet deceptive to any significant extent (I certainly don’t think they are), but the lack of signs of deception is not one of them.
I understand that deceptive models won’t show signs of deception :) That’s why I remarked that models aren’t showing signs of the prerequisites to scary kinds of deception. Unless you think there are going to be no signs of deception or of any prerequisites, for any models before we get deceptive ones?
It also seems at least plausible that models will be imperfectly deceptive before they are perfectly deceptive, in which case we will see signs (e.g., in smaller models).
Not sure I buy this. I have a model of how hard it is to be deceptive, and how competent our current ML systems are, and it looks like it’s more like “as competent as a deceptive four-year-old” (my parents totally caught me when I told my first lie) than “as competent as a silver-tongued sociopath playing a long game.”
I do expect there to be signs of deceptive alignment, in a noticeable fashion before we get so-deceptive-we-don’t-notice deception.
That falls squarely under the “other reasons to think our models are not yet deceptive”—i.e. we have priors that we’ll see models which are bad at deception before models become good at deception. The important evidential work there is being done by the prior.
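To make the evidential point in this exchange concrete, here is a minimal Bayes-rule sketch. All numbers and names (the 0.1 prior, the two "no signs" likelihoods, `posterior_deceptive`) are hypothetical and chosen only for illustration; nothing here comes from the commenters. It compares how much "we saw no signs of deception" should move you when a deceptive model would be skilled at hiding versus clumsy at it.

```python
def posterior_deceptive(prior, p_no_signs_if_deceptive, p_no_signs_if_honest=1.0):
    """Bayes' rule for P(deceptive | no signs of deception observed).

    prior: assumed P(deceptive) before looking for signs (hypothetical).
    p_no_signs_if_deceptive: chance a deceptive model shows no signs.
    p_no_signs_if_honest: chance a non-deceptive model shows no signs.
    """
    numerator = prior * p_no_signs_if_deceptive
    denominator = numerator + (1.0 - prior) * p_no_signs_if_honest
    return numerator / denominator


prior = 0.1  # hypothetical prior, for illustration only

# Competent deceiver: almost never shows signs, so "no signs" is weak evidence.
skilled = posterior_deceptive(prior, p_no_signs_if_deceptive=0.99)

# Clumsy deceiver: usually gets caught, so "no signs" is real evidence.
clumsy = posterior_deceptive(prior, p_no_signs_if_deceptive=0.2)

print(f"prior={prior:.3f} skilled={skilled:.3f} clumsy={clumsy:.3f}")
```

Under these assumed numbers, the posterior in the skilled case barely differs from the prior, while in the clumsy case it drops substantially. How reassuring the lack of signs is therefore depends on one's prior about whether early deceptive models would be bad at deception, which is the disagreement in the thread.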
Your link redirects back to this page. The quote is from one of Eliezer’s comments in Reply to Holden on Tool AI.
Thanks, fixed.