I personally have pretty broad error bars; I think it’s plausible enough that AI won’t help with automating alignment that it’s still valuable for us to work on alignment, and plausible enough that AI will help with automating alignment that it significantly increases our chances of survival and is worth preparing to make use of. I also tend to think that current progress in language modeling suggests that models will reach the point of being extremely helpful with alignment way before they become super scary.
Eliezer has consistently expressed confidence that AI systems smart enough to help with alignment will also be smart enough that they’ll inevitably be trying to kill you. I don’t think he’s really explained this view, and I’ve never found it particularly compelling. I think a lot of folks around LW have absorbed a similar view; I’m not totally sure how much of it comes from Eliezer, but I’d guess a lot of it does.
I think part of Eliezer’s view here comes from a picture of intelligence and recursive self-improvement that implies explosive recursive self-improvement begins before high object-level competence on other research tasks. I think this view is most likely mistaken, but my guess is that it’s tied up closely enough with Eliezer’s views about how to build AGI that he won’t want to defend his position here.
(My position is the very naive one: recursive self-improvement will become critical at roughly the same time that AI systems become better than humans at contributing to further AI progress, which has roughly a 50-50 chance of happening before comparable progress on alignment.)
Beyond that, Eliezer has not said very much about where these intuitions are coming from. What he has said does not seem (to me) to have fared particularly well over the last few years. For example:
Similar remarks apply to interpreting and answering “What will be its effect on _?” It turns out that getting an AI to understand human language is a very hard problem, and it may very well be that even though talking doesn’t feel like having a utility function, our brains are using consequential reasoning to do it. Certainly, when I write language, that feels like I’m being deliberate. It’s also worth noting that “What is the effect on X?” really means “What are the effects I care about on X?” and that there’s a large understanding-the-human’s-utility-function problem here.
In fact it does not seem hard to get AI systems to understand the relevant parts of human language (relative to being able to easily kill all humans, or to inevitably be trying to kill all humans). And it does not seem hard to get an AI to predict which things you will judge to be relevant, which makes this a very poor explanation of why Holden’s proposal would fail.
Of course getting an AI to tell you what it’s really thinking may be hard (and indeed I think it’s hard enough that there’s a significant probability we will all die because we failed to solve it). And I think Eliezer even has a fair model of why it’s hard (or at least I’ve often defended him based on a more charitable reading of his overall views).
But my point is that to the extent Eliezer has explained why he thinks AI won’t be helpful until it’s too late, the adjacent intuitions he has offered don’t seem to have stood the test of time well.
Your link redirects back to this page. The quote is from one of Eliezer’s comments in Reply to Holden on Tool AI.
Thanks, fixed.