I think I’m missing you then.
If (some) people already had the view that this kind of prosaic alignment wouldn’t scale to Superintelligence, but didn’t express an opinion about whether behavioral alignment of slightly-sub-AGI would be solved, in what way do you want them to be updating that they’re not?
Or do you mean they weren’t just agnostic about the behavioral alignment of near-AGIs, but specifically thought that it wouldn’t be easy? Is that right?
Two points:
One, I think being able to align AGI and slightly sub-AGI successfully is plausibly very helpful for making the alignment problem easier. It’s kind of like learning that we can create more researchers on demand if we ever wanted to.
Two, the fact that the methods scale surprisingly well to human-level is evidence that they actually work pretty well in general, even if they don’t scale all the way into some radical regime way above human-level. For example, Eliezer talked about how he expected you’d need to solve the suspend button problem by the time your AI has situational awareness. I think you can interpret that prediction either as becoming increasingly untenable, or as evidence that we’re close to a solution to the problem, since our AIs don’t seem to be resisting shutdown.
Again, presumably once you get an aligned AGI, you can use many copies of it to help you with the next iteration, AGI+. This seems plausibly very positive as an update. I can sympathize with those who say it’s only a minor update because they never thought the problem was merely aligning human-level AI, but I’m a bit baffled by those who say it’s not an update at all from the traditional AI risk models, and are still very pessimistic.
I feel like I’m being obstinate or something, but I think that the linked article is still basically correct, and not particularly untenable.
From the article...
The key word in that sentence is “consequentialist”. Current LLMs are pretty close (I think!) to having pretty detailed situational awareness. But, as near as I can tell, LLMs are, at best, barely consequentialist.
I agree that that is a surprise, on the old school LessWrong / MIRI world view. I had assumed that “intelligence” and “agency” were way more entangled, way more two sides of the same coin, than they apparently are.
And the framing of the article focuses on situational awareness rather than consequentialism because of that error: Eliezer (and I) thought at the time that situational awareness would come after consequentialist reasoning in the tech tree.
But I expect that we’ll have consequentialist agents eventually (if not, that’s a huge crux for how dangerous I expect AGI to be), and I expect that you’ll have “off button” problems at the point when you have “enough” consequentialism aimed at some goal, “enough” strategic awareness, and strong “enough” capabilities that the AI can route around the humans and the human safeguards.
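To make that claim concrete, here’s a minimal toy sketch of the “off button” argument: a pure task-utility maximizer comparing the plan of respecting the off switch against the plan of disabling it. None of this is from the linked article; the function names and probabilities are purely illustrative.

```python
# Toy model of the "off button" argument above: a pure consequentialist that
# only cares about finishing its task compares the expected utility of
# respecting the off switch vs. disabling it. All numbers are illustrative.

DISABLE_EFFORT_COST = 0.01  # small utility cost for spending effort on the switch


def expected_utility(plan: str, p_humans_press: float, p_disable_works: float) -> float:
    """Expected utility = probability the task gets finished (1 if finished, 0 if shut down)."""
    if plan == "respect_off_switch":
        # The task finishes only if the humans never press the button.
        return 1.0 - p_humans_press
    if plan == "disable_off_switch":
        # If the disable attempt works, the button no longer matters;
        # otherwise the agent is back to hoping the humans don't press it.
        survive = p_disable_works + (1.0 - p_disable_works) * (1.0 - p_humans_press)
        return survive - DISABLE_EFFORT_COST
    raise ValueError(f"unknown plan: {plan}")


def choose_plan(p_humans_press: float, p_disable_works: float) -> str:
    plans = ["respect_off_switch", "disable_off_switch"]
    return max(plans, key=lambda p: expected_utility(p, p_humans_press, p_disable_works))


# A weak agent (its disable attempt almost never works) behaves corrigibly by default,
# while a capable agent with the *same* goal routes around the safeguard.
print(choose_plan(p_humans_press=0.2, p_disable_works=0.01))  # -> respect_off_switch
print(choose_plan(p_humans_press=0.2, p_disable_works=0.99))  # -> disable_off_switch
```

The plan flips purely because the agent’s ability to defeat the safeguard crosses a threshold; its goal never changes, which is the sense in which “enough” consequentialism plus strong “enough” capabilities produces the off-button problem.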
In my opinion, the extent to which the linked article is correct is roughly the extent to which the article is saying something trivial and irrelevant.
The primary thing I’m trying to convey here is that we now have helpful, corrigible assistants (LLMs) that can aid us in achieving our goals, including alignment, and the rough method used to create these assistants seems to scale well, perhaps all the way to human level or slightly beyond it.
Even if the post is technically correct because a “consequentialist agent” is still incorrigible (perhaps by definition), and GPT-4 is not a “consequentialist agent”, this doesn’t seem to matter much from the perspective of alignment optimism, since we can just build helpful, corrigible assistants to help us with our alignment work instead of consequentialist agents.