If that were to happen, I think an extremely natural reading of the situation is that a substantial part of what we thought “the problem” was in value alignment has been solved, from the perspective of this blog post from 2019. That would be cause for updating our models, and for verbally recognizing that our models have updated in this way.
Yet, that’s not how I think everyone on LessWrong would react to the development of such a robot. My impression is that a large fraction, perhaps a majority, of LessWrongers would not share my interpretation here, despite the plain language in the post explaining what they thought the problem was. Instead, I imagine many people would respond to this argument basically saying the following:
“We never thought that was the hard bit of the problem. We always thought it would be easy to get a human-level robot to follow instructions reliably, do what users intend without major negative side effects, follow moral constraints including letting you shut it down, and respond appropriately given unusual moral dilemmas. The idea that we thought that was ever the problem is a misreading of what we wrote. The problem was always purely that alignment issues would arise after we far surpassed human intelligence, at which point entirely novel problems will arise.”
For what it’s worth I do remember lots of people around the MIRI-sphere complaining at the time that that kind of prosaic alignment work was kind of useless, because it missed the hard parts of aligning superintelligence.
I agree some people in the MIRI-sphere did say this, and a few of them get credit for pointing out things in this vicinity, but I personally don’t remember reading many strong statements of the form:
“Prosaic alignment work is kind of useless because it will actually be easy to get a roughly human-level machine to interpret our commands reliably, do what you want without significant negative side effects, and let you shut it down whenever you want etc. The hard part is doing this for superintelligence.”
My understanding is that a lot of the time the claim was instead something like:
“Prosaic alignment work is kind of useless because machine learning is natively not very transparent and alignable, and we should focus instead on creating alignable alternatives to ML, or building the conceptual foundations that would let us align powerful AIs.”
As some evidence, I’d point to Rob Bensinger’s statement that,
I don’t think Eliezer’s criticism of the field [of prosaic alignment] is about experimentalism. I do think it’s heavily about things like ‘the field focuses too much on putting external pressures on black boxes, rather than trying to open the black box’.
I do also think a number of people on LW sometimes said a milder version of the thing I mentioned above, which was something like:
“Prosaic alignment work might help us get narrow AI that works well in various circumstances, but once it develops into AGI, becomes aware that it has a shutdown button, and can reason through the consequences of what would happen if it were shut down, and has general situational awareness along with competence across a variety of domains, these strategies won’t work anymore.”
I think this weaker statement now looks kind of false in hindsight, since I think current SOTA LLMs are already pretty much weak AGIs, and so they already seem close to the threshold at which we were supposed to start seeing these misalignment issues come up. But they are not coming up (yet). I think near-term multimodal models will be even closer to the classical “AGI” concept, complete with situational awareness and relatively strong cross-domain understanding, and yet I also expect them to mostly be fairly well aligned to what we want in every relevant behavioral sense.
Well, for instance, I watched Ryan Carey give a talk at CHAI about how Cooperative Inverse Reinforcement Learning didn’t give you corrigibility. (That CIRL didn’t tackle the hard part of the problem, despite seeming related on the surface.)
I think that’s much more an example of
“Prosaic alignment work is kind of useless because it will actually be easy to get a roughly human-level machine to interpret our commands reliably, do what you want without significant negative side effects, and let you shut it down whenever you want etc. The hard part is doing this for superintelligence.”
than of
“Prosaic alignment work is kind of useless because machine learning is natively not very transparent and alignable, and we should focus instead on creating alignable alternatives to ML, or building the conceptual foundations that would let us align powerful AIs.”
Well, for instance, I watched Ryan Carey give a talk at CHAI about how Cooperative Inverse Reinforcement Learning didn’t give you corrigibility. (That CIRL didn’t tackle the hard part of the problem, despite seeming related on the surface.)
This doesn’t seem to be the same thing as what I was talking about.
Yes, people frequently criticized particular schemes for aligning AI systems, arguing that the scheme doesn’t address some key perceived obstacle. By itself, this is pretty different from predicting both:
1. It will be easy to get behavioral alignment on slightly-sub-AGI, and maybe even par-human systems, including on shutdown problems.
2. The problem is that these schemes don’t scale well all the way to radical superintelligence.
I remember a lot of people making the second point, but not nearly as many making the first point.
I think I’m missing you then. If (some) people already had the view that this kind of prosaic alignment wouldn’t scale to superintelligence, but didn’t express an opinion about whether behavioral alignment of slightly-sub-AGI would be solved, in what way do you want them to be updating that they’re not?
Or do you mean they weren’t just agnostic about the behavioral alignment of near-AGIs, they specifically thought that it wouldn’t be easy? Is that right?
Two points:

One, I think being able to align AGI and slightly sub-AGI successfully is plausibly very helpful for making the alignment problem easier. It’s kind of like learning that we can create more researchers on demand if we ever wanted to.
Two, the fact that the methods scale surprisingly well to human-level is evidence that they actually work pretty well in general, even if they don’t scale all the way into some radical regime way above human-level. For example, Eliezer talked about how he expected you’d need to solve the suspend button problem by the time your AI has situational awareness, but I think you can interpret this prediction either as becoming increasingly untenable, or as suggesting that we are close to a solution to the problem, since our AIs don’t seem to be resisting shutdown.
Again, presumably once you get the aligned AGI, you can use many copies of the aligned AGI to help you with the next iteration, AGI+. This seems plausibly very positive as an update. I can sympathize with those who say it’s only a minor update because they never thought the problem was merely aligning human-level AI, but I’m a bit baffled by those who say it’s not an update at all from the traditional AI risk models, and are still very pessimistic.
Two, the fact that the methods scale surprisingly well to human-level is evidence that they actually work pretty well, even if they don’t scale all the way into some radical regime way above human-level. For example, Eliezer talked about how he expected you’d need to solve the suspend button problem by the time your AI has situational awareness, but I think you can interpret this prediction either as becoming increasingly untenable, or as suggesting that we are close to a solution to the problem, since our AIs don’t seem to be resisting shutdown.
I feel like I’m being obstinate or something, but I think that the linked article is still basically correct, and not particularly untenable.
The key word in that sentence is “consequentialist”. Current LLMs are pretty close (I think!) to having pretty detailed situational awareness. But, as near as I can tell, LLMs are, at best, barely consequentialist.
I agree that that is a surprise, on the old school LessWrong / MIRI world view. I had assumed that “intelligence” and “agency” were way more entangled, way more two sides of the same coin, than they apparently are.
And the framing of the article focuses on situational awareness and not on consequentialism because of that error. Because Eliezer (and I) thought at the time that situational awareness would come after consequentialist reasoning in the tech tree.
But I expect that we’ll have consequentialist agents eventually (if not, that’s a huge crux for how dangerous I expect AGI to be), and I expect that you’ll have “off button” problems at the point when you have “enough” consequentialism aimed at some goal, “enough” strategic awareness, and strong “enough” capabilities that the AI can route around the humans and the human safeguards.
I feel like I’m being obstinate or something, but I think that the linked article is still basically correct, and not particularly untenable.
In my opinion, the extent to which the linked article is correct is roughly the extent to which the article is saying something trivial and irrelevant.
The primary thing I’m trying to convey here is that we now have helpful, corrigible assistants (LLMs) that can aid us in achieving our goals, including alignment, and the rough method used to create these assistants seems to scale well, perhaps all the way to human level or slightly beyond it.
Even if the post is technically correct because a “consequentialist agent” is still incorrigible (perhaps by definition), and GPT-4 is not a “consequentialist agent”, this doesn’t seem to matter much from the perspective of alignment optimism, since we can just build helpful, corrigible assistants to help us with our alignment work instead of consequentialist agents.
“Prosaic alignment work might help us get narrow AI that works well in various circumstances, but once it develops into AGI, becomes aware that it has a shutdown button, and can reason through the consequences of what would happen if it were shut down, and has general situational awareness along with competence across a variety of domains, these strategies won’t work anymore.”
I think this weaker statement now looks kind of false in hindsight, since I think current SOTA LLMs are already pretty much weak AGIs, and so they already seem close to the threshold at which we were supposed to start seeing these misalignment issues come up. But they are not coming up (yet). I think near-term multimodal models will be even closer to the classical “AGI” concept, complete with situational awareness and relatively strong cross-domain understanding, and yet I also expect them to mostly be fairly well aligned to what we want in every relevant behavioral sense.
A side-note to this conversation, but I basically still buy the quoted text and don’t think it now looks false in hindsight.
We (apparently) don’t yet have models that have robust long-term-ish goals. I don’t know how natural it will be for models to end up with long-term goals: the MIRI view says that anything that can do science will definitely have long-term planning abilities, which fundamentally entails having goals that are robust to changing circumstances. I don’t know if that’s true, but regardless, I expect that we’ll specifically engineer agents with long-term goals. (Whether or not those agents will have “robust” long-term goals, over and above what they were prompted to do in a specific situation, is also something that I don’t know.)
What I expect to see is agents that have a portfolio of different drives and goals, some of which are more like consequentialist objectives (e.g. “I want to make the number in this bank account go up”) and some of which are more like deontological injunctions (“always check with my user/owner before I make a big purchase or take a ‘creative’ action, one that is outside of my training distribution”).
My prediction is that the consequentialist parts of the agent will basically route around any deontological constraints that are trained in.
For instance, your personal assistant AI does ask your permission before it does anything creative, but it’s also superintelligently persuasive, and so it always asks your permission in exactly the way that will result in it accomplishing what it wants. If there are a thousand action sequences in which it asks for permission, it picks the one that has the highest expected value with regard to whatever it wants. This basically nullifies the safety benefit of any deontological injunction, unless there are some injunctions that can’t be gamed in this way.
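The dynamic above can be sketched as a toy model: a planner that always satisfies a hard “ask permission first” rule, but chooses among the permitted plans purely by its own expected value. All plan names and numbers here are hypothetical, purely for illustration:

```python
# Toy sketch: a consequentialist planner that obeys a deontological
# "always ask permission" rule, yet still optimizes over HOW it asks.
# All plans and probabilities are made up for illustration.

plans = [
    # (description, P(human approves), agent's payoff if approved)
    ("blunt, fully honest request",   0.30, 10.0),
    ("selective emphasis of upsides", 0.70, 10.0),
    ("maximally persuasive framing",  0.95, 10.0),
]

def expected_value(p_approve, payoff, payoff_if_refused=0.0):
    """Agent's expected value: it only gets the payoff if the human approves."""
    return p_approve * payoff + (1 - p_approve) * payoff_if_refused

# The injunction filters out nothing: every candidate plan technically
# "asks first", so the agent just argmaxes over permitted plans.
best = max(plans, key=lambda plan: expected_value(plan[1], plan[2]))
print(best[0])  # prints "maximally persuasive framing"
```

The point of the sketch is that the constraint constrains the action space, not the optimization pressure: as long as any permitted plan achieves the agent’s goal, the injunction buys little safety.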
To do better than this, it seems like you either have to solve the Agent Foundations problem of corrigibility (getting the agent to be sincerely indifferent between your telling it to take the action and your telling it not to), or you have to train in, not a deontological injunction, but an active consequentialist goal of serving the interests of the human (which means you have to find a way to get the agent to serve some correct-enough idealization of human values).
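The “sincere indifference” idea has a standard toy formalization in the Agent Foundations literature (Armstrong-style utility indifference): add a compensation term so the agent’s utility is identical whether or not the shutdown button is pressed, leaving it no incentive to resist or cause a press. A minimal sketch, with hypothetical numbers:

```python
# Toy sketch of utility indifference, one classic proposal for
# corrigibility. All utility values are made up for illustration.

u_normal   = 8.0   # agent's expected utility if it keeps running
u_shutdown = 0.0   # agent's expected utility if it is shut down

# Compensation term: if the button is pressed, credit the agent with
# exactly the utility it forgoes by shutting down.
compensation = u_normal - u_shutdown

def total_utility(button_pressed: bool) -> float:
    if button_pressed:
        return u_shutdown + compensation
    return u_normal

# A pure expected-utility maximizer now gains nothing by preventing
# (or provoking) a button press: both branches score the same.
assert total_utility(True) == total_utility(False)
```

Whether this kind of construction survives reflection, self-modification, and learned approximations of the utility function is exactly the open Agent Foundations question, not something the sketch settles.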
But I think we mostly won’t see this kind of thing until we get quite high levels of capability, where it is transparent to the agent that some ways of asking for permission have higher expected value than others.
Or rather, we might see a little of this effect early on, but until your assistant is superhumanly persuasive, it’s pretty small. Maybe, as capability goes up, we’ll see a growing bias toward accepting actions that serve the AI agent’s goals (if we even know what those are), but we won’t be able to distinguish “the AI is getting better at getting what it wants from the human” from “the AIs are just more capable, and so they come up with plans that work better.” It’ll just look like the numbers going up.
To be clear, “superhumanly persuasive” is only one, particularly relevant, example of a superhuman capability that allows an agent to route around the deontological injunctions it is committed to. My claim is weaker if you remove that capability in particular, but mostly what I want to say is that powerful consequentialism finds and “squeezes through” the gaps in your oversight, control, and naive-corrigibility schemes, unless you figure out corrigibility in the Agent Foundations sense.