I gave part of my answer in the thread where you first asked this question. Here’s the rest.
TLDR: Value alignment is too hard even without the value stability problem. Goal misspecification is too likely (I realize I don't know the best ref for this other than LoL; anyone have a better central ref?). Therefore we'll very likely align our first AGIs to follow instructions, and use that as a stepping-stone to full value alignment.
This is something I used to worry about a lot. Now it's something I don't worry about at all.
I wrote a paper on this, Goal changes in intelligent agents, back in 2018 under a small FLI grant (perhaps the first round of public funding for AGI x-risk). One of my first posts on LW was The alignment stability problem.
I still think this would be a very challenging problem if we were designing a value-aligned autonomous AGI. Now I don’t think we’re going to do that.
I now see goal misspecification as a very hard problem, and one we don't need to tackle to create autonomous AGI or even superintelligence. Therefore I think we won't tackle it.
Instead we’ll make the central goal of our first AGIs to follow instructions or to be corrigible (correctable).
It’s counterintuitive to think of a highly intelligent and fully autonomous being that wants more than anything to do what a less intelligent human tells them to do. But I think it’s completely possible, and a much safer option for our first AGIs.
This is much simpler than trying to instill our values accurately enough that we'd be happy with the result. Neither showing examples of things we like (as in RL training) nor explicitly stating our values in natural language seems likely to remain accurate once interpreted by a superintelligent AGI that sees the world at least somewhat differently than we do. That sort of re-interpretation is functionally similar to value drift, though the two are separable. Adding actual value drift on top of the dangers of goal misspecification only makes things worse.
Aligning an AGI to follow instructions isn’t trivial either, but it’s a lot easier to specify than getting values right and stable. For instance, LLMs already largely “know” what people tend to mean by instructions—and that’s before the checking phase of do what I mean and check (DWIMAC).
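The checking phase of DWIMAC can be made concrete with a toy sketch: the system interprets an instruction, restates its interpretation back to the user, and only acts once the user confirms that reading matches their intent. All function names below are hypothetical placeholders for illustration, not any real system's API.

```python
# Illustrative sketch of a DWIMAC (do-what-I-mean-and-check) loop.
# interpret, confirm, and act are hypothetical stand-ins: the model's
# reading of an instruction, the user's approval step, and execution.

def dwimac_step(instruction, interpret, confirm, act):
    """Interpret an instruction, check the interpretation with the
    user, and act only on a confirmed reading."""
    plan = interpret(instruction)      # model's reading of the instruction
    if confirm(instruction, plan):     # checking phase: restate and ask
        return act(plan)               # act only if the user approves
    return None                        # otherwise do nothing; await revision

# Toy usage: the "model" interprets too literally, and the "user"
# rejects the mismatched reading, so nothing is executed.
result = dwimac_step(
    "tidy my desktop",
    interpret=lambda s: "delete all files on the desktop",
    confirm=lambda s, p: False,        # user: "that's not what I meant"
    act=lambda p: p,
)
# result is None: the unconfirmed plan was never carried out
```

The safety-relevant property is simply that the act step is gated behind the confirm step, which is the structural difference between instruction-following-with-checking and acting directly on a possibly misread goal.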
Primarily, though, instruction-following has the enormous advantage of allowing for corrigibility: you can tell your AGI to shut down so you can make changes, or issue revised instructions if/when you realize (likely because you asked the AGI) that your instructions would be interpreted differently than you'd like.
If that works and we get superhuman AGI aligned to follow instructions, we'll probably want to use that AGI to help us solve the problem of full value alignment, including solving value drift. We won't want to launch an autonomous AGI that's not corrigible/instruction-following until we're confident our AGIs have a solid solution. (This assumes those AGIs are controlled by humans ethical enough to release control of the future into better hands once they're available, which is a big if.)