I am grateful to Noosphere89 and Seth Herd for prompting me to start this discussion.
In human evolution, the fundamental goal was to survive and reproduce, passing on our genes to the next generation. But somewhere along the way, humans developed complex societies and technologies that go beyond just survival and reproduction. Now, we often find ourselves pursuing goals that aren’t always aligned with basic evolutionary drives. In fact, we’ve become worse at our evolutionary goal than our ancestors were (birth rates are at an all-time low).
Even if we manage to align our first AGI with human goals, how can we ensure that 1) it doesn’t drift from these goals and that 2) it doesn’t create an AGI which drifts from these goals (kind of how we drifted from the goals of our ancestors)? What are the current proposals for solving these issues?
I gave part of my answer in the thread where you first asked this question. Here’s the rest.
TLDR: Value alignment is too hard even without the value stability problem. Goal misspecification is too likely (I realize I don’t know the best ref for this other than LoL; does anyone have a better central ref?). Therefore we’ll very likely align our first AGIs to follow instructions, and use that as a stepping-stone to full value alignment.
This is something I used to worry about a lot. Now it’s something I don’t worry about at all.
I wrote a paper on this, “Goal changes in intelligent agents”, back in 2018 for a small FLI grant (in perhaps the first round of public funds for AGI x-risk). One of my first posts on LW was “The alignment stability problem”.
I still think this would be a very challenging problem if we were designing a value-aligned autonomous AGI. Now I don’t think we’re going to do that.
I now see goal misspecification as a very hard problem, and one we don’t need to tackle to create autonomous AGI or even superintelligence. Therefore I think we won’t.
Instead we’ll make the central goal of our first AGIs to follow instructions or to be corrigible (correctable).
It’s counterintuitive to think of a highly intelligent and fully autonomous being that wants more than anything to do what a less intelligent human tells it to do. But I think it’s completely possible, and a much safer option for our first AGIs.
This is much simpler than trying to instill our values with such accuracy that we’d be happy with the result. Neither showing examples of things we like (as in RL training) nor explicitly stating our values in natural language seems likely to be accurate enough once it has been interpreted by a superintelligent AGI, which is likely to see the world at least somewhat differently than we do. That sort of reinterpretation is functionally similar to value drift, although the two are separable. Adding the problem of actual value drift on top of the dangers of goal misspecification just makes things worse.
Aligning an AGI to follow instructions isn’t trivial either, but it’s a lot easier to specify than getting values right and stable. For instance, LLMs already largely “know” what people tend to mean by instructions—and that’s before the checking phase of do what I mean and check (DWIMAC).
Primarily, though, instruction-following has the enormous advantage of allowing for corrigibility: you can tell your AGI to shut down to accept changes, or issue revised instructions if/when you realize (likely because you asked the AGI) that your instructions would be interpreted differently than you’d like.
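To make the control flow concrete, here is a toy sketch of a do-what-I-mean-and-check loop with a shutdown instruction that is always honored. Everything in it is a hypothetical illustration rather than a proposal: the `InstructionFollower` class, the `interpret` and `confirm` callables, and the literal "shut down" string are all assumptions made for the example. It shows only the shape of the loop; the hard part, getting an AGI to actually want to run this loop, is not something code like this addresses.

```python
from dataclasses import dataclass, field
from typing import Callable, List

SHUTDOWN = "shut down"  # assumed literal shutdown instruction, for illustration only


@dataclass
class InstructionFollower:
    # Hypothetical stand-ins: interpret() turns an instruction into a stated plan,
    # confirm() asks the human principal whether that plan matches their intent.
    interpret: Callable[[str], str]
    confirm: Callable[[str, str], bool]
    running: bool = True
    log: List[str] = field(default_factory=list)

    def handle(self, instruction: str) -> str:
        if not self.running:
            return "ignored: agent is shut down"
        # Corrigibility: a shutdown instruction is honored before any interpretation.
        if instruction.strip().lower() == SHUTDOWN:
            self.running = False
            return "shut down"
        # "Do what I mean": state the interpretation instead of acting on it directly.
        plan = self.interpret(instruction)
        # "And check": act only if the human confirms the stated interpretation.
        if not self.confirm(instruction, plan):
            return "declined: awaiting revised instructions"
        self.log.append(plan)
        return f"executed: {plan}"
```

If the human rejects the stated plan, nothing is executed and they can issue a revised instruction, which is the correction step described above.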
If that works and we get superhuman AGI aligned to follow instructions, we’ll probably want to use that AGI to help us solve the problem of full value alignment, including solving value drift. We won’t want to launch an autonomous AGI that’s not corrigible/instruction-following until we’re really sure our AGIs have a reliable solution. (This assumes those AGIs are controlled by humans who are ethical enough to release control of the future into better hands once they’re available, which is a big if.)