I’m sympathetic to some of these points, but overall I think it’s still important to acknowledge that outer alignment seems easier than many expected even if we think that inner alignment is still hard. In this post I’m not saying that the whole alignment problem is now easy. I’m making a point about how we should update about the difficulty of one part of the alignment problem, which was at one time considered both hard and important to solve.
I think you’re putting a bit too much weight on the inner vs outer alignment distinction. The central problem that people talked about was always how to get an AI to care about human values. E.g., in The Hidden Complexity of Wishes (THCW), Eliezer writes:
To be a safe fulfiller of a wish, a genie must share the same values that led you to make the wish.
I think the most plausibly correct interpretation here of “a genie must share the same values” is that we need to solve both the value specification and inner alignment problem. I agree that just solving one part doesn’t mean we’ve solved the other. However, again, I’m not claiming the whole problem has been solved.
It was always possible to attempt to solve the value specification problem by just pointing at a human.
Yes, and people gave proposals about how this might be done at the time. For example, I believe this is roughly what Paul Christiano was trying to do when he proposed approval-directed agents. Nonetheless, these were attempts. People didn’t know whether the solutions would work well. I think we’ve now gotten more evidence about how hard this part of the problem is.
Do you have an example of one way that the full alignment problem is easier now that we’ve seen that GPT-4 can understand & report on human values?
(I’m asking because it’s hard for me to tell whether your definition of outer alignment is disconnected from the rest of the problem, such that outer alignment could become easier without the rest of the problem becoming easier.)