My paraphrase of your (Matthew’s) position: while I’m not claiming that GPT-4 provides any evidence about inner alignment (i.e. getting an AI to actually care about human values), I claim that it does provide evidence about outer alignment being easier than we thought: we can specify human values via language models, which have a pretty robust understanding of human values and don’t systematically deceive us about their judgement. This means people who used to think outer alignment / value specification was hard should change their minds.
(End paraphrase)
I think this claim is mistaken, or at least it rests on false assumptions about what alignment researchers believe. Here’s a bunch of different angles on why I think this:
My guess is that a big part of the disagreement here is that you make some wrong assumptions about what alignment researchers believe.
I think you’re putting a bit too much weight on the inner vs outer alignment distinction. The central problem that people talked about always was how to get an AI to care about human values. E.g. in The Hidden Complexity of Wishes (THCW) Eliezer writes
To be a safe fulfiller of a wish, a genie must share the same values that led you to make the wish.
If you find something that looks to you like a solution to outer alignment / value specification, but it doesn’t help make an AI care about human values, then you’re probably mistaken about what actual problem the term ‘value specification’ is pointing at. (Or maybe you’re claiming that value specification is just not relevant to AI safety—but I don’t think you are?).
It was always possible to attempt to solve the value specification problem by just pointing at a human. The fact that we can now also point at an LLM and get a result that’s not all that much worse than pointing at a human is not cause for an update about how hard value specification is. Part of the difficulty is how to define the pointer to the human and get a model to maximize human values rather than maximize some error in your specification. IMO THCW makes this point pretty well.
It’s tricky to communicate problems in AI alignment: people come in with lots of different assumptions about what kinds of things are easy / hard, and it’s hard to resolve disagreements because we don’t have a live AGI to do experiments on. I think THCW and the related essays you criticize are actually great resources. They don’t try to communicate the entire problem at once because that’s infeasible. The fact that human values are complex and hard to specify explicitly is part of the reason why alignment is hard, where alignment means getting the AI to care about human values, not getting an AI to answer questions about moral behavior reasonably.
You claim the existence of GPT-4 is evidence against the claims in THCW. But IMO GPT-4 fits in neatly with THCW. The post even starts with a taxonomy of genies:
There are three kinds of genies: Genies to whom you can safely say “I wish for you to do what I should wish for”; genies for which no wish is safe; and genies that aren’t very powerful or intelligent.
GPT-4 is an example of a genie that is not very powerful or intelligent.
If in 5 years we build firefighter LLMs that can rescue mothers from burning buildings when you ask them to, that would also not show that we’ve solved value specification: it’s just a didactic example, not a full description of the actual technical problem. More broadly, I think it’s plausible that within a few years LLMs will be able to give moral counsel far better than the average human. That still doesn’t solve value specification, any more than the existence of humans who could give good moral counsel 20 years ago solved value specification.
If you could come up with a simple action-value function Q(observation, action), that when maximized over actions yields a good outcome for humans, then I think that would probably be helpful for alignment. This is an example of a result that doesn’t directly make an AI care about human values, but would probably lead to progress in that direction. I think if it turned out to be easy to formalize such a Q then I would change my mind about how hard value specification is.
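To make the shape of that concrete, here is a minimal Python sketch of what “maximizing Q over actions” means operationally. Everything in it is a hypothetical stand-in: `toy_q`, the observation string, and the action names are invented for illustration. The hard part (writing a Q whose maximization is actually good for humans) is exactly what the toy version skips.

```python
from typing import Callable, Iterable

# Hypothetical type alias: Q maps (observation, action) to a score.
QFunction = Callable[[str, str], float]


def greedy_action(q: QFunction, observation: str, actions: Iterable[str]) -> str:
    """Pick the action with the highest Q-value for this observation."""
    return max(actions, key=lambda action: q(observation, action))


def toy_q(observation: str, action: str) -> float:
    """Illustrative hand-written Q; NOT a real value specification."""
    table = {
        ("mother in burning building", "carry her out safely"): 1.0,
        ("mother in burning building", "launch her out the window"): -0.5,
        ("mother in burning building", "do nothing"): -1.0,
    }
    # Any (observation, action) pair nobody wrote down scores 0.0.
    return table.get((observation, action), 0.0)


ACTIONS = ["do nothing", "launch her out the window", "carry her out safely"]
chosen = greedy_action(toy_q, "mother in burning building", ACTIONS)
```

The lookup table only covers the cases someone thought to write down; for any other observation every action ties at 0.0, which is a small-scale version of why a simple, explicit Q is hard to come by.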
While language models understand human values to some extent, they aren’t robust. The RLHF/RLAIF family of methods is based on using an LLM as a reward model, and to make things work you need to be careful not to optimize too hard or you’ll just get gibberish (Gao et al. 2022). LLMs don’t hold up against mundane RLHF optimization pressure, never mind an actual superintelligence. (Of course, humans wouldn’t hold up either.)
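Here is a toy Python illustration of that overoptimization pattern. It is not the experimental setup from Gao et al.: `true_value`, `proxy_reward`, and the use of a search radius as a stand-in for “optimization pressure” are all assumptions made up for this sketch. The proxy agrees with the true objective near the starting point but keeps rewarding movement in a direction where the true objective eventually falls.

```python
def true_value(x: float) -> float:
    """Hypothetical 'what we actually want': improves up to x = 1, then degrades."""
    return x - 0.5 * x * x


def proxy_reward(x: float) -> float:
    """Hypothetical learned reward: matches true_value near x = 0, but never turns over."""
    return x


def optimize_proxy(search_radius: float, steps: int = 101) -> float:
    """Return the candidate in [0, search_radius] with the highest proxy reward."""
    candidates = [search_radius * i / (steps - 1) for i in range(steps)]
    return max(candidates, key=proxy_reward)


def sweep(radii):
    """For each radius, report (radius, proxy score, true score) of the chosen point."""
    rows = []
    for radius in radii:
        best = optimize_proxy(radius)
        rows.append((radius, proxy_reward(best), true_value(best)))
    return rows


# A larger radius means more optimization pressure against the proxy.
results = sweep([0.5, 1.0, 3.0])
```

In this toy, the proxy score keeps climbing as the radius grows, while the true score peaks at moderate pressure and then drops below where it started.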
I’m sympathetic to some of these points, but overall I think it’s still important to acknowledge that outer alignment seems easier than many expected even if we think that inner alignment is still hard. In this post I’m not saying that the whole alignment problem is now easy. I’m making a point about how we should update about the difficulty of one part of the alignment problem, which was at one time considered both hard and important to solve.
I think you’re putting a bit too much weight on the inner vs outer alignment distinction. The central problem that people talked about always was how to get an AI to care about human values. E.g. in The Hidden Complexity of Wishes (THCW) Eliezer writes
To be a safe fulfiller of a wish, a genie must share the same values that led you to make the wish.
I think the most plausibly correct interpretation here of “a genie must share the same values” is that we need to solve both the value specification and inner alignment problems. I agree that just solving one part doesn’t mean we’ve solved the other. However, again, I’m not claiming the whole problem has been solved.
It was always possible to attempt to solve the value specification problem by just pointing at a human.
Yes, and people gave proposals about how this might be done at the time. For example, I believe this is roughly what Paul Christiano was trying to do when he proposed approval-directed agents. Nonetheless, these were attempts. People didn’t know whether the solutions would work well. I think we’ve now gotten more evidence about how hard this part of the problem is.
Do you have an example of one way that the full alignment problem is easier now that we’ve seen that GPT-4 can understand & report on human values?
(I’m asking because it’s hard for me to tell if your definition of outer alignment is disconnected from the rest of the problem in a way where it’s possible for outer alignment to become easier without the rest of the problem becoming easier).
I don’t speak for Matthew, but I’d like to respond to some points. My reading of his post is the same as yours, but I don’t fully agree with what you wrote as a response.
If you find something that looks to you like a solution to outer alignment / value specification, but it doesn’t help make an AI care about human values, then you’re probably mistaken about what actual problem the term ‘value specification’ is pointing at.
[...]
It was always possible to attempt to solve the value specification problem by just pointing at a human. The fact that we can now also point at an LLM and get a result that’s not all that much worse than pointing at a human is not cause for an update about how hard value specification is
My objection to this is that if an LLM can substitute for a human, it could train the AI system we’re trying to align much faster and for much longer. This could make all the difference.
If you could come up with a simple action-value function Q(observation, action), that when maximized over actions yields a good outcome for humans, then I think that would probably be helpful for alignment.
I suspect (and I could be wrong) that Q(observation, action) is basically what Matthew claims GPT-N could be. A human who gives moral counsel can only say so much and, therefore, can give less information to the model we’re trying to align. An LLM wouldn’t be as limited and could provide a ton of information about Q(observation, action), so we can, in practice, consider it as being our specification of Q(observation, action).
Edit: another option is that GPT-N, for the same reason of not being limited by speed, could write out a pretty huge Q(observation, action) that would be good, unlike a human.