Thanks for this post! I have to admit that it took me a while to read it because I expected it to be basic, but I really like the focus on more current techniques (which makes sense given that you cofounded and work at OpenAI).
Let’s start with the wise AI advisor. Even if our model has internal knowledge about the truth and human wellbeing, that doesn’t mean that it’ll act on that knowledge the way we want. Rather, the model has been trained to imitate the training corpus, and therefore it’ll repeat the misconceptions and flaws of typical authors, even if it knows that they’re mistaken about something.
That doesn’t feel as bad to me as you describe it. Sure, if you literally call up a “wise old man” from the literature (or, god forbid, reddit), that might end pretty badly. But we might go for tighter control over the sort of “language producer” we’re trying to instantiate. Or go microscope AI.

All of these do require more alignment-focused work, though. I’m particularly excited about perspectives on language models as simulators of many small models of things producing/influencing language, and about techniques related to that view, like meta-prompts or counterfactual parsing.
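To make the “tighter control over the language producer” idea a bit more concrete, here is a minimal sketch of steering a model with a meta-prompt, assuming the Hugging Face `transformers` library; the persona framing, the gpt2 stand-in model, and the `advise` helper are my own illustrative assumptions, not anything from the post.

```python
# Minimal sketch: steering which "language producer" a model instantiates
# via a meta-prompt, rather than asking for a generic "wise old man".
# Assumes the Hugging Face `transformers` library; gpt2 is a stand-in model.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# Meta-prompt pinning down the kind of author we want the model to simulate:
# a careful, calibrated advisor, not the average internet commenter.
META_PROMPT = (
    "The following is written by a careful researcher who states their "
    "uncertainty explicitly and corrects popular misconceptions.\n\n"
    "Question: {question}\nAnswer:"
)

def advise(question: str) -> str:
    prompt = META_PROMPT.format(question=question)
    out = generator(prompt, max_new_tokens=80, do_sample=True, temperature=0.7)
    # The pipeline returns the prompt plus the continuation; keep only the continuation.
    return out[0]["generated_text"][len(prompt):]

print(advise("Is it safe to look directly at the sun during an eclipse?"))
```

The point is not that this particular prompt is safe, only that it pins down which author the model is asked to simulate, instead of leaving that choice to whatever “wise old man” dominates the training corpus.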
I also feel like this answer from the Advocate disparages a potentially very big deal for language models: the fact that they might pick up human abstractions because they learn to model language, and our use of language is littered with these abstractions. This is a potentially strong version of the natural abstraction hypothesis, which seems to make the problem easier in some ways. For example, we have a better chance of understanding what the model might do because it’s trying to predict a system (language) that we use constantly at that level of granularity, as opposed to images, which we never think of pixel by pixel.
Optimize the right objective, which is usually hard to measure and optimize, and is not the logprob of the human-provided answer. (We’ll need to use reinforcement learning.)
I want to point out that from an alignment standpoint, this looks like a very dangerous step. One thing language models have going for them is that what they optimize for isn’t exactly what we use them for, and so they avoid potential issues like Goodharting. This would be completely destroyed by adding an explicit optimization step at the end.
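As a rough illustration of the contrast I have in mind, here is a sketch (in PyTorch, with toy function names of my own) of the two objectives: pure imitation of the human-provided answer versus an explicit REINFORCE-style update against a learned proxy reward. Only the second gives the model a measurable score it can learn to Goodhart against.

```python
# Sketch contrasting the two objectives, assuming PyTorch; the function names
# and shapes are illustrative, not any particular library's API.
import torch
import torch.nn.functional as F

# (1) Pure imitation: maximize the log-prob of the human-provided answer.
# The model is only ever pushed toward "what a human author would write",
# so there is no separate proxy score to exploit.
def imitation_loss(logits: torch.Tensor, target_tokens: torch.Tensor) -> torch.Tensor:
    # logits: (seq_len, vocab_size), target_tokens: (seq_len,) of token ids
    return F.cross_entropy(logits, target_tokens)

# (2) Explicit optimization against a learned reward (REINFORCE-style update).
# The policy is now directly pushed to maximize a measurable proxy score,
# which is exactly where Goodharting can creep in: it can learn to exploit
# the reward model instead of actually being helpful or truthful.
def rl_loss(log_prob_of_sample: torch.Tensor, proxy_reward: torch.Tensor) -> torch.Tensor:
    # log_prob_of_sample: summed log-prob of a sampled answer; proxy_reward: scalar
    return -(proxy_reward.detach() * log_prob_of_sample)
```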
Returning to the original question, there was the claim that alignment gets easier as the models get smarter. It does get easier in some ways, but it also gets harder in others. Smarter models will be better at gaming our reward functions in unexpected and clever ways—for example, producing the convincing illusion of being insightful or helpful, while actually being the opposite. And eventually they’ll be capable of intentionally deceiving us.
I think this is definitely an important point, and one that goes beyond the special case of language models that you mostly discuss earlier.
While alignment and capabilities aren’t distinct, they correspond to different directions that we can push the frontier of AI. Alignment advances make it easier to optimize hard-to-measure objectives like being helpful or truthful. Capabilities advances also sometimes make our models more helpful and more accurate, but they also make the models more potentially dangerous.
One thing I would point out is that another crucial difference lies in the sort of conceptual research that is done in alignment. Deconfusion of ideas like power-seeking, enlightened judgment, and goal-directedness is rarely that useful for capabilities, but I’m pretty convinced it is crucial for better understanding alignment risks and how to deal with them.