Thank you for writing this! I’ve been trying to consolidate my own thoughts around reward modeling and theoretical vs. empirical alignment research for a long time, and this post and the discussion have been very helpful. I’ll probably write that up as a separate post later, but for now I have a few questions:
What does the endgame look like? The post emphasizes that we only need an MVP alignment research AI, so it can be relatively unintelligent, narrow, myopic, non-agenty, etc. This means that it poses less capabilities risk and is easier to evaluate, both of which are great. But eventually we may need to align AGI that is none of these things. Is the idea that this alignment research AI will discover/design alignment techniques that a) human researchers can evaluate and b) will work on future AGI? Or do we start using other narrowly aligned models to evaluate it at some point? How do we convince ourselves that all of this is working towards the goal of “aligned AI” and not “looks good to alignment researchers”?
Related to that, the post says “the burden of proof is always on showing that a new system is sufficiently aligned” and “We have to mistrust what the model is doing anyway and discard it if we can’t rigorously evaluate it.” What might this proof or rigorous evaluation look like? Is this something that can be done with empirical alignment work?
I agree that the shift in AI capabilities paradigms from DRL agents playing games to LLMs generating text seems good for alignment, in part because LLM training could teach human values and introduce an ontology for understanding human preferences and communication. But clearly LLM pretraining doesn’t teach all human values—if it did, then RLHF finetuning wouldn’t be required at all. How can we know what values are “missing” from pretraining, and how can we tell if/when RLHF has filled in the gap? Is it possible to verify that model alignment is “good enough”?
Finally, this might be more of an objection than a question, but… One of my major concerns is that automating alignment research also helps automate capabilities research. One of the main responses to this in the post is that “automated ML research will happen anyway.” However, if this is true, then why is OpenAI safety dedicating substantial resources to it? Wouldn’t it be better to wait for ML researchers to knock that one out, and spend the interim working on safety-specific techniques (like interpretability, since it’s mentioned a lot in the post)? If ML researchers won’t do that satisfactorily, then isn’t dedicating safety effort to it differentially advancing capabilities?
But clearly LLM pretraining doesn’t teach all human values—if it did, then RLHF finetuning wouldn’t be required at all.
The simulators framing suggests that RLHF doesn’t teach an LLM anything new. Instead, the LLM can already instantiate many agents/simulacra, but does so haphazardly, and RLHF picks out particular simulacra and stabilizes them, anchoring them to their dialogue handles. From this point of view, the LLM could well have all human values, but isn’t good enough to channel stable simulacra that exhibit them; RLHF stabilizes a few simulacra that do exhibit them, in ways useful for their roles in a bureaucracy.