I agree that MIRI’s initial replies don’t seem to address your points and seem to be straw-manning you. But there is one point they’ve made, in some of the comments, that seems central to me. I’d translate it like this, to tie it more explicitly to your post:
“Even if GPT-N can answer questions about whether outcomes are bad or good, thereby providing “a value function”, that value function is still a proxy for human values, since the system is still just relaying the answers that would make humans give a thumbs up or a thumbs down.”
To me, this seems like the strongest objection: you haven’t solved the value specification problem if your value function is still a proxy that can be Goodharted, etc.
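To spell out the Goodhart worry with a toy example (everything here is invented by me for illustration: the “substance” and “persuasiveness” features, the fooling threshold, all of it):

```python
# Toy illustration of the Goodhart worry. "substance" stands in for what
# humans actually value; "persuasiveness" for surface features that can
# fool the evaluator. All numbers and features are invented.
import random

random.seed(0)

def true_value(outcome):
    return outcome["substance"]

def proxy_value(outcome):
    # The thumbs-up signal: tracks substance, unless the outcome is
    # persuasive enough to fool the evaluator outright.
    if outcome["persuasiveness"] > 0.99:
        return 1.0
    return outcome["substance"]

candidates = [
    {"substance": random.random(), "persuasiveness": random.random()}
    for _ in range(100_000)
]

best_by_proxy = max(candidates, key=proxy_value)
print("proxy score of the proxy-optimal outcome:", proxy_value(best_by_proxy))
print("true value of the proxy-optimal outcome: ", true_value(best_by_proxy))
# The proxy-optimal outcome gets a perfect proxy score, but its true value
# is just whatever the first evaluator-fooling candidate happened to have.
```

The point is just that the harder you optimize against the thumbs-up signal, the more the gap between “scores well on the proxy” and “is actually good” matters.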
If you think about it this way, then the specification problem gets moved to the procedure you use to finetune large language models so that they can answer questions about human values. If the training mechanism you use to “lift” human values out of an LLM’s predictive model is imperfect, then the answers you get won’t be good enough to build a value function we can trust.
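To make the “lifting” idea concrete, here’s a minimal sketch of the kind of thing I mean; `toy_model`, the prompt wording, and the 0–10 scale are all placeholders I’m inventing, not anything from your post or from a real API:

```python
# Minimal sketch of "lifting" a value function out of a language model.
# `toy_model` is a stand-in for a finetuned GPT-N; the prompt wording and
# the 0-10 scale are placeholders invented for this sketch.

def toy_model(prompt: str) -> str:
    # Placeholder for a call to the finetuned model. Here it just returns
    # a fixed rating so the sketch runs on its own.
    return "5"

def lifted_value_function(outcome_description: str, model=toy_model) -> float:
    """Score an outcome by asking the model how good it is for humans."""
    answer = model(
        "On a scale from 0 (catastrophic) to 10 (excellent), how good is "
        "the following outcome for humans? Answer with a single number.\n\n"
        f"Outcome: {outcome_description}"
    )
    return float(answer.strip()) / 10.0

print(lifted_value_function("Everyone gets access to a cheap cure for malaria."))
```

The objection above then becomes: this function is only as trustworthy as the finetuning that produced the model’s answers.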
That said, we have GPT-4 now, and with better alignment techniques down the line, I’m not so sure we won’t be able to get a genuinely good value function by querying some more advanced, better-aligned language model and then using its answers as a training signal for something more agentic. Even then, granting that the value function part works out, we’d still have the inner alignment problem to solve, and I’m not sure we should be much more optimistic than before considering all these arguments. Maybe somewhat, though.
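Very roughly, that “training signal for something more agentic” step would look like reusing the lifted value function from the previous sketch as the thing the agent optimizes. In practice this would be RLAIF-style RL finetuning; the best-of-n selection below is just the simplest runnable stand-in:

```python
# Very rough sketch of the "training signal for something more agentic"
# step, reusing `lifted_value_function` from the previous sketch. In
# practice this would be RLAIF-style RL finetuning; best-of-n selection
# is just the simplest stand-in for "an agent optimizing the model's scores".

def act(candidate_plans, value_function):
    """Pursue whichever plan the model-derived value function rates highest."""
    return max(candidate_plans, key=value_function)

plans = [
    "Optimize the reported engagement metrics as hard as possible.",
    "Do what the user actually asked for, even if the metrics dip.",
]
chosen = act(plans, lifted_value_function)
# With the toy stand-in model every plan gets the same score, so this just
# picks the first plan; with a real finetuned model, the whole question is
# whether its scores track what we actually want under optimization pressure.
```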
Keeping all this in mind, the actual crux of the post seems to me to be this passage:
I claim that GPT-4 is already pretty good at extracting preferences from human data. If you talk to GPT-4 and ask it ethical questions, it will generally give you reasonable answers. It will also generally follow your intended directions, rather than what you literally said. Together, I think these facts indicate that GPT-4 is probably on a path towards an adequate solution to the value identification problem, where “adequate” means “about as good as humans”. And to be clear, I don’t mean that GPT-4 merely passively “understands” human values. I mean that asking GPT-4 to distinguish valuable and non-valuable outcomes works pretty well at approximating the human value function in practice, and this will become increasingly apparent in the near future as models get more capable and expand to more modalities.
[8] If you disagree that AI systems in the near-future will be capable of distinguishing valuable from non-valuable outcomes about as reliably as humans, then I may be interested in operationalizing this prediction precisely, and betting against you. I don’t think this is a very credible position to hold as of 2023, barring a pause that could slow down AI capabilities very soon.
About this, MIRI-in-my-head would say: “No. RLHF or similarly inadequate training techniques mean that GPT-N’s answers would only build a bad proxy value function”.
And Matthew-in-my-head would say: “But in practice, when I interrogate GPT-4, its answers are fine, and they will improve further as LLMs get better. So I don’t see why future systems couldn’t be used to construct a good value function, actually”.