Keeping all this in mind, the actual crux of the post seems to me to be:
I claim that GPT-4 is already pretty good at extracting preferences from human data. If you talk to GPT-4 and ask it ethical questions, it will generally give you reasonable answers. It will also generally follow your intended directions, rather than what you literally said. Together, I think these facts indicate that GPT-4 is probably on a path towards an adequate solution to the value identification problem, where “adequate” means “about as good as humans”. And to be clear, I don’t mean that GPT-4 merely passively “understands” human values. I mean that asking GPT-4 to distinguish valuable and non-valuable outcomes works pretty well at approximating the human value function in practice, and this will become increasingly apparent in the near future as models get more capable and expand to more modalities.
[8] If you disagree that AI systems in the near-future will be capable of distinguishing valuable from non-valuable outcomes about as reliably as humans, then I may be interested in operationalizing this prediction precisely, and betting against you. I don’t think this is a very credible position to hold as of 2023, barring a pause that could slow down AI capabilities very soon.
To this, MIRI-in-my-head would say: “No. RLHF or similarly inadequate training techniques mean that GPT-N’s answers would build a bad proxy value function”.
And Matthew-in-my-head would say: “But in practice, when I interrogate GPT-4 its answers are fine, and they will improve further as LLMs get better. So I don’t see why future systems couldn’t be used to construct a good value function, actually”.