As a baseline, developers could train agents to imitate the truth-seeking process of the most reasonable humans on Earth. For example, they could sample the brightest intellects from every ideological walk, and train agents to predict their actions.
I’m very excited about strategies that involve lots of imitation learning on lots of particular humans. I’m not sure if imitated human researchers learn to generalize to doing lots of novel research, but this seems great for examining research outputs of slightly-more-alien agents very quickly.
To me this doesn’t seem like a failure of sophisticated reward models, it’s the failure of unsophisticated reward models (unit tests) when they’re being optimized against. I think that if we were to add some expensive evaluation during RL whereby 3.6 checked if 3.7 was “really doing the work”, this sort of special-casing would get totally trained out.
(Not claiming that this is always the case, or that models couldn’t be deceptive here, or that e.g. 3.8 couldn’t reward hack 3.7)