I expect that future powerful models will know whether they are tampering with measurements. They are also reasonably likely to fully understand that human supervisors would prefer this not to happen.
However, it’s less clear that various approaches based on using LLMs will work, particularly in cases where humans don’t understand what is going on at all (e.g. imagine alphafold4, but after RL to make it generate proteins which are useful in particular ways; it’s pretty unclear whether humans would understand why the proteins work).
I think it’s totally plausible that very simple baselines like “just ask the model and generalize from data humans understand” will work even for quite powerful models. But they might fail, and we’d like to start working on the tricky cases now.
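As a rough illustration of what such a baseline could look like, here is a minimal sketch: train a simple probe on model activations from episodes humans can verify, then generalize it to episodes they can’t. The helper inputs (`trusted_acts`, `untrusted_acts`, the labels, and the activation shapes) are hypothetical placeholders, not a description of any particular implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_tampering_probe(trusted_acts: np.ndarray, trusted_labels: np.ndarray):
    """Fit a linear probe on activations from episodes humans could audit.

    trusted_labels: 1 = measurements were tampered with, 0 = clean.
    """
    probe = LogisticRegression(max_iter=1000)
    probe.fit(trusted_acts, trusted_labels)
    return probe

def flag_tampering(probe, untrusted_acts: np.ndarray, threshold: float = 0.5):
    """Score episodes humans can't check and flag likely measurement tampering."""
    scores = probe.predict_proba(untrusted_acts)[:, 1]
    return scores > threshold

# Illustrative usage (shapes are assumptions):
# trusted_acts:   (n_trusted, d_model) activations on human-audited episodes
# untrusted_acts: (n_untrusted, d_model) activations on unaudited episodes
# probe = train_tampering_probe(trusted_acts, trusted_labels)
# flags = flag_tampering(probe, untrusted_acts)
```

The worry above is exactly that this kind of generalization from human-understandable data might silently fail once the model’s reasons for its outputs are no longer legible to humans.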
See the ELK report for more discussion.