The main weakness of MTD is the somewhat fuzzy properties it relies on: so far, we haven’t been able to find a precise definition of measurement tampering that can be used to unambiguously determine if a given dataset has the right “measurement tampering structure”.
Distinguishing the difference between “doing X makes the scores from the weak supervisor better in valid ways, by actually making things better in ways that the weak supervisor can detect” and “doing Y makes the scores from the weak supervisor better in invalid ways, by making the weak supervisor more likely to make a mistake in a favorable direction” is a value judgement that a) I would expect LLM to understand humans well enough to be able to make moderately good judgments on (i.e. ones likely to match human judgements), and b) I would expect suitably scaffolded and prompted LLMs to be able to describe to humans and ask their opinion on for confirmation.
I expect that future powerful model will know whether they are measurement tampering. They are also reasonably likely to fully understand that human supervisors would prefer this not to happen.
However, it’s less clear that various approaches related to using LLMs will work, particularly in cases where humans don’t understand what is going on at all (e.g. imagine alphafold4, but after RL to make it generate proteins which are useful in particular ways—it’s pretty unclear if humans would understand why the proteins work).
I think it’s totally plausible that very simple baselines like “just ask the model and generalize from data humans understand” will work for even quite powerful models. But, they might fail and we’d like to start working on tricky cases now
Distinguishing the difference between “doing X makes the scores from the weak supervisor better in valid ways, by actually making things better in ways that the weak supervisor can detect” and “doing Y makes the scores from the weak supervisor better in invalid ways, by making the weak supervisor more likely to make a mistake in a favorable direction” is a value judgement that a) I would expect LLM to understand humans well enough to be able to make moderately good judgments on (i.e. ones likely to match human judgements), and b) I would expect suitably scaffolded and prompted LLMs to be able to describe to humans and ask their opinion on for confirmation.
I expect that future powerful model will know whether they are measurement tampering. They are also reasonably likely to fully understand that human supervisors would prefer this not to happen.
However, it’s less clear that various approaches related to using LLMs will work, particularly in cases where humans don’t understand what is going on at all (e.g. imagine alphafold4, but after RL to make it generate proteins which are useful in particular ways—it’s pretty unclear if humans would understand why the proteins work).
I think it’s totally plausible that very simple baselines like “just ask the model and generalize from data humans understand” will work for even quite powerful models. But, they might fail and we’d like to start working on tricky cases now
See the ELK report for more discussion.