To be clear, I don’t know the answer to this!
Spitballing here, the key question to me seems to be about the out-of-distribution (OOD) generalization behavior of ML models. Models that achieve similarly low loss on the training distribution can still behave in many different ways on real inputs, so we need to know which generalization strategies are likely to be learned for a given architecture, training procedure, and dataset. There is some evidence in this direction suggesting that ML models are biased towards a simplicity prior over generalization strategies.
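To make the under-determination point concrete, here's a minimal toy sketch (my own illustration, not anything from the discussion above): two polynomial models fit the same narrow training set almost perfectly, yet diverge badly once you evaluate them outside the training range. The specific degrees and noise levels are arbitrary choices for the demo.

```python
# Toy sketch: two models with near-identical training loss that generalize very differently OOD.
import numpy as np

rng = np.random.default_rng(0)

# Training inputs drawn from a narrow range; the "true" function is linear.
x_train = rng.uniform(-1.0, 1.0, size=20)
y_train = 2.0 * x_train + 0.01 * rng.normal(size=20)

# A simple model (degree-1 polynomial) and a more complex one (degree-9).
simple = np.polynomial.Polynomial.fit(x_train, y_train, deg=1)
complex_ = np.polynomial.Polynomial.fit(x_train, y_train, deg=9)

def mse(model, x, y):
    return float(np.mean((model(x) - y) ** 2))

# Both achieve near-zero loss on the training distribution...
print("train MSE (simple): ", mse(simple, x_train, y_train))
print("train MSE (complex):", mse(complex_, x_train, y_train))

# ...but behave very differently on out-of-distribution inputs.
x_ood = np.linspace(2.0, 4.0, 50)
y_ood = 2.0 * x_ood
print("OOD MSE (simple): ", mse(simple, x_ood, y_ood))
print("OOD MSE (complex):", mse(complex_, x_ood, y_ood))
```

The training data alone doesn't pin down which of these the learner picks; that's exactly where an inductive bias like a simplicity prior comes in.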
If this is true, then the incredibly handwave-y solution is to just create a dataset where the simplest (good) process for estimating labels is to emulate an aligned human. At first pass this actually looks quite easy—it’s basically what we’re doing with language models already.
Unfortunately, there's quite a lot we've swept under the rug. In particular, this may not scale as models get more powerful: the simplicity prior can be overcome when a more complex strategy achieves lower loss, and if the dataset contains labels that humans unknowingly got wrong, the lowest-loss process for estimating labels is to say what humans believe is true rather than what actually is true. This can already be seen in the sycophancy problems today's LLMs are having.
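Here's a minimal sketch of that failure mode, under entirely made-up assumptions: a synthetic "human labeler" who only attends to one of two relevant features, so their labels are systematically wrong on part of the input space. A model trained against those labels ends up agreeing with the labeler far more than with the ground truth, because that's the lower-loss strategy.

```python
# Toy sketch: training on fallible human labels rewards predicting human beliefs, not the truth.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 2))

# Ground truth depends on both features.
truth = (X[:, 0] + X[:, 1] > 0).astype(int)

# The hypothetical human labeler only "sees" the first feature, so their labels
# are systematically wrong whenever the second feature flips the answer.
human_label = (X[:, 0] > 0).astype(int)

# Train against the human-provided labels.
model = LogisticRegression().fit(X, human_label)
pred = model.predict(X)

print("agreement with human labels:", (pred == human_label).mean())
print("agreement with ground truth:", (pred == truth).mean())
# The model scores much better against the labeler's beliefs than against the truth.
```

Nothing about this toy setup is specific to LLMs, but the same incentive structure is plausibly behind the sycophancy issue: the labels reward agreement with the rater, so agreement with the rater is what gets learned.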
There are a lot of other thorny problems in this vein that you can come up with after a few minutes of thinking. That being said, it doesn't seem completely doomed to me! There just needs to be a lot more work here. (But I haven't spent too long thinking about this, so I could be wrong.)