I agree this is a problem. We need to keep it guessing about the simulation target. Some possible strategies:
Add noise, by grading it incorrectly with some probability.
On training point i, reward it for matching H_{n_i} for a random value of n_i.
Make humans a high-dimensional target. In my original proposal, H_n was strictly stronger as n increases, but we could instead take H_n to be a committee of experts. Say there are 100 types of relevant expertise. On each training point, we reward the model for matching a random committee of 50 experts selected from the pool of 100. It’s too expensive to simulate all (100 choose 50) possible committees!
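To make the three strategies concrete, here is a minimal Python sketch. The function names, the flip probability, and the integer encoding of experts are my own illustrative assumptions, not part of any existing training setup; only the pool size of 100 and committee size of 50 come from the numbers above.

```python
import math
import random

NUM_EXPERTS = 100     # size of the expert pool (from the example above)
COMMITTEE_SIZE = 50   # experts per committee (from the example above)


def noisy_grade(is_correct, flip_prob, rng):
    """Strategy 1 (hypothetical helper): flip the true grade with probability flip_prob."""
    return (not is_correct) if rng.random() < flip_prob else is_correct


def sample_target_level(max_n, rng):
    """Strategy 2 (hypothetical helper): draw a random n_i in 1..max_n for training point i."""
    return rng.randint(1, max_n)


def sample_committee(rng):
    """Strategy 3 (hypothetical helper): draw a random committee of distinct expert indices."""
    return sorted(rng.sample(range(NUM_EXPERTS), COMMITTEE_SIZE))


# Number of distinct committees the model would have to simulate to cover them all.
num_committees = math.comb(NUM_EXPERTS, COMMITTEE_SIZE)

rng = random.Random(0)  # fixed seed only so the sketch is reproducible
committee = sample_committee(rng)
level = sample_target_level(10, rng)
grade = noisy_grade(True, 0.1, rng)

print(f"committees to cover: {num_committees:.3e}")
print(f"sampled committee size: {len(committee)}, target level: {level}, grade: {grade}")
```

The count `math.comb(100, 50)` is on the order of 10^29, which is the sense in which simulating every committee is out of reach.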
None of these randomization strategies is foolproof in the worst case. But I can imagine proving something like “the model is exponentially unlikely to learn an H_100 simulator” where H_100 is now the full committee of all 100 experts. Hence my question about large deviations.