I wanted to comment on this one because I’ve thought about this general sort of approach a fair amount. It seems like the kind of thing I would naturally start with if trying to solve this problem in the real world, and I’ve felt a bit frustrated that I haven’t really found a version of it that seems to work in the game here. That said, I don’t think we need super-exotically pessimistic assumptions to get a problem with this approach.
In the most recent example you gave, it’s always rewarded for being “right” and punished for being “wrong”, meaning it’s always rewarded for matching H_100 and always punished for not doing so. So there’s no way our rewards favor “be right” over “imitate H_100”, and “imitate H_100” is (according to the stated assumptions) easier to learn.
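To make that concrete, here is a toy sketch of why the reward signal can't separate the two policies. Everything in it is an assumption for illustration: the `truth` and `h100_answer` functions, the parity task, and the cutoff at 8 are made up, and the only thing it is meant to show is that when every label comes from H_100, "be right" and "imitate H_100" collect identical rewards on the training set even though they disagree elsewhere.

```python
# Toy illustration (all names and the task itself are hypothetical).

def truth(x):
    """Stand-in for the correct answer."""
    return x % 2


def h100_answer(x):
    """Stand-in for what H_100 would say: right on easy inputs, wrong on hard ones."""
    return x % 2 if x < 8 else 0


def reward(policy, x):
    """The training signal only ever compares the policy's answer to H_100's answer."""
    return 1 if policy(x) == h100_answer(x) else -1


be_right = truth
imitate_h100 = h100_answer

# On the training set, H_100 happens to be right everywhere.
train_inputs = list(range(8))

rewards_right = [reward(be_right, x) for x in train_inputs]
rewards_imitate = [reward(imitate_h100, x) for x in train_inputs]

# Identical reward streams: training can't prefer one policy over the other.
print(rewards_right == rewards_imitate)

# Off the training set the two policies come apart.
print(be_right(9), imitate_h100(9))
```

Since the reward streams match point for point, whichever policy is easier to learn wins by default, which is the worry.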
Another way of thinking about this:
Imagine that you show the AI H_1 for a while, then start punishing it for failing to match H_2. I think the default outcome here is that it learns to imitate H_2. If you then start punishing it for failing to match H_3, it learns to imitate H_3. Perhaps after a few rounds of this, it learns to “look ahead” some number of steps: for example, after its H_2 imitation fails on H_3, it learns to imitate H_5 or so; after that fails on H_6, maybe it learns to imitate H_10 or so.
The intended model has the advantage that it generalizes to all 100 data sets we can throw at it, but this is the same advantage that H_100 has, and H_100 is exactly what we’ve hypothesized is (unfortunately) easier for it to learn. So even if at some point it starts reasoning “I need something that will never fail to generalize,” this seems more likely to be H_100 by default.
I agree this is a problem. We need to keep it guessing about the simulation target. Some possible strategies:
Add noise, by grading it incorrectly with some probability.
On training point i, reward it for matching H_{n_i} for a random value of n_i.
Make humans a high-dimensional target. In my original proposal, H_n grew strictly stronger as n increased, but we could instead take H_n to be a committee of experts. Say there are 100 types of relevant expertise. On each training point, we reward the model for matching a random committee of 50 experts selected from the pool of 100. It’s too expensive to simulate all (100 choose 50) possible committees!
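As a rough sketch of what these three strategies could look like in a training loop, and of how large the committee space actually is, something like the following might do. The function names, the 5% noise rate, and the uniform choice of n_i are my assumptions for illustration; only the pool of 100 and the committee size of 50 come from the proposal.

```python
import math
import random

NUM_EXPERTS = 100      # pool size from the proposal
COMMITTEE_SIZE = 50    # committee size from the proposal
NOISE_PROB = 0.05      # hypothetical grading-error rate (not from the proposal)


def noisy_grade(is_match, rng, noise_prob=NOISE_PROB):
    """Strategy 1: flip the grade with probability noise_prob."""
    flip = rng.random() < noise_prob
    return (not is_match) if flip else is_match


def sample_level(rng, max_level=100):
    """Strategy 2: draw a random index n_i for this training point (uniform is an assumption)."""
    return rng.randint(1, max_level)


def sample_committee(rng):
    """Strategy 3: draw a random committee of 50 experts from the pool of 100."""
    return frozenset(rng.sample(range(NUM_EXPERTS), COMMITTEE_SIZE))


rng = random.Random(0)
committee = sample_committee(rng)
level = sample_level(rng)

# Number of distinct committees a simulator would have to cover.
num_committees = math.comb(NUM_EXPERTS, COMMITTEE_SIZE)
print(len(committee), level, f"{num_committees:.3e}")
```

The count comes out to roughly 10^29 committees, which is the sense in which simulating all of them is too expensive.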
None of these randomization strategies is foolproof in the worst case. But I can imagine proving something like “the model is exponentially unlikely to learn an H_100 simulator”, where H_100 is now the full committee of all 100 experts. Hence my question about large deviations.