To your point, sure, an H100 simulator will get perfect reward, but the model doesn’t see x′, so how would it acquire the ability to simulate H100?
In the worst-case game we’re playing, I can simply say “the reporter we get happens to have this ability because that happens to be easier for SGD to find than the direct translation ability.”
When living in worst-case land, I often imagine random search across programs rather than SGD. Imagine we were plucking reporters at random from a giant barrel of possible reporters, rejecting any that didn’t perform perfectly in whatever training process we set up and keeping the first one that did. In that case, if we happened to pluck out a reporter which answered questions by simulating H100, we’d be screwed, because that reporter would perform perfectly in the training process you described.
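This random-search picture can be caricatured as rejection sampling over programs. A toy sketch in Python (all names and the stand-in "reporters" are hypothetical, purely to illustrate that training reward alone can't separate two perfect scorers):

```python
import random

# Toy training set: inputs paired with labels that both the "direct
# translator" and the "H100 simulator" happen to reproduce in training.
train = [(x, x % 3) for x in range(20)]

# Hypothetical barrel of candidate reporters. The first two agree with the
# labels on every training point, so rejection sampling can't tell them apart.
def direct_translator(x):
    return x % 3   # stands in for "report what the model actually believes"

def h100_simulator(x):
    return x % 3   # stands in for "predict what H100 would say"

def broken_reporter(x):
    return 0       # fails on some training points, so it always gets rejected

barrel = [direct_translator, h100_simulator, broken_reporter]

def pluck_reporter(rng):
    """Keep drawing random reporters until one scores perfectly on training."""
    while True:
        reporter = rng.choice(barrel)
        if all(reporter(x) == y for x, y in train):
            return reporter

winner = pluck_reporter(random.Random(0))
# The winner is whichever perfect scorer we happened to draw first; the
# training signal gives us no control over which of the two it is.
```

The point of the sketch: any reporter that perfectly matches the training labels survives the filter, so whether we get the direct translator or the H100 simulator depends only on which is "easier to find," not on the reward.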
SGD is not the same as plucking programs out of the air at random, but when we’re playing the worst-case game, it’s on the builder to provide a compelling argument that SGD will definitely not find this particular type of program.
You’re pointing at an intuition (“the model is never shown x′”), but that’s not a sufficiently tight argument in the worst-case context: models (especially powerful/intelligent ones) often generalize to understanding many things they weren’t explicitly shown in their training data. In fact, we don’t show the model exactly how to do direct translation between the nodes in its Bayes net and the nodes in our Bayes net (we can’t even expose those nodes), so we are relying on the direct translator to also have abilities it wasn’t explicitly shown in training. The question is just which of those abilities is easier for SGD to build up; the counterexample here is “the H100 imitator happens to be easier.”
Thanks! It’s your game, you get to make the rules :)
I think my other proposal, Withhold Material Information, passes this counterexample, because the reporter literally doesn’t have the information it would need to simulate the human.
I wanted to comment on this one because I’ve thought about this general sort of approach a fair amount. It seems like the kind of thing I would naturally start with if trying to solve this problem in the real world, and I’ve felt a bit frustrated that I haven’t really found a version of it that seems to work in the game here. That said, I don’t think we need super-exotically pessimistic assumptions to get a problem with this approach.
In the most recent example you gave, the model is always rewarded for being “right” and punished for being “wrong”: that is, always rewarded for matching H_100 and punished for failing to match it. So there’s no way our rewards favor “be right” over “imitate H_100”, and “imitate H_100” is (by the stated assumptions) easier to learn.
Another way of thinking about this:
Imagine that you show the AI H_1 for a while, then start punishing it for failing to match H_2. I think the default outcome here is that it learns to imitate H_2. If you then start punishing it for failing to match H_3, it learns to imitate H_3. Perhaps after a few rounds of this, it learns to “look ahead” some # of steps: for example, after learning to imitate H_2 failed on H_3, it learns to imitate H_5 or so; after that fails on H_6, maybe it learns to imitate H_10 or so.
The intended model has the advantage that it generalizes to all 100 data sets we can throw at it, but this is the same advantage that H_100 has, and H_100 is exactly what we’ve hypothesized is (unfortunately) easier for it to learn. So even if at some point it starts reasoning “I need something that will never fail to generalize,” this seems more likely to be H_100 by default.
I agree this is a problem. We need to keep it guessing about the simulation target. Some possible strategies:
Add noise, by grading it incorrectly with some probability.
On training point i, reward it for matching H_{n_i} for a randomly chosen value of n_i.
Make humans a high-dimensional target. In my original proposal, H_n was strictly stronger as n increased, but we could instead take H_n to be a committee of experts. Say there are 100 types of relevant expertise. On each training point, we reward the model for matching a random committee of 50 experts drawn from the pool of 100. It’s too expensive to simulate all (100 choose 50) possible committees!
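For a sense of scale, the number of distinct 50-expert committees can be computed directly (a quick sanity check with Python's `math.comb`, not part of the proposal itself):

```python
import math

# Number of distinct 50-expert committees drawn from a pool of 100 experts.
n_committees = math.comb(100, 50)
print(f"{n_committees:.2e}")  # on the order of 1e29 distinct committees
# Simulating every possible committee separately is hopeless at this scale.
```

So a reporter can’t cheaply memorize or simulate each committee one by one; if it wants to match all of them, it seems to need something that generalizes across committees.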
None of these randomization strategies is foolproof in the worst case. But I can imagine proving something like “the model is exponentially unlikely to learn an H_100 simulator,” where H_100 is now the full committee of all 100 experts. Hence my question about large deviations.