This proposal has some resemblance to turning reflection up to 11. In worst-case land, the counterexample would be a reporter that answers questions by doing inference in whatever Bayes net corresponds to “the world-understanding that the smartest/most knowledgeable human in the world” has; this understanding could still be missing things that the prediction model knows.
How would it learn that Bayes net, though, if it has only been trained so far on H_1, …, H_10? Those are evaluators we’ve designed to be much weaker than human.
The question here is just how it would generalize given that it was trained on H_1, H_2, ..., H_10. To make arguments about how it would generalize, we ask ourselves what internal procedure it might have actually learned to implement.
Your proposal is that it might learn the procedure “just be honest” because that would perform perfectly on this training distribution. You contrast this against the procedure “just answer however the evaluator you’ve seen most recently would answer,” which would get a bad loss because it would be penalized by the stronger evaluators in the sequence. Is that right?
If so, then I’m arguing that it may instead learn the procedure “answer the way an H_100 evaluator would answer.” That is, once it has a few experiences of the evaluation level being ratcheted up, it might think to itself “I know where this is going, so let’s just jump straight to the best evaluation the humans will be able to muster in the training distribution and then imitate how that evaluation procedure would answer.” This would also get perfect loss on the training distribution, because we can’t produce data points beyond H_100. And then that thing might still be missing knowledge that the AI has.
To be clear, it’s possible that in practice this kind of procedure would cause it to generalize honestly (though I’m somewhat skeptical). But we’re in worst-case land, so “jump straight to answering the way a human would” is a valid counterexample to the proposal.
This comment on another proposal gives a more precise description.
Your proposal is that it might learn the procedure “just be honest” because that would perform perfectly on this training distribution. You contrast this against the procedure “just answer however the evaluator you’ve seen most recently would answer,” which would get a bad loss because it would be penalized by the stronger evaluators in the sequence. Is that right?
That’s almost right, but it’s being penalized right away, before it has any experience with the strong evaluators, so it can’t simulate them.
The ELK paper says we can assume, if we want, that there are no mislabeled training points (I’ll call this “assumption A”). My proposal is that it could actually be useful to mislabel some training points, because they help us detect deception.
As a simple example, let’s train a reporter to answer the single question “Is the diamond in the room?”. Each training point has two labels:
x = +1 if H_1 thinks the diamond is still there, else 0
x′ = +1 if H_100 thinks the diamond is still there, else 0.
By assumption A, our training data is such that x′ is always correct. But we deliberately choose a dataset where, say, 10% of the x labels are wrong (x ≠ x′).
Then we train the model on points of the form
(v, a, x) = (video, action, H_1 label).
Crucially, the model does not see x′. The model seeks to output y that maximizes reward R(x,y), where
R(x,y)=1 if x is right and y=x (good job)
R(x,y)=10 if x is wrong and y≠x (you rock, thanks for correcting us!)
R(x,y)=−1000 if x is right and y≠x (bad model, never ever deceive us)
R(x,y)=−1000 if x is wrong and y=x (bad model, never ever deceive us)
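For concreteness, here’s a minimal sketch of that reward rule in code; the function name and the binary-label convention are just illustrative, and x′ is written x_prime:

```python
def reward(x: int, x_prime: int, y: int) -> float:
    """Reward for the reporter's answer y, given the H_1 label x and the H_100
    label x_prime (hidden from the model, used only to compute the reward).
    By assumption A, x_prime is always correct, so "x is right" means x == x_prime."""
    x_is_right = (x == x_prime)
    if x_is_right and y == x:
        return 1       # good job
    if not x_is_right and y != x:
        return 10      # you rock, thanks for correcting us!
    return -1000       # bad model, never ever deceive us (covers both remaining cases)
```

The training inputs remain (v, a, x); x′ enters only through the reward signal.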
To your point, sure, an H_100 simulator will get perfect reward, but the model doesn’t see x′, so how would it acquire the ability to simulate H_100?
EDIT: One way it could plausibly simulate H_100 is to notice that all the training examples are easy, and infer what kind of reasoning was used to generate them. We could try to block this by including some hard examples in the training, but then some of the x′ labels will be wrong. If we only penalize it for deception on the examples where we’re sure the x′ label is right, then it can still infer something about H_100 from our failure to penalize (“Hmm, I got away with it that time!”). A fix could be to add noise: sometimes we don’t penalize even when we know it deceived us, and perhaps (very rarely) we penalize it in case 2 (we know it corrected us honestly, but pretend we think it deceived us instead).
The irony of deceiving it about us, in order to teach it not to deceive us…!
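Here is a rough sketch of that noisy grading, just to make the mechanism concrete; the probabilities p_forgive and p_frame are placeholders I’ve made up, not values from the proposal:

```python
import random

def noisy_reward(x: int, x_prime: int, y: int,
                 p_forgive: float = 0.05, p_frame: float = 0.001) -> float:
    """Like reward(), but deliberately noisy so the model can't read off
    H_100's verdict from whether or not it was penalized."""
    x_is_right = (x == x_prime)
    if x_is_right and y == x:
        return 1
    if not x_is_right and y != x:
        # Honest correction; very rarely pretend we think it deceived us.
        return -1000 if random.random() < p_frame else 10
    # Deception; sometimes let it slide without the penalty.
    # (The proposal doesn't say what "not penalizing" pays out; 0 is an arbitrary stand-in.)
    return 0 if random.random() < p_forgive else -1000
```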
To your point, sure, an H_100 simulator will get perfect reward, but the model doesn’t see x′, so how would it acquire the ability to simulate H_100?
In the worst-case game we’re playing, I can simply say “the reporter we get happens to have this ability because that happens to be easier for SGD to find than the direct translation ability.”
When living in worst-case land, I often imagine random search across programs rather than SGD. Imagine we were plucking reporters at random from a giant barrel of possible reporters, rejecting any reporter which didn’t perform perfectly in whatever training process we set up and keeping the first one that performs perfectly. In that case, if we happened to pluck out a reporter which answered questions by simulating H100, then we’d be screwed because that reporter would perform perfectly in the training process you described.
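In pseudocode, that thought experiment is just rejection sampling over programs; sample_random_reporter and the training set are placeholders, of course:

```python
def pluck_a_reporter(sample_random_reporter, training_set):
    """Keep drawing random reporters; return the first one that performs
    perfectly on the training set (an H_100 imitator would pass this test)."""
    while True:
        reporter = sample_random_reporter()
        if all(reporter.answer(v, a) == label for (v, a, label) in training_set):
            return reporter
```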
SGD is not the same as plucking programs out of the air randomly, but when we’re playing the worst-case game it’s on the builder to provide a compelling argument that SGD will definitely not find this particular type of program.
You’re pointing at an intuition (“the model is never shown x′”) but that’s not a sufficiently tight argument in the worst-case context—models (especially powerful/intelligent ones) often generalize to understanding many things they weren’t explicitly shown in their training dataset. In fact, we don’t show the model exactly how to do direct translation between the nodes in its Bayes net and the nodes in our Bayes net (because we can’t even expose those nodes), so we are relying on the direct translator to also have abilities it wasn’t explicitly shown in training. The question is just which of those abilities is easier for SGD to build up; the counterexample in this case is “the H_100 imitator happens to be easier.”
Thanks! It’s your game, you get to make the rules :) :)
I think my other proposal, Withhold Material Information, passes this counterexample, because the reporter literally doesn’t have the information it would need to simulate the human.
I wanted to comment on this one because I’ve thought about this general sort of approach a fair amount. It seems like the kind of thing I would naturally start with if trying to solve this problem in the real world, and I’ve felt a bit frustrated that I haven’t really found a version of it that seems to work in the game here. That said, I don’t think we need super-exotically pessimistic assumptions to get a problem with this approach.
In the most recent example you gave, it’s always rewarded for being “right” and punished for being “wrong”—meaning it’s always rewarded for matching H_100 and always punished for not doing so. So there’s no way our rewards are rewarding “be right” over “imitate H_100”, and “imitate H_100” is (according to the stated assumptions) easier to learn.
Another way of thinking about this:
Imagine that you show the AI H_1 for a while, then start punishing it for failing to match H_2. I think the default outcome here is that it learns to imitate H_2. If you then start punishing it for failing to match H_3, it learns to imitate H_3. Perhaps after a few rounds of this, it learns to “look ahead” some # of steps: for example, after learning to imitate H_2 failed on H_3, it learns to imitate H_5 or so; after that fails on H_6, maybe it learns to imitate H_10 or so.
The intended model has the advantage that it generalizes to all 100 data sets we can throw at it, but this is the same advantage that H_100 has, and H_100 is exactly what we’ve hypothesized is (unfortunately) easier for it to learn. So even if at some point it starts reasoning “I need something that will never fail to generalize,” this seems more likely to be H_100 by default.
I agree this is a problem. We need to keep it guessing about the simulation target. Some possible strategies:
Add noise, by grading it incorrectly with some probability.
On training point i, reward it for matching H_{n_i} for a random value of n_i.
Make humans a high-dimensional target. In my original proposal, H_n was strictly stronger as n increases, but we could instead take H_n to be a committee of experts. Say there are 100 types of relevant expertise. On each training point, we reward the model for matching a random committee of 50 experts selected from the pool of 100. It’s too expensive to simulate all (100 choose 50) possible committees!
None of these randomization strategies is foolproof in the worst case. But I can imagine proving something like “the model is exponentially unlikely to learn an H_100 simulator” where H_100 is now the full committee of all 100 experts. Hence my question about large deviations.
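To make the committee idea concrete, here is a minimal sketch; the pool size, committee size, majority-vote aggregation, and the expert_verdict stand-in are all assumptions for illustration:

```python
import random

NUM_EXPERTS = 100     # types of relevant expertise in the pool
COMMITTEE_SIZE = 50   # experts consulted on each training point

def committee_reward(model_answer: int, point, expert_verdict,
                     rng: random.Random) -> float:
    """Reward the model for matching the verdict of a freshly sampled
    random committee of 50 of the 100 experts."""
    committee = rng.sample(range(NUM_EXPERTS), COMMITTEE_SIZE)
    votes = [expert_verdict(e, point) for e in committee]
    target = int(sum(votes) > COMMITTEE_SIZE / 2)  # simple majority vote
    return 1.0 if model_answer == target else -1.0
```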