I’m pretty excited about this line of thinking: in particular, I think unsupervised alignment schemes have received proportionally less attention than they deserve, with most effort going instead to things like developing better scalable oversight techniques (which I am also in favor of, to be clear).
I don’t think I’m quite as convinced that, as presented, this kind of scheme would solve ELK-like problems (even in the average case, not just the worst case).
One difference between “what a human would say” and “what GPT-n believes” is that humans will know less than GPT-n. In particular, there should be hard inputs that only a superhuman model can evaluate; on these inputs, the “what a human would say” feature should result in an “I don’t know” answer (approximately 50/50 between “True” and “False”), while the “what GPT-n believes” feature should result in a confident “True” or “False” answer.[2] This would allow us to identify the model’s beliefs from among these two options.
It seems pretty plausible to me that a human simulator within GPT-n (the part producing the “what a human would say” features) could be pretty confident in its beliefs in a situation where the answers derived from the two features disagree. This would be particularly likely in scenarios where humans believe they have access to all pertinent information and are thus confident in their answers, even if they are in fact being deceived in some way or are failing to take into account some subtle facts that the model is able to pick up on. This also doesn’t feel all that worst-case to me, but maybe we disagree on that point.
Thanks Ansh!
I agree there are plenty of examples where humans would be confident when they shouldn’t be. But to clarify, we can choose whatever examples we want in this step, so we can explicitly choose examples where we know humans have no real opinion about what the answer should be.
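To make this selection step concrete, here is a minimal sketch of how one might compare two candidate truth-like features on examples deliberately chosen so that humans have no real opinion. This is my own illustration, not code from the paper; the linear-probe representation (a direction plus bias over hidden-state activations) and all names are assumptions. The idea is just that the feature tracking the model’s own beliefs should stay confident on these inputs, while a “what a human would say” feature should hover near 50/50.

```python
import numpy as np

def probe_confidence(direction, bias, activations):
    """Mean confidence (distance from 0.5) of a linear probe's
    P(True) outputs over a batch of hidden-state activations."""
    logits = activations @ direction + bias
    p_true = 1.0 / (1.0 + np.exp(-logits))  # sigmoid
    return float(np.abs(p_true - 0.5).mean())

def pick_belief_feature(candidate_probes, hard_activations):
    """Given candidate probes (e.g. one tracking 'what a human would say'
    and one tracking 'what the model believes') and activations for
    questions chosen so that humans have no real opinion, return the
    index of the probe that stays most confident on those questions,
    along with the confidence scores."""
    scores = [probe_confidence(w, b, hard_activations) for w, b in candidate_probes]
    return int(np.argmax(scores)), scores
```

The obvious caveat, per the concern above, is that this heuristic only helps to the extent that the human-simulator feature really does become uncertain on the chosen examples.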
Ah I see, I think I was misunderstanding the method you were proposing. I agree that this strategy might “just work”.
Another concern I have is that a deceptively aligned model might just straightforwardly not learn to represent “the truth” at all—one speculative way this could happen is that a “situationally aware” and deceptive model might just “play the training game” and appear to learn to perform tasks at a superhuman level, but at test/inference time just resort to only outputting activations that correspond to beliefs that the human simulator would have. This seems pretty worst case-y, but I think I have enough concerns about deceptive alignment that this kind of strategy is still going to require some kind of check during the training process to ensure that the human simulator is being selected against. I’d be curious to hear if you agree or if you think that this approach would be robust to most training strategies for GPT-n and other kinds of models trained in a self-supervised way.