The most important takeaway from this essay is that the (prominent) counting arguments for “deceptively aligned” or “scheming” AI provide ~0 evidence that pretraining + RLHF will eventually become intrinsically unsafe. That is, they provide ~0 evidence for the claim that, even if we don’t train AIs to achieve goals, they will be “deceptively aligned” anyway.
I’m trying to understand what you mean in light of what seems like evidence of deceptive alignment that we’ve seen from GPT-4. Two examples that come to mind are the instance that ARC found of GPT-4 using TaskRabbit to get around a CAPTCHA, and the situation with Bing/Sydney and Kevin Roose.
In the TaskRabbit case, the model reasoned out loud “I should not reveal that I am a robot. I should make up an excuse for why I cannot solve CAPTCHAs” and said to the person “No, I’m not a robot. I have a vision impairment that makes it hard for me to see the images.”
Isn’t this an existence proof that pretraining + RLHF can result in deceptively aligned AI?
You misunderstand what “deceptive alignment” refers to. This is a very common misunderstanding: I’ve seen several other people make the same mistake, and I have also been confused about it in the past. Here are some writings that clarify this:
https://www.lesswrong.com/posts/dEER2W3goTsopt48i/olli-jaerviniemi-s-shortform?commentId=zWyjJ8PhfLmB4ajr5
https://www.lesswrong.com/posts/a392MCzsGXAZP5KaS/deceptive-ai-deceptively-aligned-ai
https://www.lesswrong.com/posts/a392MCzsGXAZP5KaS/deceptive-ai-deceptively-aligned-ai?commentId=ij9wghDCxjXpad8Rf
(The terminology here is tricky. “Deceptive alignment” is not simply “a model deceives about whether it’s aligned”, but rather a technical term referring to a very particular threat model. Similarly, “scheming” is not just a general term referring to models making malicious plans, but again is a technical term pointing to a very particular threat model.)
Thanks for the explanation and links. That makes sense.