As I keep saying, deception is not some unique failure mode. Humans are constantly engaging in various forms of deception. It is all over the training data, and any reasonable set of next token predictions. There is no clear line between deception and not deception. And yes, to the extent that humans reward ‘deceptive’ responses, as they no doubt often will inadvertently do, the model will ‘learn’ deception.
https://www.lesswrong.com/posts/a392MCzsGXAZP5KaS/deceptive-ai-deceptively-aligned-ai
(found in the comments of this prediction market)