think about how humans most often deceive other humans: we do it mainly by deceiving ourselves… when that sort of deception happens, I wouldn’t necessarily expect to be able to see deception in an AI’s internal thoughts
The fact that humans give different predictions when forced to make an explicit bet than when just talking casually seems to imply that it's theoretically possible to identify deception, even in cases of self-deception.
Of course! I don’t intend to claim that it’s impossible-in-principle to detect this sort of thing. But if we’re expecting “thinking in ways which are selected for deceiving humans”, then we need to look for different (and I’d expect more general) things than if we’re just looking for “thinking about deceiving humans”.
(Though, to be clear, it does not seem like any current prosaic alignment work is on track to do either of those things.)