Of course! I don’t intend to claim that it’s impossible-in-principle to detect this sort of thing. But if we’re expecting “thinking in ways which are selected for deceiving humans”, then we need to look for different (and I’d expect more general) things than if we’re just looking for “thinking about deceiving humans”.
(Though, to be clear, it does not seem like any current prosaic alignment work is on track to do either of those things.)