I’m not certain, but I think the explanation might be that Zvi was thinking of “deception”, whereas Joe, Quintin, and Nora were talking about the more specific “deceptive alignment”.
Deceptive alignment is more centrally a special case of being trustworthy (what the “alignment” part of “deceptive alignment” refers to), not of being deceptive. In a recent post, Zvi says:
We are constantly acting in order to make those around us think well of us, trust us, expect us to be on their side, and so on. We learn to do this instinctually, all the time, distinct from what we actually want. Our training process, childhood and in particular school, trains this explicitly, you need to learn to show alignment in the test set to be allowed into the production environment, and we act accordingly.
A human is considered trustworthy rather than deceptively aligned when they are only doing this within a bounded set of rules, and not outright lying to you. They still engage in massive preference falsification, in doing things and saying things for instrumental reasons, all the time.
My model says that if you train a model using current techniques, of course exactly this happens.