FWIW, I agree that respecting extensional equivalence is necessary if we want a perfect detector, but most of my optimism comes from worlds where we don’t need one that’s quite perfect. For example, maybe we prevent deception by looking at the internal structure of networks, and then get a good policy even though we couldn’t have ruled out every single policy that’s extensionally equivalent to the one we did rule out. To me, it seems quite plausible that all policies within one extensional equivalence class (i.e., all policies with identical input-output behavior) are either structurally quite similar or so complex that our training process doesn’t find them anyway. The main point of a deception detector wouldn’t be to give us guarantees, but instead to give training a sufficiently strong inductive bias against simple deceptive policies, such that the next-simplest policy is non-deceptive.
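To make the "inductive bias rather than guarantee" framing concrete, here's a minimal sketch of how such a detector might enter training as a soft penalty. Everything here is a hypothetical stand-in, assuming a differentiable detector that scores a network's internal structure: `DeceptionDetector`, its `score` method, and the penalty weight `lam` are illustrative assumptions, not an existing API or a proposed implementation.

```python
import torch
import torch.nn as nn

class DeceptionDetector(nn.Module):
    """Hypothetical detector that assigns a scalar 'deceptive structure'
    score to a model's internals. The real object would be something like
    a probe over weights/activations; the body below is a placeholder
    that just returns a differentiable scalar from the parameters."""
    def score(self, model: nn.Module) -> torch.Tensor:
        return torch.stack([p.abs().mean() for p in model.parameters()]).mean()

def training_step(model, detector, batch, task_loss_fn, optimizer, lam=1.0):
    """One training step where the detector acts as a regularizer, not a
    guarantee: it only needs to make simple deceptive policies costly
    enough that optimization settles on a non-deceptive one, even if
    some extensionally equivalent deceptive policies go undetected."""
    x, y = batch
    loss = task_loss_fn(model(x), y) + lam * detector.score(model)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Illustrative usage:
# model = nn.Linear(10, 2)
# detector = DeceptionDetector()
# opt = torch.optim.SGD(model.parameters(), lr=1e-2)
# training_step(model, detector,
#               (torch.randn(8, 10), torch.randint(0, 2, (8,))),
#               nn.CrossEntropyLoss(), opt)
```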