To the second point, I meant something very different; I've edited this sentence and hopefully it is clearer now. I did not mean that T should respect extensional equivalence of policies (if it didn't, we could always simply quotient it by extensional equivalence of policies, since it outputs rather than inputs policies).
Instead, I meant that a training story which tries to mitigate your model-free learning algorithm's unbounded out-of-distribution optimality gap via an interpretability loop, where you apply a detector function to the policy to check for inner misalignment and use the result to guide policy search, has a big vulnerability: the policy search can encode similarly deceptive (or even exactly extensionally equivalent) policies in other forms that make the deceptiveness invisible to the detector. Respecting extensional equivalence is a bare-minimum kind of robustness to ask of an inner-misalignment detector that is load-bearing in an existential-safety strategy.
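For concreteness, here is one way to write down that bare-minimum property (treating policies as functions from inputs to outputs and writing $D$ for the detector):

$$\forall\, \pi_1, \pi_2:\quad \big(\forall x,\ \pi_1(x) = \pi_2(x)\big) \;\Rightarrow\; D(\pi_1) = D(\pi_2)$$

That is, the detector's verdict may depend only on the policy's input-output behaviour, so merely re-encoding a deceptive policy in a different form cannot change the verdict.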
FWIW, I agree that respecting extensional equivalence is necessary if we want a perfect detector, but most of my optimism comes from worlds where we don't need one that's quite perfect. For example, maybe we prevent deception by looking at the internal structure of networks, and then get a good policy even though we couldn't have ruled out every single policy that's extensionally equivalent to the one we did rule out. To me, it seems quite plausible that all policies within one extensional equivalence class are either structurally quite similar or so complex that our training process doesn't find them anyway. The main point of a deception detector wouldn't be to give us guarantees, but instead to give training a sufficiently strong inductive bias against simple deceptive policies, such that the next-simplest policy is non-deceptive.
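To make the "inductive bias rather than guarantee" point concrete, here is a toy sketch of the kind of loop I have in mind (every function and name below is a hypothetical placeholder, not a real detector or training setup):

```python
import random

def deception_score(policy_params):
    # Hypothetical stand-in for an interpretability-based detector that
    # inspects a candidate policy's internals and returns a score in [0, 1].
    return random.random()

def task_loss(policy_params):
    # Hypothetical stand-in for ordinary task performance (lower is better).
    return random.random()

def select_policy(candidates, penalty_weight=10.0):
    # Policy search biased against candidates the detector flags.
    # The detector only has to catch the simple deceptive candidates the
    # search would otherwise settle on; it is a bias, not a guarantee.
    return min(
        candidates,
        key=lambda p: task_loss(p) + penalty_weight * deception_score(p),
    )

candidates = [{"seed": i} for i in range(16)]  # placeholder "policies"
print("selected:", select_policy(candidates))
```

Obviously nothing here resembles a real detector; the point is just where the detector sits in the loop: as a penalty shaping which policy the search settles on, not as a classifier that has to be exhaustively correct about every extensionally equivalent encoding.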
I see, that makes much more sense than my guess, thanks!