There must be some evidence that the initial appearance of alignment was due to the model actively trying to appear aligned only in the service of some ulterior goal.
“Trying to appear aligned” seems imprecise to me, unless you mean to be more specific than the RFLO description. (See footnote 7: in general, there’s no requirement for modelling of the base optimiser or oversight system; it’s enough to understand the optimisation pressure and be uncertain whether it’ll persist.)
Are you thinking that it makes sense to agree to test for systems that are “trying to appear aligned”, or would you want to include any system that is instrumentally acting such that it’s unaltered by optimisation pressure?