I agree that the probability of pseudo-alignment will be the same, and that an unrecoverable action could occur despite the threat of modification. I’m interested in whether online learning generally makes it less likely for a deceptively aligned model to defect. I think it does, because (I expect, in most cases) it adds a threat of modification that is faster-acting and easier for a mesa-optimizer to recognise than the alternatives (e.g. a human shutting it down).
I agree with all of this—online learning doesn’t change the probability of pseudo-alignment but might make it harder for a deceptively aligned model to defect. That being said, I don’t think that deceptive models defecting later is necessarily a good thing—if your deceptive models start defecting sooner, but in recoverable ways, that’s actually good because it gives you a warning shot. And once you have a deceptive model, it’s going to try to defect against you at some point, even if it just has to gamble and defect randomly with some probability.
If I’m not just misunderstanding and there is a crux here, maybe it relates to how promising worst-case guarantees are. Worst-case guarantees are great to have, and save us from worrying about precisely how likely a catastrophic action is. Maybe I am more pessimistic than you about obtaining worst-case guarantees. I think we should do more to model the risks probabilistically.
First, I do think that worst-case guarantees are achievable if we do relaxed adversarial training with transparency tools.
Second, I actually have done a bunch of probabilistic risk analysis on exactly this sort of situation here. Note, however, that the i.i.d. situation imagined in that analysis is extremely hard to realize in practice as there are fundamental distributional shifts that are very difficult to overcome—such as the distributional shift from a situation where the model can’t defect profitably to a situation where it can.
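To make the "gamble and defect randomly with some probability" idea concrete, here is a toy sketch of the idealized i.i.d. case. It is not the analysis referred to above, only an illustration under stated assumptions: the model defects independently with probability `p` on every episode, cannot tell overseen episodes from unseen ones, is caught and corrected if it defects during any of `n_overseen` episodes, and succeeds if it first defects during one of the `m_unseen` episodes that follow. The function and parameter names are hypothetical.

```python
def success_probability(p, n_overseen, m_unseen):
    """Chance of never defecting while overseen, then defecting at least once unseen.

    Assumes independent defection with probability p on every episode (toy i.i.d. model).
    """
    q = 1.0 - p  # probability of behaving well on a single episode
    return q ** n_overseen * (1.0 - q ** m_unseen)


def best_defection_rate(n_overseen, m_unseen):
    """Per-episode defection rate that maximizes success_probability.

    Setting the derivative of q**n - q**(n + m) to zero gives q**m = n / (n + m).
    """
    q = (n_overseen / (n_overseen + m_unseen)) ** (1.0 / m_unseen)
    return 1.0 - q


if __name__ == "__main__":
    for n, m in [(1, 1), (9, 1), (1000, 10)]:
        p = best_defection_rate(n, m)
        print(n, m, p, success_probability(p, n, m))
```

With one overseen and one unseen episode, the best rate is 0.5 and the chance of a successful defection is 0.25. In this toy model, adding overseen episodes pushes the best achievable success probability down, but only so long as the model really cannot tell the two kinds of episode apart, which is exactly the assumption that distributional shift breaks.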