ryan_greenblatt comments on Would catching your AIs trying to escape convince AI developers to slow down or undeploy?

ryan_greenblatt 2 Sep 2024 19:56 UTC
LW: 2 AF: 2
Agreed, I really should have said “or possibly even train against it”. I think SGD is likely to be much worse than best-of-N over a bunch of variations on the training scheme, where the variations are intended to plausibly reduce the chance of scheming. Of course, if you are worried about scheming emerging throughout training, then you need N full training runs, which is very pricey!