Agreed, I really should have said “or possibly even train against it”. I think SGD is likely to be much worse than best-of-N over a bunch of variations on the training scheme where the variations are intended to plausibly reduce the chance of scheming. Of course, if you are worried about scheming emerging thoughout training, then you need N full training runs which is very pricy!
Agreed, I really should have said “or possibly even train against it”. I think SGD is likely to be much worse than best-of-N over a bunch of variations on the training scheme where the variations are intended to plausibly reduce the chance of scheming. Of course, if you are worried about scheming emerging thoughout training, then you need N full training runs which is very pricy!