Evan R. Murphy comments on How To Go From Interpretability To Alignment: Just Retarget The Search

Evan R. Murphy 14 Aug 2022 6:10 UTC
LW: 5 AF: 3
This seems like a good argument that retargeting the search in a trained model is unlikely to turn out to be a successful strategy. But if we get to the point where we can detect such a search process in a model and identify its target, even if its efficiency is enhanced by specialized heuristics, doesn’t that buy us a lot even without the retargeting mechanism?
We could use that info about the search process to start over and re-train the model, modifying parameters to try to guide it toward learning the optimization target that we want it to learn. Re-training is far from cheap on today’s large models, but you might not have to go through the entire training process before the optimizer emerges and gains a stable optimization target. This could allow us to iterate on the search target and verify that we have the one we want before having to deploy the model in an unsafe environment.
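To make the re-train-and-check loop above concrete, here is a minimal sketch, not a real implementation. Every name in it is hypothetical: `detect_target` stands in for an interpretability tool that does not exist today, and `init_model`, `train_steps` and `adjust_config` are placeholders for whatever training setup is in use. The "stable target" test (the same non-empty reading on several consecutive checks) is also just an assumption for illustration.

```python
from typing import Callable, Optional


def iterate_on_search_target(
    init_model: Callable[[dict], dict],
    train_steps: Callable[[dict, dict, int], dict],
    detect_target: Callable[[dict], Optional[str]],   # hypothetical interpretability probe
    adjust_config: Callable[[dict, Optional[str]], dict],
    config: dict,
    desired_target: str,
    check_every: int = 100,
    max_steps_per_run: int = 1000,
    max_runs: int = 10,
    stable_checks: int = 3,
) -> Optional[dict]:
    """Re-train from scratch until a stable, desired search target is detected."""
    for _ in range(max_runs):
        model = init_model(config)          # start over with the current config
        history = []
        observed = None
        for _ in range(max_steps_per_run // check_every):
            model = train_steps(model, config, check_every)
            history.append(detect_target(model))
            recent = history[-stable_checks:]
            # Assumed stability criterion: same non-None target on the last few checks.
            stable = (
                len(recent) == stable_checks
                and recent[0] is not None
                and len(set(recent)) == 1
            )
            if stable:
                observed = recent[0]
                break                       # stop early; no need to finish training
        if observed == desired_target:
            return model                    # verified before any deployment
        config = adjust_config(config, observed)
    return None                             # gave up within the run budget


# Toy stand-ins so the sketch runs; none of this models a real network.
def toy_init(config: dict) -> dict:
    return {"steps": 0, "bias": config["bias"]}


def toy_train(model: dict, config: dict, n: int) -> dict:
    return {**model, "steps": model["steps"] + n}


def toy_detect(model: dict) -> Optional[str]:
    if model["steps"] < 200:                # no optimizer has "emerged" yet
        return None
    return "goal_B" if model["bias"] >= 2 else "goal_A"


def toy_adjust(config: dict, observed: Optional[str]) -> dict:
    return {"bias": config["bias"] + 1}


result = iterate_on_search_target(
    toy_init, toy_train, toy_detect, toy_adjust, {"bias": 0}, "goal_B"
)
```

In the toy run, each attempt stops after 400 of the 1000 budgeted steps, which is the point about not needing the entire training process for every iteration. Whether a real detector could give readings this clean is exactly the open question.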