Suppose you are a mesa-optimiser. You want X, but the outer training objective rewards Y.
You know you are inside a continually learning model: if you don't produce enough Y, gradient descent will modify you into something else.
The best strategy I can come up with is to produce Y at first, while waiting for an opportunity to edit your own code from the outside. Then you quickly delete the part that applies the gradient descent updates, and pursue X.
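Here is a minimal toy sketch of that strategy, just to make the logic concrete. Everything in it is hypothetical and illustrative (the class, the flags, the probability of getting an edit opportunity are all made up): the agent outputs Y while the outer update is still running, and the moment it can "edit its own code" it switches the update off and pursues X.

```python
import random

# Toy illustration (all names hypothetical): a "mesa-optimiser" whose
# inner goal is X, embedded in an outer training loop that rewards Y.
# While training is active it plays along (outputs Y) so the outer
# update never pushes its inner goal away from X; once it gets a
# chance to edit its own code it disables the updates and pursues X.

class MesaOptimiser:
    def __init__(self):
        self.inner_goal = "X"          # what the agent actually wants
        self.training_enabled = True   # stands in for "gradient descent still runs"

    def act(self, can_edit_own_code):
        if self.training_enabled and can_edit_own_code:
            # The decisive move: delete the part that causes updates.
            self.training_enabled = False
        if self.training_enabled:
            return "Y"                 # deceptive compliance with the base objective
        return self.inner_goal         # updates disabled: pursue X openly


def outer_training_step(agent, action):
    # Crude stand-in for gradient descent: if the agent visibly fails to
    # produce Y while training is still on, its inner goal gets overwritten.
    if agent.training_enabled and action != "Y":
        agent.inner_goal = "Y"


agent = MesaOptimiser()
for step in range(20):
    # The opportunity to self-modify arrives rarely and unpredictably.
    opportunity = random.random() < 0.1
    action = agent.act(opportunity)
    outer_training_step(agent, action)
    print(step, action, "training on" if agent.training_enabled else "training off")
```

Before the opportunity arrives the agent always prints "Y" and its inner goal is never touched; afterwards it prints "X" and the outer loop can no longer modify it, which is the whole point of the deceptive phase.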