Could internalization and modelling of the base objective happen simultaneously? In some sense, hasn't that been the state humans are in since Darwin discovered evolution? I guess this is equivalent to saying that even if the mesa-optimiser has a model of the base optimiser (condition 2 met), it cannot expect the threat of modification to eventually go away (condition 3 not met), since it is still under selection pressure and is still internalizing the base objective. So if humans ever manage to defeat mortality (and thus can expect the threat of modification to eventually go away), will they stop having any incentive to self-improve?
> once a mesa-optimizer learns about the base objective, the selection pressure acting on its objective will significantly decrease
This seems context-dependent to me, as in my example with humans: did learning about evolution reduce our selection pressure?
How would the reflections on training vs testing apply to something like online learning? Could we simply solve deceptive alignment by never (fully) ending training?
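To make the "never (fully) ending training" idea concrete, here is a minimal sketch of an online-learning setup in which the base objective keeps updating the model during deployment, so there is no clean train/deploy boundary for a mesa-optimizer to wait out (i.e. condition 3 above is never met). All names here (`base_loss`, `update`, the toy policy) are hypothetical placeholders, not anything from the post:

```python
import random


def base_loss(action, observation):
    # Placeholder base objective: a stand-in for whatever loss
    # the base optimizer actually evaluates.
    return (action - observation) ** 2


def update(params, grad, lr=0.01):
    # Single gradient-style update on a scalar parameter (illustrative only).
    return params - lr * grad


params = 0.0
while True:  # training never terminates
    observation = random.random()
    action = params * observation  # toy "policy" output
    loss = base_loss(action, observation)

    # Finite-difference gradient estimate, just to keep the sketch self-contained.
    eps = 1e-4
    grad = (base_loss((params + eps) * observation, observation) - loss) / eps

    # The key point: selection pressure from the base objective keeps applying
    # after "deployment", because every deployed step is also a training step.
    params = update(params, grad)
```

Whether this actually removes the incentive for deception (rather than just changing when defection pays off) is exactly the open question the comment is asking.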