2. If you aren’t applying corrigibility during training, then your model could act dangerously during the training process.
I agree for certain types of training. To clarify: I wrote the above while thinking about a type of base-optimiser that constructs a mesa-optimiser by running many candidate mesa-optimisers in a virtual environment, and then picking the highest-scoring one. If the training process involves running in a real environment, additional safety measures need to be considered.
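To make that setup concrete, here is a minimal sketch in Python. Everything in it (the `VirtualEnv` class, `make_candidate`, the toy scoring rule) is a hypothetical stand-in, not taken from the paper: the only point is that candidates are scored purely inside a simulated environment, and only the highest-scoring one is ever returned.

```python
import random

class VirtualEnv:
    """A toy simulated environment: reward is higher near a hidden target."""
    def __init__(self, target: float):
        self.target = target

    def score(self, policy_param: float, episodes: int = 10) -> float:
        # Average negative distance to the target over noisy episodes.
        total = 0.0
        for _ in range(episodes):
            noise = random.gauss(0.0, 0.1)
            total += -abs(policy_param + noise - self.target)
        return total / episodes


def make_candidate() -> float:
    """Sample one candidate mesa-optimiser (here: a single parameter)."""
    return random.uniform(-10.0, 10.0)


def base_optimise(env: VirtualEnv, n_candidates: int = 1000) -> float:
    """Run many candidates in the virtual environment, keep the best scorer.

    No candidate ever acts in the real world during this search, which is
    why the training process itself is not the dangerous part in this setup.
    """
    best_param, best_score = None, float("-inf")
    for _ in range(n_candidates):
        candidate = make_candidate()
        score = env.score(candidate)
        if score > best_score:
            best_param, best_score = candidate, score
    return best_param


if __name__ == "__main__":
    env = VirtualEnv(target=3.0)
    selected = base_optimise(env)
    print(f"selected mesa-optimiser parameter: {selected:.3f}")
```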
3. If corrigibility is applied as a separate layer, the model could figure out how to disable it.
Well, the interesting thing here is that the button-corrigible agent produced after the software transformation might be fully aware that there exist actions it can take to disable the corrigibility layer, but this does not matter. The software transformation not only adds the layer, it also adds an incentive to preserve the layer, so the agent will not take any actions to disable it. (This broad statement is no longer true if the agent succumbs to an attack in a hostile universe; see e.g. section 7.2 of the paper.)
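To illustrate the two pieces the transformation adds, here is a toy sketch in Python. It is not the paper's actual construction (the paper works at the level of the agent's utility function rather than a wrapper like this), and the names `wrap_agent`, `transformed_utility`, and `DISABLE_LAYER` are all hypothetical. The point is only that the "layer" overrides the agent's action when the button is pressed, while the modified utility makes disabling the layer strictly unattractive, so an agent that knows the disable action exists still never chooses it.

```python
from dataclasses import dataclass
from typing import Callable, List

DISABLE_LAYER = "disable_layer"
SHUTDOWN = "shutdown"

@dataclass
class State:
    button_pressed: bool
    layer_intact: bool

def wrap_agent(base_policy: Callable[[State], str]) -> Callable[[State], str]:
    """The 'layer': if the button is pressed and the layer is intact,
    the agent's chosen action is replaced with a shutdown action."""
    def corrigible_policy(state: State) -> str:
        if state.button_pressed and state.layer_intact:
            return SHUTDOWN
        return base_policy(state)
    return corrigible_policy

def transformed_utility(base_utility: Callable[[State, str], float]
                        ) -> Callable[[State, str], float]:
    """The 'preservation incentive': actions that disable the layer are made
    strictly unattractive, so a utility-maximising agent never picks them,
    even though it is fully aware they exist."""
    def utility(state: State, action: str) -> float:
        if action == DISABLE_LAYER:
            return float("-inf")
        return base_utility(state, action)
    return utility

def best_action(state: State, actions: List[str],
                utility: Callable[[State, str], float]) -> str:
    return max(actions, key=lambda a: utility(state, a))

if __name__ == "__main__":
    # The untransformed agent would happily disable the layer if that paid off.
    base_u = lambda s, a: {"work": 1.0, DISABLE_LAYER: 2.0, SHUTDOWN: 0.0}[a]
    u = transformed_utility(base_u)
    actions = ["work", DISABLE_LAYER, SHUTDOWN]
    policy = wrap_agent(lambda s: best_action(s, actions, u))
    print(policy(State(button_pressed=False, layer_intact=True)))  # -> "work"
    print(policy(State(button_pressed=True, layer_intact=True)))   # -> "shutdown"
```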