...I just completed reading “Risks from Learned Optimization”, so here are some remarks on the corrigibility-related issues raised in that paper. These remarks might also address your concerns above, but I am not 100% sure that I am addressing them all.
Full disclosure: I came across “Risks from Learned Optimization” while preparing the related work section of my paper. My first impression then was that it dealt with a somewhat different safety concern, so I put it on my read-later backlog. Now that I have read it, I see that section 4 discusses safety concerns of a type that corrigibility intends to address, so the two papers are more related than I originally thought.
Note for the reader: I will use the terminology of the “Risks from..” paper in the paragraphs below.
Section 4 of “Risks from...” considers the problem that the agent constructed by the base optimizer might have an emergent incentive to resist later updates, updates intended to make it more aligned. This resistance may take the form of ‘deceptive alignment’. Intuitively, this strikes me as a real risk: it may not apply to current base optimisers, but I expect it to come up in future architectures. For me this risk is somewhat orthogonal to the question of whether the constructed agent might also contain a mesa-optimiser.
The “Risks...” paper warns, correctly I feel, that if the base optimizer is supplied with an objective/utility function that includes a layer to ensure corrigibility, this does not imply that the agent it creates will also be perfectly corrigible. Some loss of fidelity might occur during construction.
As a sub-case, and I think this relates to the concern raised in the comment above: if the base optimizer is supplied with an objective/utility function carefully constructed so that it creates no incentive to spend resources resisting utility function updates, it does not follow that the agent created by the base optimizer will also perfectly lack any incentive to resist. Again, there may be a loss of fidelity. As mentioned in the comment above, it would be an extraordinarily strong assumption that a carefully constructed property of ignorance or indifference present in the training utility function is exactly preserved in the constructed agent.
I believe that including a corrigibility layer in the utility function used by the base optimizer will likely help to improve the safety of the constructed agent: if the constructed agent is scored on displaying corrigible behaviour, it is likely to be more corrigible.
But my preferred way of applying the corrigibility safety layer would be to apply it directly to the agent that was constructed. An important point of the corrigibility design is that it can take a very complex and opaque base utility function U_N and still construct an agent with provable properties.
So what we really want to do is to take the agent constructed by the base optimizer and apply a software transformation that adds the corrigibility layer to it. General theories of computation and machine intelligence say that such a software transformation will exist. However, applying the transformation may be non-trivial, or may lead to an unacceptable performance penalty. For the agent architectures defined in the various corrigibility papers, the transformation is easy and has a relatively low impact on agent performance. But it is unclear to me if and how such a transformation could be done for a neural net or table-driven agent. Agents that contain clear mesa-optimizers inside, i.e. agents that search a solution space while scoring alternative outcomes numerically, may actually be easier to transform than agents using other architectures. So I am not sure whether it should be a goal to come up with base optimisers that suppress the construction of mesa-optimizers. Designing base optimisers to produce outputs that slot into a mesa-optimiser architecture to which software transformations can be applied may be another way to go.
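To make the idea of such a transformation a bit more concrete, here is a minimal Python sketch, assuming a toy mesa-optimizer that searches over candidate actions and scores them against a prediction of the resulting world. All names here (base_score, shutdown_score, stop_button_pressed) are hypothetical, and the sketch deliberately omits the balancing terms that the corrigibility papers add to avoid creating incentives to manipulate the stop button; it only shows where the layer would hook in, not the actual construction from any paper.

```python
def make_corrigible(base_score, shutdown_score):
    """Wrap an opaque base scoring function U_N with a corrigibility layer
    that switches to a shutdown objective once the stop button is pressed.
    (Hypothetical sketch; the real constructions add correction terms.)"""
    def wrapped_score(action, predicted_world):
        if predicted_world.stop_button_pressed:
            return shutdown_score(action, predicted_world)
        return base_score(action, predicted_world)
    return wrapped_score

def choose_action(candidate_actions, predicted_world, wrapped_score):
    # The mesa-optimizer's search over candidate actions is left exactly as
    # it was; only the scoring function it consults has been transformed.
    return max(candidate_actions,
               key=lambda a: wrapped_score(a, predicted_world))
```

The point of the sketch is that the transformation only needs access to the place where outcomes are scored, which is why agents with a clear internal scoring step seem easier to transform than, say, a monolithic neural net policy.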
I have some additional thoughts and comments on “Risks from Learned Optimization”, but I will post them later in the comment sections there.
I think I mostly agree with what you’re saying here, except that I do have some concerns about applying a corrigibility layer to a learned model after the fact. Here are some ways in which I could see that going wrong:
1. The learned model might be so opaque that you can’t figure out how to apply corrigibility unless you have some way of purposefully training it to be transparent in the right way.
2. If you aren’t applying corrigibility during training, then your model could act dangerously during the training process.
3. If corrigibility is applied as a separate layer, the model could figure out how to disable it.
I think if you don’t have some way of training your model to be transparent, you’re not going to be able to make it corrigible after the fact, and if you do have some way of training your model to be transparent, then you might as well apply corrigibility during training rather than doing it at the end. Also, I think making sure your model is always corrigible during training can help with the task of training it to be transparent, as you can use your corrigible model to work on the task of making itself more transparent. In particular, I mostly think of informed oversight via an amplified overseer as how you would do something like that.
2. If you aren’t applying corrigibility during training, then your model could act dangerously during the training process.
I agree for certain types of training. To clarify: I wrote the above while thinking about a type of base optimiser that constructs a mesa-optimiser by running many candidate mesa-optimisers in a virtual environment and then picking the highest-scoring one. If the training process involves running candidates in a real environment, additional safety measures need to be considered.
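As a minimal sketch of the kind of base optimizer I had in mind, assuming hypothetical helper functions sample_candidate, simulate_episode and base_objective (these names are placeholders for whatever the real training setup uses):

```python
def base_optimize(n_candidates, n_episodes, sample_candidate,
                  simulate_episode, base_objective):
    """Pick the best of many candidate mesa-optimisers, scored in simulation."""
    best_candidate, best_score = None, float("-inf")
    for _ in range(n_candidates):
        candidate = sample_candidate()   # e.g. a randomly mutated policy
        # Scoring happens entirely inside a virtual environment, so a
        # badly-behaved candidate cannot act on the real world while it
        # is being evaluated.
        score = sum(base_objective(simulate_episode(candidate))
                    for _ in range(n_episodes)) / n_episodes
        if score > best_score:
            best_candidate, best_score = candidate, score
    return best_candidate
```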
3. If corrigibility is applied as a separate layer, the model could figure out how to disable it.
Well, the interesting thing here is that the button-corrigible agent produced after the software transformation might be fully aware that there exist actions it could take to disable the corrigibility layer, but this does not matter. The software transformation not only adds the layer, it also adds an incentive to preserve the layer, so the agent will not take any actions to disable it. (This broad statement is no longer true if the agent succumbs to an attack in a hostile universe, see e.g. section 7.2 of the paper.)
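As a toy numeric illustration of why disabling the layer is unattractive to the transformed agent (the numbers and action names are made up, and the real mechanism works through balancing terms rather than this direct re-scoring):

```python
BASE_VALUE = 10.0     # value of the outcome according to the opaque U_N
DISABLE_COST = 0.5    # resources spent on removing the corrigibility layer

def corrected_score(action):
    if action == "disable_layer":
        # The transformed utility evaluates this outcome as if the layer had
        # survived, so the agent sees only the cost of the attempt and none
        # of the supposed benefit of being unconstrained.
        return BASE_VALUE - DISABLE_COST
    if action == "just_work":
        return BASE_VALUE
    raise ValueError(action)

best = max(["disable_layer", "just_work"], key=corrected_score)
assert best == "just_work"   # no incentive to spend resources on disabling
```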