Suppose you have a system which appears to be aligned on-distribution. Then, you might want to know:
Is it actually robustly aligned?
How did it learn to behave in an aligned way? Where did the information about the base objective, which it needs in order to display aligned behavior, come from? Did it learn about the base objective by looking at its input, or was that knowledge programmed into it by the base optimizer?
The different possible answers to these questions give us four cases to consider:
It is not robustly aligned and the information came through the base optimizer. This is the standard scenario for (non-deceptive) pseudo-alignment.
It is robustly aligned and the information came through the base optimizer. This is the standard scenario for robust alignment, which we call internal alignment.
It is not robustly aligned and the information came through the input. This is the standard scenario for deceptive pseudo-alignment.
It is robustly aligned and the information came through the input. This is the weird case that we call corrigible alignment. In this case, it’s trying to figure out what the base objective is as it goes along, not because it wants to play along and pretend to optimize for the base objective, but because optimizing for the base objective is actually somehow its terminal goal. How could that happen? Its mesa-objective would have to somehow “point” to the base objective, in such a way that if it had full knowledge of the base objective, its mesa-objective would just be the base objective. What does this situation look like? It has a lot of similarities to the notion of corrigibility: the mesa-optimizer in this situation seems to be behaving corrigibly with respect to the base optimizer (though not necessarily with respect to the programmers), as it is trying to understand what the base optimizer wants and then do that. A toy sketch of this “pointer” structure follows below.
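To make the “pointing” idea a bit more concrete, here is a minimal toy sketch in Python. Everything in it (`BaseObjectiveEstimate`, `CorrigiblyAlignedAgent`, the string outcomes) is invented for illustration; the only feature that matters is that the agent's terminal objective is defined as "whatever my current estimate of the base objective says," so that updating the estimate from the input automatically changes what the agent optimizes for.

```python
from typing import Callable, List

# Toy illustration only: all names and structure here are invented for this sketch.
# The key property of corrigible alignment (as described above) is that the
# mesa-objective is a *pointer* to the agent's current estimate of the base
# objective, not a fixed proxy baked in by the base optimizer.

Outcome = str  # stand-in for whatever the agent's world model evaluates


class BaseObjectiveEstimate:
    """The agent's running guess at the base objective, refined from its input."""

    def __init__(self) -> None:
        # Start with a maximally uninformative estimate.
        self.scoring_fn: Callable[[Outcome], float] = lambda outcome: 0.0

    def update(self, new_scoring_fn: Callable[[Outcome], float]) -> None:
        # In a real system this would be inference from observations;
        # here we just swap in a newly inferred scoring function.
        self.scoring_fn = new_scoring_fn


class CorrigiblyAlignedAgent:
    """Terminal goal = whatever the current estimate says (the 'pointer')."""

    def __init__(self, estimate: BaseObjectiveEstimate) -> None:
        self.estimate = estimate

    def mesa_objective(self, outcome: Outcome) -> float:
        # Not a frozen proxy: this always defers to the live estimate.
        return self.estimate.scoring_fn(outcome)

    def act(self, options: List[Outcome]) -> Outcome:
        return max(options, key=self.mesa_objective)


# If the estimate is updated to match the true base objective, the agent's
# behavior tracks the base objective with no change to the mesa-objective,
# because the mesa-objective only ever pointed at the estimate.
estimate = BaseObjectiveEstimate()
agent = CorrigiblyAlignedAgent(estimate)
estimate.update(lambda o: 1.0 if o == "what the base optimizer wants" else 0.0)
print(agent.act(["something else", "what the base optimizer wants"]))
```

Contrast this with the standard pseudo-alignment case above, where the proxy is a fixed function installed by the base optimizer rather than a live pointer that keeps being updated from the input.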
Is this a correct representation of corrigible alignment:
The mesa-optimizer (MO) has a proxy for the base objective, which it optimizes.
As more information about the base objective is received, MO updates the proxy.
With sufficient information, the proxy may converge to a proper representation of the base objective.
Example: a model-free RL algorithm whose policy is argmax over actions with respect to its state-action value function (a concrete sketch follows after this list).
The base objective is the reward signal.
The value function serves as a proxy for the base objective.
The value function is updated as future reward signals are received, gradually refining the proxy to better align with the base objective.
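To make the analogy concrete, here is a minimal tabular Q-learning sketch, assuming a toy environment with a `reset()`/`step(action)` interface that returns `(next_state, reward, done)`; `ToyChainEnv` and all the hyperparameters are placeholders invented for the sketch, not a claim about any particular implementation. The Q-table plays the role of the proxy: the policy is (epsilon-)greedy argmax over actions with respect to Q, and each observed reward signal nudges Q toward the return that the reward defines.

```python
import random
from collections import defaultdict

ACTIONS = [0, 1]                      # placeholder action set: 0 = left, 1 = right
ALPHA, GAMMA, EPSILON = 0.1, 0.99, 0.2

# The state-action value function: the mesa-optimizer's proxy for the base objective.
Q = defaultdict(lambda: {a: 0.0 for a in ACTIONS})


def policy(state):
    """Argmax over actions w.r.t. Q (ties broken randomly), with some exploration."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    best = max(Q[state].values())
    return random.choice([a for a, v in Q[state].items() if v == best])


def q_update(state, action, reward, next_state, done):
    """One-step TD update: nudge Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    target = reward if done else reward + GAMMA * max(Q[next_state].values())
    Q[state][action] += ALPHA * (target - Q[state][action])


class ToyChainEnv:
    """Placeholder environment: a short chain with reward only at the right end."""

    def __init__(self, length=5):
        self.length = length
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        self.pos = max(0, min(self.length - 1, self.pos + (1 if action == 1 else -1)))
        done = self.pos == self.length - 1
        reward = 1.0 if done else 0.0
        return self.pos, reward, done


def train(env, episodes=200):
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # Each reward signal refines the proxy (Q) toward the return
            # that the base objective (the reward signal) defines.
            q_update(state, action, reward, next_state, done)
            state = next_state


train(ToyChainEnv())
print({s: max(Q[s], key=Q[s].get) for s in sorted(Q)})  # greedy action per visited state
```

With enough experience on-distribution, the Q-table comes to reflect the returns defined by the reward signal, which is the sense in which the proxy "may converge to a proper representation of the base objective" above.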