Is this a correct representation of corrigible alignment:
The mesa-optimizer (MO) has a proxy of the base objective that it’s optimising for.
As more information about the base objective is received, MO updates the proxy.
With sufficient information, the proxy may converge to a proper representation of the base objective.
Example: a model-free RL algorithm whose policy is the argmax over actions with respect to its state-action value function.
The base objective is the reward signal.
The value function serves as a proxy for the base objective.
The value function is updated as future reward signals are received, gradually refining the proxy to better align with the base objective.
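For concreteness, here is a minimal Python sketch of that example, assuming tabular Q-learning as the model-free algorithm; the toy environment and its dynamics are made up purely for illustration. The Q table plays the role of the proxy, the reward signal is the base objective, and each TD update refines the proxy as new reward information arrives.

```python
# Minimal sketch, assuming tabular Q-learning as the model-free algorithm.
# The Q table is the proxy for the base objective (the reward signal).
# The toy environment (random transitions, reward only in state 0) is
# hypothetical, purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2
alpha, gamma, epsilon = 0.1, 0.9, 0.1

Q = np.zeros((n_states, n_actions))  # the proxy of the base objective

def act(state):
    # Policy: epsilon-greedy argmax over actions w.r.t. the proxy Q.
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[state]))

def step(state, action):
    # Hypothetical dynamics: action 0 drifts toward state 0, which pays reward 1.
    next_state = max(state - 1, 0) if action == 0 else int(rng.integers(n_states))
    reward = 1.0 if next_state == 0 else 0.0
    return next_state, reward

state = int(rng.integers(n_states))
for _ in range(10_000):
    action = act(state)
    next_state, reward = step(state, action)
    # Each observed reward carries information about the base objective;
    # the TD update moves the proxy Q toward it.
    Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
    state = next_state

print(Q)  # with enough experience, Q approximates the true action values
```

In this sketch the "information about the base objective" is just the stream of reward observations, and the proxy (Q) converges toward the true action values only in the limit of sufficient exploration and experience, mirroring the "may converge" caveat above.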