Is this a correct representation of corrigible alignment:
The mesa-optimizer (MO) has a proxy of the base objective that it’s optimising for.
As more information about the base objective is received, MO updates the proxy.
With sufficient information, the proxy may converge to a proper representation of the base objective.
Example: a model-free RL algorithm whose policy is the argmax over actions with respect to its state-action value function.
The base objective is the reward signal.
The value function serves as a proxy for the base objective.
The value function is updated as future reward signals are received, gradually refining the proxy to better align with the base objective.
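For concreteness, here is a minimal Python sketch of that example, assuming tabular Q-learning as the model-free algorithm; the toy environment and its dynamics are made up purely for illustration. The Q table plays the role of the proxy, the reward signal is the base objective, and each TD update refines the proxy as new reward information arrives.

```python
# Minimal sketch, assuming tabular Q-learning as the model-free algorithm.
# The Q table is the proxy for the base objective (the reward signal).
# The toy environment (random transitions, reward only in state 0) is
# hypothetical, purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2
alpha, gamma, epsilon = 0.1, 0.9, 0.1

Q = np.zeros((n_states, n_actions))  # the proxy of the base objective

def act(state):
    # Policy: epsilon-greedy argmax over actions w.r.t. the proxy Q.
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[state]))

def step(state, action):
    # Hypothetical dynamics: action 0 drifts toward state 0, which pays reward 1.
    next_state = max(state - 1, 0) if action == 0 else int(rng.integers(n_states))
    reward = 1.0 if next_state == 0 else 0.0
    return next_state, reward

state = int(rng.integers(n_states))
for _ in range(10_000):
    action = act(state)
    next_state, reward = step(state, action)
    # Each observed reward carries information about the base objective;
    # the TD update moves the proxy Q toward it.
    Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
    state = next_state

print(Q)  # with enough experience, Q approximates the true action values
```

In this sketch the "information about the base objective" is just the stream of reward observations, and the proxy (Q) converges toward the true action values only in the limit of sufficient exploration and experience, mirroring the "may converge" caveat above.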