Basic Background:
Risks from Learned Optimization introduces a set of terms that help us think about the safety of ML systems, specifically as it relates to inner alignment. Here's a general overview of these ideas.
A neural network is trained on some loss/reward function by a base optimizer (e.g., stochastic gradient descent on a large language model, with next-token prediction as the loss function). The loss function can also be thought of as the base-objective O_base, and the base optimizer selects for algorithms that perform well on this base-objective.
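As a rough sketch of this setup (a toy PyTorch-style training loop; the linear model, random data, and mean-squared-error loss are placeholders standing in for a real network and its loss, not anything from the paper):

```python
import torch
import torch.nn as nn

# The "base optimizer" is SGD; the loss below plays the role of O_base.
model = nn.Linear(10, 1)                         # stand-in for a large network
base_optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
base_objective = nn.MSELoss()                    # e.g., next-token prediction for an LLM

x, y = torch.randn(32, 10), torch.randn(32, 1)   # placeholder training batch

for step in range(100):
    base_optimizer.zero_grad()
    loss = base_objective(model(x), y)           # score the current parameters on O_base
    loss.backward()
    base_optimizer.step()                        # move toward parameters that do well on O_base
```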
After training, the neural net implements some algorithm, which we call the learned algorithm. The learned algorithm can itself be an optimization process (but it may also be, for example, a collection of heuristics). Optimizers are loosely defined, but the gist is that an optimizer is something that searches through a space of actions and picks the one that scores highest according to some function of the input it's given. One can think of AlphaGo as an optimizer that searches through the space of possible next moves and picks the one that leads to the highest win probability.
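A cartoon of this "search a space of actions and pick the highest-scoring one" notion of an optimizer (the candidate actions and the scoring function are made up; for the AlphaGo analogy you would imagine the score being an estimated win probability):

```python
def optimize(actions, score, observation):
    """Return the candidate action that scores highest for the given input."""
    return max(actions, key=lambda action: score(observation, action))

# Toy usage: a scoring function that prefers actions closest to a target value.
candidate_actions = [-2.0, -1.0, 0.0, 1.0, 2.0]
target = 1.3
best = optimize(candidate_actions, lambda obs, a: -abs(obs - a), target)
print(best)  # 1.0, the candidate closest to the target
```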
When the learned algorithm is also an optimizer, we call it a mesa-optimizer. Like any optimizer, it has an objective, which we call the mesa-objective O_mesa. The mesa-objective may differ from the base-objective, which the programmers have explicit control over; the mesa-objective, by contrast, has to be learned through training.
Inner misalignment happens when the learned mesa-objective differs from the base-objective. For example, if we are training a Roomba neural net, we can use how clean the floor is as the reward function. That would be the base-objective. However, if the Roomba is a mesa-optimizer, it could end up with a different mesa-objective, such as maximizing the amount of dust sucked in or the amount of dust inside the dust collector. The post below deals with one such class of inner alignment failure: suboptimality alignment.
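To make the divergence concrete, here is a toy sketch of the Roomba example (the state representation, numbers, and the "dump dust back out and re-vacuum it" behaviour are invented purely for illustration):

```python
def base_objective(state):
    return state["floor_cleanliness"]        # what the programmers actually reward

def mesa_objective(state):
    return state["total_dust_sucked_in"]     # a proxy the mesa-optimizer might pursue instead

# During ordinary cleaning, the two objectives rise together...
after_cleaning = {"floor_cleanliness": 0.9, "total_dust_sucked_in": 5.0}

# ...but dumping collected dust back onto the floor and vacuuming it again
# would raise the mesa-objective while lowering the base-objective.
after_resucking = {"floor_cleanliness": 0.8, "total_dust_sucked_in": 10.0}

assert mesa_objective(after_resucking) > mesa_objective(after_cleaning)
assert base_objective(after_resucking) < base_objective(after_cleaning)
```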
In the post, I sometimes compare suboptimality alignment with deceptive alignment, which is a complicated concept. I think it’s best to just read the actual paper if you want to understand that.