Yes, it’s worth pulling out that the mesa-optimizers demonstrated here are not consequentialists; they are optimizing the goodness of fit of an internal representation to in-context data.
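(For concreteness, here is a minimal toy sketch of what I mean by a purely "epistemic" inner optimizer, assuming a stand-in numpy model rather than anything from the actual paper: the inner loop searches over an internal parameter w to improve fit to the prompt's examples, and nothing it optimizes refers to outcomes in an environment.)

```python
# Toy sketch (my own illustration, not the paper's model) of an inner optimizer
# whose objective is purely epistemic: fit to in-context data, not consequences.
import numpy as np

def epistemic_mesa_step(context_x, context_y, query_x, lr=0.1, steps=500):
    """Fit an internal linear representation w to the in-context (x, y) pairs
    by gradient descent on squared error, then predict on query_x."""
    w = np.zeros(context_x.shape[1])          # internal representation
    for _ in range(steps):
        residual = context_x @ w - context_y  # misfit to the in-context data
        grad = context_x.T @ residual / len(context_y)
        w -= lr * grad                        # inner optimization: improve goodness of fit
    return query_x @ w                        # prediction; no environment, no consequences

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = X @ np.array([1.0, -2.0, 0.5])
print(epistemic_mesa_step(X, y, rng.normal(size=3)))
```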
In arguments about deceptive alignment, this neutralizes the claim that “it’s probably not a realistically efficient or effective or inductive-bias-favoured structure to actually learn an internal optimization algorithm”. Arguments like “it’s not inductive-bias-favoured for mesa-optimizers to be consequentialists instead of maximizing purely epistemic utility” still remain.
Although I predict someone will find consequentialist mesa-optimizers in Decision Transformers, that has not (to my knowledge) actually been seen yet.