My (naive) take on Risks from Learned Optimization

This is my current understanding of Risks from Learned Optimization. I’m pretty sure there are mistakes here and there. I also list things I agree or not with in the paper as I think it might be too large for one comment. So feel free to point out where I’m wrong (and why)!

Parent optimization process might intentionally or unintentionally create another child optimization process. It’s more likely if a problem couldn’t be fully solved by a parent optimizer, there are too many policies to create; the smaller parent the more likelihood of child optimizer appearance and vice versa; the parent is biased towards simplicity. Less likely if the parent has less state or not enough power to find a child. Examples of child optimization are evolution and human brain, teacher and student, society and corporation, programmer and a program that searches solutions to NP problem, and so on.

A child is likely to have goals different to parent’s goals when it is free from parent’s control or deployed in another environment. In that case a child’s goals might be totally different from parent’s ones, or be near them, or be not optimal. If a child’s objective is totally different from the parent’s then it might nevertheless increase the parent’s objective if it doesn’t interfere with its own goals or if it can pursue its own goals afterwards. It is unclear how to match a child’s objective, which is indirectly learned, to the parent’s one (inner alignment). Especially risky is a child’s deception that happens if the child can project or plan many steps ahead, knows the parent’s objective, and there is a threat of shutting down or modification. Deception is more likely as we create objective proxies, or not train well before deployment.

It is likely that without child optimization systems will be much less useful but with this optimization in place the existential risk from AI increases.

In the glossary:

An optimizer is a system that internally searches through some space...

But shouldn’t it be a process and not a system? Shouldn’t it be only some computational process? To my mind the optimizer in this work is a computational process because it does a calculation to find an optimal solution. Still it is arguable if evolution is a computational process but I think it is in accordance with the Church-Turing thesis as optimization is a close ended and physical process. In that sense it is hard to imagine a pile of bricks or a rock to be an optimizer though they are systems.

Would be good to have some predictions for claims to be falsifiable. As it is largely theoretical.

Example:

At the same time, base optimizers are generally biased in favor of selecting learned algorithms with lower complexity. Thus, all else being equal, the base optimizer will generally be incentivized to look for a highly compressed policy.

What experiment might prove it false or right? Also it seems that deception is a trickier algorithm than just following a base objective. So what experiment might find the bound where an algorithm should find such a more complex strategy as deception?

Also, it seems the paper misses a list of architectures (current or future) that might exhibit mesa-optimization. Or make a hypothesis of what architectures might produce mesa-optimization.

I’m not sure that the mesa optimizer’s objective can’t be set directly. Why?

Machine learning practitioners have direct control over the base objective function—either by specifying the loss function directly or training a model for it—but cannot directly specify the mesa-objective developed by a mesa-optimizer.

Why can’t we make the base optimizer inject into mesa-optimizer’s objective some our objective? So that it is some mix of goals. This might be some high level restrictions like not to lie or not make some dangerous actions. That is, on each step when a base optimizer finishes with a current mesa-optimizer, why don’t we mix it with another model or strategy? Or why can’t we modify a final mesa-optimizer using reverse engineering and inject another strategy. Example might be a language model that has an output channel into a command line to execute commands and python programs from there. We can hijack its output channel and insert our python code.

I agree it is right that it poses great risk when goals are misaligned. Because if we can’t control the mesa-objective then it is likely it would pursue some proxy objective like increasing the number of smiles or clicks. And, because of Goodhart’s law, there is a tendency that such goals fail to be a good objective. Canonical example of this is King Midas who wished to have gold from whatever he touched.

Agree that it is more likely that mesa optimization will be present if problem space complexity increases as we see it in evolution. Still it is unclear how many iterations, and what time should pass before that moment. Evolution shows that complex systems appeared later than those that are simpler. So we might be satisfied with simple models and avoid complex ones. That is important because it is directly connected to x-risk and disempowerment of humanity by a more intelligent system(s).

Agree that it is easier to find misaligned than an aligned optimizer because an aligned optimizer should be more complex. Simpler solutions are more reachable because adding features to the objective doesn’t make the optimizer find it quicker but usually harder. Example is a task to find a barber in some town and find a blond barber in this town. It might be the same probability if all barbers are blond but it is not possible that the probability will increase for the second case.

Agree that the larger mesa-optimizer is the better aligned it is, but the less likely a base optimizer finds it. Because aligning powerful and critical optimizers should require a larger amount of information. Example is self-driving cars which might easily learn road signs but will require more space to learn how to react to various objects on the road (humans, dogs, etc.).

Agree that deception is an especially hard problem, but it looks to me the paper misses additional discussion about model honesty or truthfulness which might be a solution to mesa-optimizer deception. Consider detecting deception at train time: if there exists a way to ask a mesa optimizer at train time if it will give different results at test / deploy time. In that case if we have an honest mesa optimizer that can somehow tell us this information then we detect that it will be a treacherous turn and modify its objection at the train time.

It seems to me it needs more research about the risk from the collusion of deceptive agents or optimizers that might disempower humanity. Perhaps authors could propose an experiment that would prove the likelihood of simultaneous deception of several models.

Regarding the conclusion. I’m not sure about the 3rd case in the consequences:

Furthermore, in such a scenario, some parts of the outer alignment problem may not need to be solved either …

Looks like we should solve the whole outer alignment in that case because what if a base optimizer nevertheless finds a way to mesa-optimization because it might have ability and we didn’t make it not to do so. Hence again we might face the inner alignment problem or even deception. It seems there is a small risk of it. Or we need more clarification, what does “preventing mesa-optimizers from arising″ mean. Does it mean that it might happen occasionally but disabled, i.e. possibility of it still holds? I’m saying it because it is widely known how unstable and buggy our software currently is. There is even ‘a law’ for this, Murphy’s law, that says that what can be broken will be broken.

It seems that the paper misses concrete examples of how mesa-optimization might occur in an optimizer that was not designed for mesa-optimization, i.e. unintended optimization. Here is an example derived from “The alignment problem from a deep learning perspective” by Richard Ngo. Suppose we can have one large optimizer that actively interacts with the world, some giant model constantly updating its weights and running in a large gpu farm. It will interact with the world by emitting actions on a given state, and the policy (mapping from states to actions) will be learned via self supervised learning (it is told or shown what next action should be given that input). So where is mesa-optimization here? How exactly does it map states to action? Is it like a transformer which is fine tuned for tasks? So this fine tuning will be like a mesa optimizer, i.e. it has this giant base optimizer that seems to be aligned with human values but when it does fine tuning for a specific task (say composing python programs) it creates a mesa-optimizer.