How much should we worry about mesa-optimization challenges?
This post was written quickly, lest I never write it at all.
Picture the following scenario.
Humans train a model, M, with the intention that M minimize a loss function L.
The model, M, will now take a set of actions.
I see this going wrong in two ways.
It is possible that L is malformed (misaligned, specifically), such that effectively decreasing L kills everyone. This is the classic paperclip maximizer scenario. We currently do not know how to design L such that this does not happen.
Even if L is not malformed, the set of actions taken by M might be catastrophic. This is the mesa-optimization problem.
The first failure case has captured most of my attention, while I have been somewhat dismissive of the second.
I would like to explain why I was dismissive of the mesa-optimization problem, and make an argument for why I think we should in fact take it seriously.
--
We understand that M is an optimizer. However, we can also assume that M is not a perfect optimizer. On out-of-distribution data, M is likely to fail to optimize L.
We can define a new loss function, L’, which M actually does perfectly optimize for. We define L’ such that the more resources M has, the more effective M will be in decreasing L’.
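To make this concrete, here is a toy sketch I am adding for illustration (the sine target, the cubic model, and the training range [0, 1] are all arbitrary choices of mine, not anything from the post): L rewards matching sin(x), M is a polynomial fit only on the training distribution, and L’ is defined, trivially, as the objective that M’s outputs exactly minimize. On-distribution the two objectives nearly coincide; far off-distribution they come apart.

```python
import numpy as np

rng = np.random.default_rng(0)

# Intended loss L: we want M(x) to match sin(x).
def L(pred, x):
    return (pred - np.sin(x)) ** 2

# "Model" M: a cubic polynomial fit only on the training distribution x in [0, 1].
x_train = rng.uniform(0.0, 1.0, 200)
M = np.poly1d(np.polyfit(x_train, np.sin(x_train), deg=3))

# By construction, M perfectly optimizes L'(pred, x) = (pred - M(x))^2.
def L_prime(pred, x):
    return (pred - M(x)) ** 2

x_in = rng.uniform(0.0, 1.0, 1000)    # on-distribution
x_out = rng.uniform(5.0, 10.0, 1000)  # off-distribution

# On-distribution, minimizing L' also drives L close to zero...
print("mean L on-distribution: ", L(M(x_in), x_in).mean())
# ...off-distribution, M still minimizes L' (by definition) but not L.
print("mean L off-distribution:", L(M(x_out), x_out).mean())
```

The point of the sketch is just that L’ is read off from M’s behavior, and nothing forces it to keep tracking L outside the region where training data constrained it.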
L’ is not taken from “human-designed objective function” space. In fact, my intuition is that L’ is likely to look very strange and complex. If we were to attempt to extract the utility function of a heavily intelligence-enhanced human from their actions, I doubt that such a utility function would seem simple either. This intuition is what initially made me dismissive of mesa-optimization as a problem.
Despite having read Omohundro’s [AI Drives](https://selfawaresystems.files.wordpress.com/2008/01/ai_drives_final.pdf) paper, it did not seem to me that there was any obvious reason to assume that these strange, L’-like objective functions would suffer from instrumental convergence. One can certainly imagine many objective functions that do not lead to these drives. One could even imagine an objective function which rewards having fewer resources, less skill, or less rationality.
It might be the case that most utility functions sampled from the space of all possible utility functions converge to having these drives, but that did not and does not seem like an obviously true fact to me.
--
I can’t find the post, but someone on LessWrong wrote something along the lines of “only a tiny sliver of possible worlds are compatible with human existence.” This seemed like an obviously true fact, and I’d intuit that it applies to biological sentience more broadly.
That was the “aha” moment for me. Without understanding L’ more deeply, we can begin by assuming that L’ is sampled from “objective function space” instead of “human-like objective function space.”
I think the maximum-entropy assumption is that the terminal goal-states[1] of functions in “objective function space” are uniformly distributed across all possible states. Since only a tiny sliver of possible states is compatible with biological sentience, we should expect a highly effective L’ optimizer to be incompatible with human life.
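To put a number on that intuition, here is a minimal Monte Carlo sketch under the uniform assumption (the state count, the size of the sentience-compatible sliver, and the number of sampled objectives are arbitrary placeholders I chose for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

N_STATES = 10**6        # crude stand-in for "all possible world-states"
SLIVER = 10             # states compatible with biological sentience (a tiny sliver)
N_OBJECTIVES = 100_000  # sampled objective functions

# Maximum-entropy assumption: each sampled objective's terminal goal-state
# (the world that minimizes its loss) is a uniform draw over all states.
goal_states = rng.integers(0, N_STATES, size=N_OBJECTIVES)

compatible = (goal_states < SLIVER).mean()
print(f"observed fraction with a sentience-compatible goal-state: {compatible:.1e}")
print(f"expected fraction (SLIVER / N_STATES):                    {SLIVER / N_STATES:.1e}")
```

The particular numbers do not matter; the conclusion only depends on the sliver being tiny relative to the whole space, so the expected fraction of sentience-compatible goal-states is correspondingly tiny.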
--
Luckily, we have a bit[2] more than zero bits of information about L’. For example, we know that with enough training, L’ can be very similar to L.
I think it might be worth exploring other things that we expect to be true about L’.
[1] We define a goal-state of an objective function to be a world which minimizes loss for the objective function.
[2] ;)