Re question 3 (is it possible to detect the presence of a mesa-optimiser from only its input-output behavior?), I was recently mulling over the following idea.
Suppose that on some task with some loss function, ML models tend to produce losses distributed in some predictable way (e.g. the losses are usually exponentially distributed, with the rate parameter λ growing larger as the model improves). Now if our model has a mesa-optimiser inside of it, I would expect its loss to be distributed like min(X1, …, Xn), where the Xi are drawn from some distribution of the sort we originally expected. This minimum might be distributed differently from our originally expected distribution, giving a way to detect a mesa-optimiser just by looking at the model's losses!
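To make this concrete, here's a toy simulation (my own sketch; the choice of a Gamma base distribution, the sample sizes, and n = 8 internal candidates are all arbitrary assumptions, not anything from a real model): if ordinary losses follow some base distribution, a mesa-optimiser that internally evaluates n candidates and outputs the best one produces losses distributed like the minimum of n draws, and a two-sample KS test can often tell the two apart.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_models, n_candidates = 5000, 8

# Assume (arbitrarily) that ordinary losses are Gamma(shape=3) distributed.
ordinary = rng.gamma(shape=3.0, size=n_models)

# A mesa-optimiser's loss: the minimum over n_candidates draws from the
# same base distribution (it picks its best internal candidate).
mesa = rng.gamma(shape=3.0, size=(n_models, n_candidates)).min(axis=1)

# A two-sample Kolmogorov-Smirnov test distinguishes the two samples.
ks = stats.ks_2samp(ordinary, mesa)
print(f"KS statistic = {ks.statistic:.3f}, p-value = {ks.pvalue:.3g}")
```

With these (assumed) distributions the separation is stark; how detectable it is in practice of course depends on how sharply concentrated real loss distributions are.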
Some caveats:
This won’t work if we originally expected our losses to be exponentially distributed, because the minimum of a collection of independent exponential random variables is again exponentially distributed (with rate equal to the sum of the individual rates).
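A quick numerical check of this closure property (a sketch using numpy/scipy; the values of λ and n are arbitrary): the minimum of n i.i.d. Exponential(λ) draws is itself Exponential(nλ), so fitting an exponential to the minima turns up nothing unusual.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
lam, n = 2.0, 8

# Minima of n i.i.d. Exponential(lam) draws; theory says these are
# Exponential(n * lam) -- still exponential, just with a larger rate.
mins = rng.exponential(scale=1 / lam, size=(20_000, n)).min(axis=1)

scale_hat = mins.mean()  # MLE of the exponential scale is the sample mean
ks = stats.kstest(mins, "expon", args=(0, scale_hat))
print(f"fitted rate ~ {1 / scale_hat:.2f} (theory: n*lam = {n * lam:.0f}), KS p = {ks.pvalue:.3f}")
```

The fitted rate comes out near nλ and the KS test fails to reject exponentiality, so the min-of-draws signature vanishes exactly as the caveat says.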
The more decoupled the mesa-objective is from the base-objective, the less this reasoning will apply (since the mesa-optimiser will output the option which minimizes the mesa-loss, not the base-loss). That said, my intuition is that if you have a mesa-optimiser, your loss distribution should still look weird somehow (since the output of an argmin should generally be distributed differently from the options among which the argmin selects).
This intuition must break down in extreme cases though: a deceptively aligned mesa-optimiser smart enough to know we’re watching the loss distribution for signs of a mesa-optimiser could deliberately shape its outputs so that its losses are distributed the way we expect.
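The partial-decoupling point can also be simulated (again a toy sketch of mine, with mesa-loss and base-loss modelled as correlated Gaussians, which is purely an assumption): the mesa-optimiser argmins its own loss, and the base-loss of the option it picks shifts downward roughly in proportion to how correlated the two objectives are.

```python
import numpy as np

rng = np.random.default_rng(2)
n_models, n_candidates = 20_000, 8

def observed_base_loss(rho):
    # Base- and mesa-losses for each candidate, correlated with coefficient rho.
    base = rng.normal(size=(n_models, n_candidates))
    mesa = rho * base + np.sqrt(1 - rho**2) * rng.normal(size=base.shape)
    pick = mesa.argmin(axis=1)  # the mesa-optimiser minimizes ITS loss
    return base[np.arange(n_models), pick]  # ...but we observe the base-loss

for rho in (1.0, 0.5, 0.0):
    print(f"rho = {rho}: mean base-loss of chosen option = {observed_base_loss(rho).mean():+.2f}")
```

At rho = 1 the observed base-losses are exactly a min-of-n sample; at rho = 0 the mesa-optimiser's choice looks like a uniformly random draw, and the distributional signature disappears entirely.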