What is this trying to solve? If you try to set up an ML system to be a mimic, can you ensure you don’t get an inner misaligned mesa optimizer? In general, what part of AI safety is this supposed to help with? A new design for alignable AGI? A design to slap onto existing ML systems, like a better version of RLHF?
I’m not proposing a new design here—large language models are (approximately) mimics, and large language models with RLHF are (approximately) mimics controlled using reinforcement learning with a KL penalty, which I describe here. I’m proposing the outline of a theory that, with more work, might help us better understand control of mimics. In particular, if we have an AI that comes close enough to satisfying the right assumptions ((1) it is approximately a Bayesian learner, (2) its actions involve drawing samples from its predictive distribution, (3) there is sufficient overlap between the operator’s and the mimic’s priors, and (4) they share data to learn from), then we can use the operator’s model of the training data to predict the effects of the AI’s actions. This is useful because, for example, if a proxy F is highly correlated with some “true objective” G in the training data, then under the right conditions it will also be highly correlated in the data produced by the AI’s actions.
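To make the RLHF framing concrete: the standard KL-penalised fine-tuning objective has a well-known closed-form optimum (a textbook result, not something specific to this post; $\pi_0$ is the base mimic, $r$ the reward model, and $\beta$ the penalty weight):

$$\max_{\pi}\;\mathbb{E}_{x\sim\pi}\left[r(x)\right]-\beta\,D_{\mathrm{KL}}\!\left(\pi\,\|\,\pi_0\right)\quad\Longrightarrow\quad\pi^{*}(x)=\frac{1}{Z}\,\pi_0(x)\,\exp\!\left(\frac{r(x)}{\beta}\right).$$

In other words, the tuned model still samples from a reweighting of the mimic’s predictive distribution, which is what lets the operator’s model of the training data say anything about the tuned model’s outputs.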
The theory I outline also covers when this ideal level of controllability should be expected to fail—when the machine is pushed to exert control over some variable over which it normally has only weak control. This is a meaningful extension of previous work on problems with proxy goals—for example, Manheim and Garrabrant do not identify weak control as a condition under which Goodhart-type problems should be expected to be particularly severe.
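As a minimal numerical sketch of this failure mode (my illustration, not a construction from the post): contrast a proxy the machine controls almost fully with one it moves only through a small contribution, the rest coming from exogenous variation it does not control. The names F_strong and F_weak and the noise scales below are made up for the toy.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# G: the "true objective", which is also the machine's own contribution to each proxy.
# The extra noise terms are exogenous variation the machine does not control.
G = rng.normal(0.0, 1.0, n)
F_strong = G + rng.normal(0.0, 0.3, n)   # proxy the machine controls almost fully
F_weak   = G + rng.normal(0.0, 3.0, n)   # proxy the machine controls only weakly

def hard_push(proxy, q=0.999):
    """Condition on the proxy landing in its top 0.1% tail; report proxy and G there."""
    keep = proxy > np.quantile(proxy, q)
    return proxy[keep].mean(), G[keep].mean()

print("strong control: mean proxy, mean G =", hard_push(F_strong))
print("weak control:   mean proxy, mean G =", hard_push(F_weak))
```

With these numbers the strongly controlled proxy comes out at roughly (3.5, 3.2): the achieved proxy value is an honest account of G. The weakly controlled proxy comes out at roughly (10.7, 1.1): the same selection pressure drives the proxy far beyond what the machine’s contribution can deliver, and most of the apparent gain is absorbed by the exogenous term.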
“Misaligned mesa optimizer” is a broad term that covers a lot of very different problems. One kind of problem comes under what I would call “convergence issues”—the proposition that, if you throw too much compute at a distribution-learning problem, at some point instead of getting better predictions you get manipulation. This is really speculative—it’s supported by ideas like “the Solomonoff Prior is malign”, but it’s unclear how practically relevant these theories are.
A different problem, also sometimes understood in terms of “misaligned mesa optimizers”, is that the AI will simply get the wrong idea about the control we are trying to apply to it (“goal misgeneralisation”). The theory directly addresses this problem: under the assumptions presented, goal misgeneralisation does not happen when we tune the mimic to exert control over some variable that is, in any case, normally fully controllable by the mimic. Goal misgeneralisation should be expected in the case of imperfect control, and the amount of goal misgeneralisation can be estimated if you can bound the difference between $P(X_n \mid R_n = 0, X_{<n})$ and $P(X_n \mid X_n \in X_F, X_{<n})$.
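One way to make “bound the difference” concrete (my gloss, in terms of total variation distance, not a bound stated elsewhere in this post): for any score $g$ of the outcome taking values in $[0,1]$, for instance an indicator that the true objective G is satisfied,

$$\Big|\,\mathbb{E}\!\left[g(X_n)\mid R_n=0,X_{<n}\right]-\mathbb{E}\!\left[g(X_n)\mid X_n\in X_F,X_{<n}\right]\Big|\;\le\;d_{\mathrm{TV}}\!\Big(P(X_n\mid R_n=0,X_{<n}),\,P(X_n\mid X_n\in X_F,X_{<n})\Big),$$

so a bound on the total variation distance between the two conditionals translates directly into a bound on how much goal misgeneralisation can cost in terms of any such score.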
Many big problems in AI safety lack adequate theories, so when someone proposes a new AI design, no one can really say whether it solves the key problems or not. In this post, I explain a promising approach to improving the theory of AI control.