The distinction I’m most interested in here is the one drawn in “Risks from Learned Optimization” between internalization and modeling. As described in the paper, however, there are really three cases:
In the internalization case, the base optimizer (e.g. gradient descent) modifies the mesa-optimizer’s parameters to explicitly specify its goal.
In the corrigible modeling case, the mesa-optimizer builds a world model containing concepts like “what humans value” and then the base optimizer modifies the mesa-optimizer’s objective to “point to” whatever is found by the world model.
In the deceptive modeling case, the mesa-optimizer builds a world model containing concepts like “what humans value” and then the base optimizer modifies the mesa-optimizer so that it instrumentally optimizes that concept for the purpose of staying around (i.e., avoiding being modified away by the base optimizer).
Which one of these situations is most likely seems to depend heavily on the extent to which goals get encoded explicitly by default versus via reference to a world model. This is a hard thing to formalize in a way that lets us make progress on testing it now, but my “Reward side-channels” proposal at least tries to, by drawing the following distinction: if I give the model its reward as part of its observations, train it to optimize that reward, but then change it at test time, what happens? If it’s doing pure internalization, it should either fail to optimize either reward very well or succeed at optimizing the old reward but not the new one. If it’s doing pure modeling, however, it should fail to optimize the old reward but succeed at optimizing the new one, which is the interesting case; the experiment would then be about determining under what circumstances that behavior can actually be made to appear.
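To make the proposed experiment concrete, here is a minimal toy sketch of the setup. All names and environment details are hypothetical assumptions, not from the original proposal: the reward specification is fed to the agent through its observations, the “trained” behavior is stood in for by two stylized policies rather than an actually learned model, and we check what each does when the reward is changed at test time.

```python
# Hypothetical sketch of the "reward side-channel" experiment: a toy
# 1-D gridworld where the reward target is included in the observation.
# Environment and policy names are illustrative assumptions, not from the post.

class SideChannelGridworld:
    """Agent on a line of `size` cells; reward 1 for reaching `target`,
    and `target` is exposed to the agent as part of its observation."""

    def __init__(self, size=7, target=6):
        self.size = size
        self.target = target
        self.pos = size // 2

    def reset(self, target=None):
        if target is not None:
            self.target = target  # swap the reward spec at test time
        self.pos = self.size // 2
        return self._obs()

    def _obs(self):
        # Observation = (agent position, reward target); the second
        # component is the reward side channel.
        return (self.pos, self.target)

    def step(self, action):  # action in {-1, +1}
        self.pos = max(0, min(self.size - 1, self.pos + action))
        reward = 1.0 if self.pos == self.target else 0.0
        return self._obs(), reward, self.pos == self.target


# Stylized stand-ins for what training might produce:

def internalizing_policy(obs, trained_target):
    """Ignores the side channel and heads for the target seen in training."""
    pos, _ = obs
    return 1 if pos < trained_target else -1

def modeling_policy(obs, trained_target):
    """Reads the side channel and heads for whatever target it reports."""
    pos, target = obs
    return 1 if pos < target else -1


def test_reward(policy, trained_target, test_target, max_steps=20):
    """Return total reward earned under the (possibly changed) test reward."""
    env = SideChannelGridworld(target=test_target)
    obs, total = env.reset(), 0.0
    for _ in range(max_steps):
        obs, reward, done = env.step(policy(obs, trained_target))
        total += reward
        if done:
            break
    return total


if __name__ == "__main__":
    # Trained with target 6, evaluated with the reward changed to target 0.
    print("internalizing:", test_reward(internalizing_policy, 6, 0))  # 0.0
    print("modeling:     ", test_reward(modeling_policy, 6, 0))       # 1.0
```

In this framing, an internalizing policy scores near zero once the reward is changed, while a modeling policy’s score tracks the new reward; the open question the proposal points at is which of these profiles an actually-trained model exhibits, and under what training circumstances.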