I agree with other commenters that this effect will be washed out by strong optimization. My intuition is that distinguishing self from other is easy enough (and supported by enough data) that the optimization doesn’t have to be that strong.
[I began writing the following paragraph as a counter-argument to the post, but it ended up less decisive once I thought through the details, as the next paragraph shows:]
There are many general mechanisms for convergence, synchronization, and coordination; I hope to write up a list in the near future. For example, as you wrote, having a model of other agents is obviously generally useful, and it may require having an approximation of both their world models and their value functions as part of your own world model. Unless you have huge amounts of data and compute, you are going to reuse your own world model as theirs, with small corrections on top. But this is about your world model, not your value function.
[The part that helps your argument. Epistemic status: many speculative details, but ones that I find pretty convincing, at least before multiplying their probabilities.]
Except that having the value functions of other agents in your world model, and having the mechanism for predicting their actions as part of your world-model update, basically replicates, in a more general form, computations that you already have in your actor and critic. Your original actor and critic are then likely to simplify to “do the things that my model of myself would do, and value the results as much as my model of myself would” + some corrections. At that stage, if the “some corrections” part is not too heavy, you may get some confusion of the kind that you described. Of course, it will still be optimized against.
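To make the reuse-with-corrections idea concrete, here is a minimal toy sketch (all function names and numbers are hypothetical illustrations, not anything from the post): the agent’s model of another agent’s policy and values is just its own actor and critic plus a small correction term, so when the corrections are small, predictions about the other collapse toward the agent’s own policy and values.

```python
def self_policy(state):
    # The agent's own actor: maps a state to an action preference (toy linear rule).
    return 2.0 * state

def self_value(state):
    # The agent's own critic: maps a state to an estimated value (toy quadratic rule).
    return state ** 2

def model_of_other_policy(state, correction=0.1):
    # "Do what my model of myself would do" + a small learned correction.
    return self_policy(state) + correction

def model_of_other_value(state, correction=0.1):
    # "Value the results as my model of myself would" + a small learned correction.
    return self_value(state) + correction

# With small corrections, the predicted other-policy/other-values are nearly
# indistinguishable from the agent's own -- the self/other "confusion" above.
print(model_of_other_policy(1.0))  # close to self_policy(1.0) = 2.0
print(model_of_other_value(1.0))   # close to self_value(1.0) = 1.0
```

The sketch is only meant to show the structure of the argument: the larger the correction terms have to grow, the less the self-model is reused, and the less room there is for this kind of confusion.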
BTW, speaking about the value function rather than the reward model is useful here, because convergent instrumental goals are a big part of the potential for reusing others’ (deduced) value functions as part of yours. Their terminal goals may then leak into yours, due to simplicity bias or to uncertainty about how to separate them from the instrumental ones.
The main problem with that mechanism is that you liking chocolate will probably leak as “it’s good for me too to eat chocolate”, not “it’s good for me too when beren eats chocolate”, which is more likely to cause conflict than coordination if there is only so much chocolate.
And specifically for humans, I think there probably was evolutionary pressure actively in favor of leaking terminal goals: since the terminal goals of each of us are a noisy approximation of evolution’s “goal” of increasing the number of offspring, that kind of leaking has potential for denoising. I think I have explicitly heard this argument in the context of ideals of beauty (though many other things are going on there, pushing in the same direction).
I agree that this will probably wash out under strong optimization against it, and that such confusions become less likely the more different the world models of yourself and of the other agent you are trying to simulate are; this is exactly what we see with empathy in humans! This is definitely not proposed as a full ‘solution’ to alignment. My thinking is that this effect may be useful to us in providing a natural hook to ‘caring’ about others, which we can then extend by designing training objectives and regimens that optimise this value shard to a much greater extent than it occurs naturally.
We agree 😀
What do you think about some brainstorming in the chat about how to use that hook?