I agree that this will probably wash out under strong optimization against it, and that such confusions become less likely the more the world model of the agent you are trying to simulate differs from your own; this is exactly what we see with empathy in humans! This is definitely not proposed as a full ‘solution’ to alignment. My thinking is that a) this effect may be useful in providing a natural hook to ‘caring’ about others, which we can then build on by designing training objectives and regimens that extend and optimise this value shard to a much greater extent than it occurs naturally.
We agree 😀
What do you think about some brainstorming in the chat about how to use that hook?