Nice post! It was clear, and I agree that knowing more about the basin of attraction is useful. I also like that you caveat the usefulness of this idea yourself.
Communication priors suggest an approach to certain problems in AI alignment. Intuitively, rather than saying “I want X” and the AI taking that completely literally (as computers generally do), the AI instead updates on the fact that I said “I want X”, and tries to figure out what those words imply about what I actually want. It’s like pushing the “do what I mean” button—the AI would try to figure out what we mean, rather than just doing what we say.
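A minimal sketch of that kind of update, with a toy hypothesis space and made-up numbers (the hypotheses, prior, and likelihoods below are all illustrative assumptions, not anything from the post):

```python
# Toy Bayesian "do what I mean" update: the AI treats the utterance
# "I want X" as evidence about the underlying intention, rather than
# as a literal command. All numbers here are invented for illustration.

# Hypotheses about what the speaker actually wants.
prior = {
    "literal X":             0.4,  # they really do want exactly X
    "X, within safe limits": 0.5,  # the usual implicit caveats apply
    "something else":        0.1,
}

# How likely each intention is to produce the utterance "I want X".
likelihood_of_utterance = {
    "literal X":             0.9,
    "X, within safe limits": 0.8,
    "something else":        0.1,
}

# Posterior over intentions after hearing "I want X" (Bayes' rule).
unnormalized = {h: prior[h] * likelihood_of_utterance[h] for h in prior}
total = sum(unnormalized.values())
posterior = {h: p / total for h, p in unnormalized.items()}

print(posterior)  # the AI acts on this, not on the literal string "X"
```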
This makes me think of Inverse Reward Design, where the given reward signal is interpreted as evidence about the designer's intention, in the context of the specific training environments it was designed for.
More generally: each player’s optimal choices depend heavily on their model of the other player. Alice wants to act like Bob’s model of Alice, and Bob wants to act like Alice’s model of Bob. Then there’s the whole tower of Alice’s model of Bob’s model of Alice’s model of… Our P_k[X|"M"] sequence shows what that tower looks like for one particular model of Alice/Bob.
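One concrete way to unroll that tower is the alternating speaker/listener iteration used in Rational Speech Acts style models; the sketch below only shows the shape of the recursion, not the post's actual P_k[X|"M"] definitions (the meanings, messages, and literal semantics are all invented for illustration):

```python
# Iterated speaker/listener modeling: each listener level interprets a
# message by reasoning about the previous speaker level, and vice versa.
# Everything here (meanings, messages, literal semantics) is a toy example.

meanings = ["needs water", "needs food"]
messages = ["thirsty", "hungry"]

# Literal semantics: how compatible each message is with each meaning.
literal = {
    ("thirsty", "needs water"): 1.0, ("thirsty", "needs food"): 0.1,
    ("hungry", "needs water"): 0.1,  ("hungry", "needs food"): 1.0,
}

def normalize(d):
    z = sum(d.values())
    return {k: v / z for k, v in d.items()}

# Level-0 listener: interpret messages purely by literal semantics.
listener = {m: normalize({x: literal[(m, x)] for x in meanings}) for m in messages}

for k in range(3):
    # Speaker at level k+1: choose messages in proportion to how well the
    # level-k listener would recover the intended meaning from them.
    speaker = {x: normalize({m: listener[m][x] for m in messages}) for x in meanings}
    # Listener at level k+1: update on the fact that that speaker chose "M".
    listener = {m: normalize({x: speaker[x][m] for x in meanings}) for m in messages}

print(listener)  # a P_k[X|"M"]-style interpretation after a few levels of the tower
```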
Makes me think of Common Knowledge, as defined for distributed computing: ϕ is common knowledge iff everyone knows that everyone knows that … that everyone knows ϕ. That probably only applies to the idealized case, but it might be another way to look at it.
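For reference, the standard formalization from the distributed-computing / epistemic-logic literature (e.g. Halpern and Moses), writing Eϕ for "everyone knows ϕ":

```latex
% "Everyone knows" operator over a group G of agents, and common knowledge
% as the infinite conjunction of its iterates (standard epistemic-logic definition).
\[
  E\varphi \;\equiv\; \bigwedge_{i \in G} K_i \varphi,
  \qquad
  E^{1}\varphi \equiv E\varphi,
  \quad
  E^{k+1}\varphi \equiv E\bigl(E^{k}\varphi\bigr),
  \qquad
  C\varphi \;\equiv\; \bigwedge_{k \ge 1} E^{k}\varphi .
\]
```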