That mostly sounds pretty compatible with this post?
For instance, the self-model part: on this post’s model, the human uses their usual epistemic machinery—i.e. world model—in the process of modeling rewards. That world model includes a self-model. So insofar as X and me-pursuing-X generate different rewards, the human would naturally represent those rewards as generated by different components of value, i.e. they’d estimate different value for X vs me-pursuing-X.