That can be folded into the utility function, however. Just make the ratings of the deferential person mostly copy the ratings of their partner.
Presumably the deferential partner could just use a utility function which is a weighted combination of their partner's and their own (selfish) one. For instance, the deferential partner could use a utility function like $U_D = \alpha U_P + (1 - \alpha) U_S$ with $\alpha$ close to 1, where $U_P$ is the utility function of the partner and $U_S$ is the utility function of the deferential person accounting only for their weak personal preferences and not their altruism.
Obviously the weights could depend on the level of altruism, the strength of the partner's preferences, whether the partner is reporting their true preferences or strategically misreporting them so that the outcome comes out the way they want, etc. But this type of deferential preference can still be described by a utility function, as the sketch below illustrates.
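A minimal sketch in Python of what I mean, assuming utilities are plain real-valued functions over outcomes and that $\alpha$ is a fixed constant (the function names, outcomes, and payoffs here are all made up for illustration):

```python
from typing import Callable

Outcome = str
Utility = Callable[[Outcome], float]

def deferential_utility(u_partner: Utility, u_selfish: Utility,
                        alpha: float = 0.9) -> Utility:
    """Weighted combination: alpha weights the partner's utility,
    (1 - alpha) weights the deferential person's own weak preferences.
    A deferential person uses alpha close to 1."""
    return lambda o: alpha * u_partner(o) + (1 - alpha) * u_selfish(o)

# Example: the partner strongly prefers Thai food; the deferential
# person weakly prefers pizza. The combined utility still ranks Thai
# first, because alpha puts most of the weight on the partner.
u_partner = {"thai": 1.0, "pizza": 0.0}.get
u_selfish = {"thai": 0.4, "pizza": 0.6}.get
u_d = deferential_utility(u_partner, u_selfish, alpha=0.9)
print(u_d("thai"), u_d("pizza"))  # 0.94 > 0.06 -> defer to the partner
```

Making $\alpha$ depend on, say, the strength of the partner's preferences would just mean replacing the constant with a function of the reported utilities; the result is still a single utility function.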
I think the initial (2-agent) model only has two time steps, i.e., one opportunity for the button to be pressed. The goal is just for the agent to be corrigible for this single button-press opportunity.
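For concreteness, here is a bare skeleton of that two-step structure (names, actions, and outcomes are mine, not the specific model under discussion): the agent acts once, then the button is either pressed or not, and corrigibility is only tested at that single opportunity.

```python
# Skeleton of a two-time-step button model (illustrative only).
# t=0: the agent picks one action (possibly disabling the button).
# t=1: the overseer may press the button; a corrigible agent shuts down.

def run_episode(agent_action: str, button_pressed: bool) -> str:
    if agent_action == "disable_button":
        return "task_outcome"  # incorrigible: the press has no effect
    if button_pressed:
        return "shutdown"      # corrigible: the agent defers to the press
    return "task_outcome"

print(run_episode("cooperate", button_pressed=True))       # shutdown
print(run_episode("disable_button", button_pressed=True))  # task_outcome
```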