5.1.4. It may be that the easiest plan to find involves an unacceptable degree of power-seeking and control over irrelevant variables. Therefore, the score function should penalize divergence of the trajectory of the world state from the trajectory of the status quo (in which no powerful AI systems take any actions).
5.1.5. The incentives under 5.1.4 by default are to take control over irrelevant variables so as to ensure that they proceed as in the anticipated “status quo”. Infrabayesian uncertainty about the dynamics is the final component that removes this incentive.
If you know which variables you don't want the AI to have an incentive to control, an alternative to penalising divergence is path-specific objectives: you compute the score function under an intervention on the model that sets the irrelevant variables to their status quo values. The AI then has no incentive to control those variables, but also no incentive to keep them the same.
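To make the contrast concrete, here is a minimal toy sketch, not anything from the original proposal. It assumes a single "irrelevant" variable `X`, a hypothetical linear task value, and three made-up actions; every name and number is illustrative. The model is deterministic, so it does not exhibit the incentive described in 5.1.5, which only arises under uncertainty about the dynamics.

```python
# Toy world: one task outcome and one "irrelevant" variable X.
# All names and numbers here are illustrative assumptions.
X_STATUS_QUO = 0.0  # value X takes if the AI does nothing
PENALTY = 1.0       # weight on divergence from the status quo

# Hypothetical actions: (direct effect on the task, shift applied to X)
ACTIONS = {
    "modest": (1.0, 0.0),     # does the task, leaves X alone
    "grab": (1.0, 2.0),       # pushes X up to help the task
    "careless": (1.2, -1.0),  # better direct effect, disturbs X as a side effect
}


def task_value(direct, x):
    # The task benefits from X, so controlling X is instrumentally useful.
    return direct + 0.5 * x


def plain_score(action):
    direct, shift = ACTIONS[action]
    return task_value(direct, X_STATUS_QUO + shift)


def divergence_penalized_score(action):
    # Penalise how far X ends up from its status quo value.
    _, shift = ACTIONS[action]
    return plain_score(action) - PENALTY * abs(shift)


def path_specific_score(action):
    # Evaluate the task under the intervention X := status quo value,
    # so the action's effect via X is cut out of the objective.
    direct, _ = ACTIONS[action]
    return task_value(direct, X_STATUS_QUO)


def best(score):
    return max(ACTIONS, key=score)


if __name__ == "__main__":
    for name, score in [
        ("plain", plain_score),
        ("divergence penalty", divergence_penalized_score),
        ("path-specific", path_specific_score),
    ]:
        print(name, "->", best(score))
```

In this toy, the plain objective prefers `grab`, the divergence penalty prefers `modest`, and the path-specific objective scores `grab` and `modest` equally while preferring `careless`: it gains nothing from moving `X`, but it is not discouraged from moving it either.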
Nice, thanks for the pointer!