I read the section you linked, but I can’t follow it. Anyway, here is its concluding paragraph:
Conclusion: Optimal policies for u-AOH will tend to look like random twitching. For example, if you generate a u-AOH by uniformly randomly assigning each AOH utility from the unit interval [0,1], there’s no predictable regularity to the optimal actions for this utility function. In this setting and under our assumptions, there is no instrumental convergence without further structural assumptions.
From this alone, I get the impression that he hasn’t proved that “there isn’t instrumental convergence”, but that “there isn’t a totally general instrumental convergence that applies even to very wild utility functions”.
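To make the quoted conclusion a bit more concrete, here is a minimal sketch of the "uniformly random utility over histories" setup. It is my own toy illustration, not the construction from the linked section: I assume a deterministic environment, so that an action sequence fixes the whole action-observation history, and the function name and parameters (`first_action_counts`, `n_actions`, `horizon`) are hypothetical.

```python
import itertools
import random
from collections import Counter


def first_action_counts(n_actions=3, horizon=3, n_samples=3000, seed=0):
    """Count how often each first action begins the utility-maximizing history.

    Assumption (not from the linked post): the environment is deterministic,
    so each action sequence corresponds to exactly one full history.
    """
    rng = random.Random(seed)
    histories = list(itertools.product(range(n_actions), repeat=horizon))
    counts = Counter()
    for _ in range(n_samples):
        # Draw an independent utility from [0, 1) for every full history.
        utility = {h: rng.random() for h in histories}
        # The optimal policy simply follows the highest-utility history.
        best = max(histories, key=utility.__getitem__)
        counts[best[0]] += 1
    return counts


counts = first_action_counts()
total = sum(counts.values())
for action in sorted(counts):
    print(action, counts[action] / total)
```

By symmetry, each first action should come out optimal for roughly a third of the sampled utility functions, which is the "no predictable regularity" point in this toy setting. It says nothing about utility functions with more structure.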
A key part of instrumental convergence is the convergence aspect, which as I understand it refers to the notion that even very wild utility functions will share certain preferences, e.g. the empirical tendency for random chess board evaluations to prefer mobility. If you don’t have convergence, you don’t have instrumental convergence.
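As a rough analogue of the mobility point (a toy sketch, not a chess engine and not anyone's published experiment): suppose random utilities are assigned to final states rather than to whole histories, and one move keeps more final states reachable than the other. The function name and the branch sizes below are hypothetical choices for illustration.

```python
import random


def fraction_preferring_branch(n_small=1, n_large=3, n_samples=4000, seed=0):
    """Fraction of random utility functions whose best outcome is in the larger branch.

    Each final state gets an independent uniform utility; an optimal agent
    takes the move whose best reachable final state has the higher utility.
    """
    rng = random.Random(seed)
    prefers_large = 0
    for _ in range(n_samples):
        best_small = max(rng.random() for _ in range(n_small))
        best_large = max(rng.random() for _ in range(n_large))
        if best_large > best_small:
            prefers_large += 1
    return prefers_large / n_samples


frac = fraction_preferring_branch()
print(frac)
```

With independent uniform utilities, the best of the four final states is equally likely to be any of them, so the estimate should land near 3/4. That is the sense of "convergence" here: most randomly drawn utility functions agree on the move that keeps more options open.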
But shard theorists mainly aim to address agency obtained via DPO-like setups, and @TurnTrout has mathematically proved that such setups don’t favor the power-seeking drives AI safety researchers are usually concerned about in the context of agency.
OK. Then I’ll say that randomly assigned utilities over full trajectories are beyond wild!
The basin of attraction just needs to be large enough. AIs will intentionally be created with more structure than that.
The issue isn’t the “full trajectories” part; that actually makes instrumental convergence stronger. The issue is the “actions” part. In terms of RLHF, this means that people might not blindly follow the instructions the AI gives and rate them by the ultimate outcome (even if that outcome differs wildly from what they’d intuitively expect the instructions to produce). Instead, they might think about the instructions and rate them on whether they make sense a priori. If the AI then has some galaxy-brained method of achieving something (one that would traditionally be instrumentally convergent) that humans don’t understand, that method will be negatively reinforced, because people don’t see the point of it and therefore downvote it, which eliminates dangerous power-seeking.