The issue isn’t the “full trajectories” part; that actually makes instrumental convergence stronger. The issue is the “actions” part. In RLHF terms, this means that people might not blindly follow the AI’s instructions and then rate them by the eventual outcome (even when that outcome differs wildly from what they’d intuitively expect the instructions to produce). Instead, they might think about the instructions the AI provides and rate them on whether they make sense a priori. If the AI then has some galaxy-brained method of achieving something (one that would traditionally be instrumentally convergent) that humans don’t understand, that method gets negatively reinforced, because people don’t see the point of it and therefore downvote it. This eliminates dangerous power-seeking.
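To make the contrast concrete, here is a toy sketch of the two rating schemes. It is not real RLHF: the `Plan` fields, the two rating functions, the numeric values, and the update rule are all assumptions made up for illustration. In particular, it assumes raters give a flat upvote to any plan they understand and a flat downvote to any plan they don't.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    outcome_value: float     # hypothetical: how good the result is if executed
    humans_understand: bool  # hypothetical: can raters see a priori why it works?


def outcome_rating(plan: Plan) -> float:
    # Rater blindly carries out the plan and scores only the final result.
    return plan.outcome_value


def approval_rating(plan: Plan) -> float:
    # Rater judges the plan before executing it; plans whose point they
    # cannot see are downvoted regardless of how well they would work.
    return 1.0 if plan.humans_understand else -1.0


def train(plans, rate, steps=100, lr=0.1):
    # Toy stand-in for reinforcement: each plan's preference score is
    # nudged toward the rating it receives.
    prefs = {p.name: 0.0 for p in plans}
    for _ in range(steps):
        for p in plans:
            prefs[p.name] += lr * (rate(p) - prefs[p.name])
    return prefs


def favoured(plans, rate) -> str:
    # The plan the trained policy ends up preferring most.
    prefs = train(plans, rate)
    return max(prefs, key=prefs.get)


PLANS = [
    Plan("ordinary plan", outcome_value=0.6, humans_understand=True),
    Plan("galaxy-brained plan", outcome_value=0.9, humans_understand=False),
]

if __name__ == "__main__":
    print("outcome-based rating favours: ", favoured(PLANS, outcome_rating))
    print("approval-based rating favours:", favoured(PLANS, approval_rating))
```

Under these assumptions, outcome-based rating ends up preferring the galaxy-brained plan because it scores higher on results, while approval-based rating pushes its preference score negative and leaves the ordinary plan on top.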