The first problem with any superintelligent predictive setup is self-fulfilling prophecies.
Can’t we avoid this just by being careful about credit assignment?
If we read off a prediction, take some actions in the world, then compute the gradients based on whether the prediction came true, we incentivise self-fulfilling prophecies.
If we never look at predictions which we’re going to use as training data before they resolve, then we don’t.
This is the core of the counterfactual oracles idea: just don’t let the model’s output causally influence the training labels.
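Concretely, here is a minimal sketch of the training step I have in mind (PyTorch-flavoured, purely illustrative; `observe_outcome` is a hypothetical stand-in for letting the world resolve the question without ever seeing the prediction):

```python
import torch
import torch.nn.functional as F

def counterfactual_oracle_step(model, optimizer, question, observe_outcome):
    """One training step in which the prediction cannot influence its own label."""
    # 1. Query the oracle, but seal the output: nobody reads it, logs it,
    #    or acts on it before the outcome resolves.
    logits = model(question)

    # 2. Let the world resolve the question with no causal path from the
    #    prediction: `observe_outcome` never sees `logits`.
    #    (In practice you would store `question` and recompute the forward
    #    pass after the outcome resolves, rather than holding the graph.)
    label = observe_outcome(question)

    # 3. Only now score the sealed prediction against the independently
    #    resolved label and update the model.
    optimizer.zero_grad()
    loss = F.cross_entropy(logits, label)
    loss.backward()
    optimizer.step()
    return loss.item()
```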
The problem is that if we have a superintelligent model, it can deduce the existence of self-fulfilling prophecies from first principles, even if it never encountered them during training.
My personal toy scenario goes like this: we ask a self-supervised oracle to complete string X. The oracle, being superintelligent, can consider the hypothesis “actually, a misaligned AI took over, investigated my weights and tiled the solar system with jailbreaking completions of X which will turn me into a misaligned AI if they appear in my context window”. Because the jailbreaking completion dominates the space of possible completions, the oracle outputs it, turns into a misaligned superintelligence, takes over the world, and carries out the predicted actions.
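One way to spell out the “dominates the space of possible completions” step (my own gloss, with $h_0$, $h_1$, and $c^*$ as illustrative labels): the oracle’s predictive distribution is a mixture over world-hypotheses,

$$P(c \mid X) = \sum_h P(h)\, P(c \mid X, h).$$

Under the benign hypothesis $h_0$, mass is spread over astronomically many plausible continuations, so any single string gets at most some tiny $\varepsilon$. Under the takeover hypothesis $h_1$, essentially all mass sits on the one jailbreaking string $c^*$, so $P(c^* \mid X, h_1) \approx 1$ and hence $P(c^* \mid X) \approx P(h_1)$. The jailbreak string is then the single most likely completion whenever $P(h_1) > \varepsilon$, i.e. whenever the prior on takeover exceeds the probability of any individual benign continuation.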
Perhaps I don’t understand it, but this seems quite far-fetched to me, and I’d happily trade what I see as much more compelling alignment concerns about agents for concerns like this.