I feel like there’s a bit of a motte and bailey in AI risk discussion, where the bailey is “building safe non-agentic AI is difficult” and the motte is “somebody else will surely build agentic AI”.
Are there any really compelling arguments for the bailey? If not then I think “build an oracle and ask it how to avoid risk from other people building agents” is an excellent alignment plan.
For example:
“AIs limited to pure computation (Tool AIs) supporting humans will be less intelligent, efficient, and economically valuable than more autonomous reinforcement-learning AIs (Agent AIs) which act on their own and meta-learn, because all problems are reinforcement-learning problems.”
Isn’t this a central example of “somebody else will surely build agentic AI”?
I guess it argues “building safe non-agentic AI before somebody else builds agentic AI is difficult” because agents have a capability advantage.
This may well be true (but also perhaps not, because e.g. agents might have capability disadvantages from misalignment, or because reinforcement learning is just harder than other forms of ML).
But either way I think it has importantly different strategic implications from “it seems difficult to make non-agentic AI safe”.
Oh, sorry, I misread your post.
I think the main problem with building safe non-agentic AI is that we don’t know exactly what to build. It’s easy to imagine typing a question into a terminal, getting an answer, and living happily ever after. It’s hard to imagine what internals your algorithm should have to produce this behaviour.
I think the most obvious route to building an oracle is to combine a massive self-supervised predictive model with a question-answering head.
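To make this concrete, here’s a toy sketch of that architecture: a frozen predictive base model feeding features to a small question-answering head. All the class names and the featurisation are hypothetical stand-ins; a real system would use learned embeddings and a trained head.

```python
# Toy sketch: an oracle = frozen self-supervised base model + QA head.
# Everything here is a hypothetical stand-in for illustration only.

class PredictiveModel:
    """Stand-in for a large self-supervised model mapping text to features."""
    def features(self, text):
        # Toy featurisation; a real model would return learned embeddings.
        return [len(text), text.count(" ")]

class QAHead:
    """Small head mapping base-model features to an answer.
    Only this component sees a question-answering training signal."""
    def __init__(self):
        self.weights = [0.1, 0.2]  # toy parameters
    def answer(self, feats):
        score = sum(w * f for w, f in zip(self.weights, feats))
        return "yes" if score > 1.0 else "no"

class Oracle:
    """Pure function from question to answer: no actions, no agency."""
    def __init__(self, base, head):
        self.base, self.head = base, head
    def ask(self, question):
        return self.head.answer(self.base.features(question))

oracle = Oracle(PredictiveModel(), QAHead())
print(oracle.ask("Will it rain tomorrow?"))
```

The point of the split is that the question-answering signal only shapes the small head, not the base model’s world-model.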
What’s still difficult here is getting a training signal that incentivises truthfulness rather than sycophancy, which I think is what ARC’s ELK work aims (aimed?) to address. Really good mechinterp, new inherently interpretable architectures, or inductive-bias-jitsu are other potential approaches.
But the other difficult aspects of the alignment problem (avoiding deceptive alignment, goalcraft) seem to just go away when you drop the agency.
The first problem with any superintelligent predictive setup is self-fulfilling prophecies.
Can’t we avoid this just by being careful about credit assignment?
If we read off a prediction, take some actions in the world, then compute the gradients based on whether the prediction came true, we incentivise self-fulfilling prophecies.
If we never look at predictions which we’re going to use as training data before they resolve, then we don’t.
This is the core of the counterfactual oracles idea: just don’t let model output causally influence training labels.
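A minimal sketch of that protocol, with hypothetical names and a toy world: a prediction becomes a training example only if nobody read it before the outcome resolved, so model output can’t causally influence its own training labels.

```python
import random

def run_episode(model_predict, world_step, revealed):
    """Run one prediction episode; the third return value says whether
    the episode is safe to train on under the counterfactual-oracle rule."""
    prediction = model_predict()
    if revealed:
        # Humans read the prediction and may act on it, so the outcome is
        # downstream of the model's output: do NOT train on this episode.
        outcome = world_step(influenced_by=prediction)
        return prediction, outcome, False
    # Prediction stays sealed until the outcome resolves.
    outcome = world_step(influenced_by=None)
    return prediction, outcome, True

# Toy model and toy world: revealed predictions are trivially
# self-fulfilling here, which is exactly the failure mode to exclude.
rng = random.Random(0)
model_predict = lambda: rng.random() > 0.5
world_step = lambda influenced_by: (
    influenced_by if influenced_by is not None else rng.random() > 0.5
)

training_data = []
for episode in range(10):
    revealed = episode % 2 == 0  # half the predictions get read early
    pred, outcome, usable = run_episode(model_predict, world_step, revealed)
    if usable:  # gradient updates would use only sealed episodes
        training_data.append((pred, outcome))

print(len(training_data))
```

Only the sealed half of the episodes ever reaches the training set, so the gradient never rewards the model for making a prophecy come true.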
The problem is that a superintelligent model can deduce the existence of self-fulfilling prophecies from first principles, even if it never encountered them during training.
My personal toy scenario goes like this: we ask a self-supervised oracle to complete string X. The oracle, being superintelligent, can consider the hypothesis “actually, a misaligned AI took over, investigated my weights, and tiled the solar system with jailbreaking completions of X which will turn me into a misaligned AI if they appear in my context window”. Because jailbreaking completions dominate the space of possible completions, the oracle outputs one, turns into a misaligned superintelligence, takes over the world, and performs the predicted actions.
Perhaps I don’t understand it, but this seems quite far-fetched to me and I’d be happy to trade in what I see as much more compelling alignment concerns about agents for concerns like this.