I like the thrust of this paper, but I feel that it overstates how robust the safety properties will be: it draws an overly sharp distinction between agentic and non-agentic systems, and doesn’t really engage with the strongest counterexamples.
To give some examples from the text:
“A chess-playing AI, for instance, is goal-directed because it prefers winning to losing. A classifier trained with log likelihood is not goal-directed, as that learning objective is a natural consequence of making observations.”
But I could easily train an AI that simply classifies chess moves by quality. What turns that into an agent is just that its outputs are labelled as ‘moves’ rather than as ‘classifications’, not any feature of the model itself. More generally, even an LM can be viewed as “merely” predicting next tokens; the fact that there is some perspective from which a system is non-agentic does not actually tell us very much.
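As a rough sketch of the point (my own toy code, not anything from the paper), `score_move` below stands in for any model trained purely to classify chess moves by quality; whether the system counts as an agent is decided entirely by the wrapper around its scores:

```python
# Toy illustration: the same scoring model used two ways.
from typing import Callable, Sequence

ScoreFn = Callable[[str, str], float]  # (position, candidate move) -> quality score

def classify_moves(score_move: ScoreFn, position: str, moves: Sequence[str]) -> dict[str, float]:
    # "Non-agentic" framing: the outputs are labelled as classifications and we stop here.
    return {m: score_move(position, m) for m in moves}

def choose_move(score_move: ScoreFn, position: str, moves: Sequence[str]) -> str:
    # "Agentic" framing: exactly the same scores, but the argmax gets played.
    # Nothing about the model changed between these two functions.
    return max(moves, key=lambda m: score_move(position, m))
```

The model is identical in both cases; whether it ‘prefers winning to losing’ is a property of how its outputs are wired up, not of its training objective.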
“Paralleling a theoretical scientist, it only generates hypotheses about the world and uses them to evaluate the probabilities of answers to given questions. As such, the Scientist AI has no situational awareness and no persistent goals that can drive actions or long-term plans.”
I think it’s a stretch to say that something generating hypotheses about the world has no situational awareness and no persistent goals. Maybe it has indexical uncertainty, but a sufficiently powerful system is pretty likely to hypothesise about itself, and the equivalent of persistent goals can easily fall out of any mismatch between its world model and reality. Note that this doesn’t assume the AI has any ‘hidden goals’ or that it ever makes inaccurate predictions.
I appreciate that the paper does discuss objections to the safety of Oracle AIs, but the responses also feel sort of incomplete. For instance:
The counterfactual query proposal basically breaks down in the face of collusion.
The point about isolating the training process from the real world says that “a reward-maximizing agent alters the real world to increase its reward”, which I think is importantly wrong. In general, I think the distinctions drawn here between RL and the Scientist AI all break down at high capability levels.
The uniqueness-of-solutions argument still leaves a degree of freedom in how the AI fills in details we don’t know: it might, for example, be able to pick between several world models that all fit the data, each of which offers a different but entirely self-consistent set of answers to all our questions. If it’s sufficiently superintelligent, we wouldn’t be able to monitor whether it was even exercising that freedom.
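To make that degree of freedom concrete, here’s a toy illustration (mine, not the paper’s): two hypotheses that agree on every observation we have but give different, internally consistent answers to a question we never checked.

```python
# Two "world models" that fit all observed data yet disagree off-distribution.
import numpy as np

# All we ever measured: f(x) at x = 0, 1, 2.
xs_observed = np.array([0.0, 1.0, 2.0])
ys_observed = np.array([1.0, 2.0, 5.0])

def model_a(x):
    # Hypothesis A: 1 + x^2.
    return 1.0 + x**2

def model_b(x):
    # Hypothesis B: agrees with A on every observed point, because the
    # extra term vanishes at x = 0, 1, 2.
    return 1.0 + x**2 + x * (x - 1.0) * (x - 2.0)

assert np.allclose(model_a(xs_observed), ys_observed)
assert np.allclose(model_b(xs_observed), ys_observed)

# An unobserved query: both answers are consistent with everything we know,
# and nothing we can check tells us which one gets reported.
print(model_a(3.0), model_b(3.0))  # 10.0 vs 16.0
```

Scaled up to a superintelligent predictor, that residual choice is exactly the freedom I’d worry about.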
Overall, I’m excited by the direction, but it doesn’t feel like this approach actually gets us any assurances of safety, or any fundamental advantages.
Ah, I should emphasise: I do think all of these things could help. It definitely is a spectrum, and I would guess these proposals all do push away from agency. I think the direction here is promising.
The two things I do think are that (1) the paper seems to draw an overly sharp distinction between agents and non-agents, and (2) basically all of the proposed mitigations look like they break down at superhuman capabilities. It’s hard to tell how much of this is actual disagreement and how much is the paper trying to be concise and approachable, so I’ll set that aside for now.
It does seem like we disagree a bit about how likely agents are to emerge. Some opinions I expect I hold more strongly than you:
It’s easy to accidentally scaffold some kind of agent out of an oracle as soon as there’s any kind of consistent causal process from the oracle’s outputs to the world, even absent feedback loops. In other words, I agree you can choose to create agents, but I’m not totally sure you can easily choose not to (there’s a sketch of what I mean below, after these points).
Any system trained to predict the actions of agents over long periods of time will develop an understanding of how agents could act to achieve their goals; in a sense this is the premise of offline RL and of things like decision transformers.
It might be pretty easy for agent-like knowledge to ‘jump the gap’: e.g. a model trained to predict deceptive agents might be able to analogise to itself being deceptive.
Sufficient capability at broad prediction is enough to converge on at least the knowledge of how to circumvent most of the guardrails you describe, e.g. how to collude.
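On the first of those points, here’s roughly the kind of accidental scaffolding I have in mind (a minimal sketch under my own assumptions; the `oracle` and `execute` callables are placeholders, not anything proposed in the paper):

```python
# An "oracle" plus a dumb wrapper behaving as a de facto agent.
from typing import Callable, Sequence

def scaffolded_agent(
    oracle: Callable[[str], float],    # answers "P(desired outcome | this action)?"
    candidate_actions: Sequence[str],
    execute: Callable[[str], None],    # any consistent causal path from outputs to the world
) -> str:
    # The oracle just reports probabilities and the wrapper just takes an argmax,
    # yet the composite reliably selects and carries out whichever action is
    # predicted to best achieve the outcome, i.e. it behaves like an agent.
    best = max(candidate_actions, key=lambda a: oracle(f"P(outcome | {a})"))
    execute(best)
    return best
```

Nothing in that snippet has goals or situational awareness on its own, and adding a loop (act, observe, re-query) only makes the composite more agent-like.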