Ah I should emphasise, I do think all of these things could help—it definitely is a spectrum, and I would guess these proposals all do push away from agency. I think the direction here is promising.
The two things I'd push back on are (1) the paper seems to draw an overly sharp distinction between agents and non-agents, and (2) basically all of the mitigations proposed look like they break down with superhuman capabilities. It's hard to tell how much of this is actual disagreement and how much is the paper trying to be concise and approachable, so I'll set that aside for now.
It does seem like we disagree a bit about how likely agents are to emerge. Some opinions I expect I hold more strongly than you:
It’s easy to accidentally scaffold some kind of agent out of an oracle as soon as there’s any kind of consistent causal process from the oracle’s outputs to the world, even absent feedback loops. In other words, I agree you can choose to create agents, but I’m not totally sure you can easily choose not to (there’s a minimal sketch of what I mean after this list).
Any system trained to predict the actions of agents over long periods of time will develop an understanding of how agents could act to achieve their goals; in a sense this is the premise of offline RL and things like decision transformers (second sketch below).
It might be pretty easy for agent-like knowledge to ‘jump the gap’, e.g. a model trained to predict deceptive agents might be able to analogise to itself being deceptive
Sufficient capability at broad prediction is enough to converge on at least the knowledge of how to circumvent most of the guardrails you describe, e.g. how to collude
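To make the first point concrete, here's a deliberately minimal sketch (toy names, a hypothetical stand-in oracle, nothing from the paper itself) of how a pure question-answerer becomes agent-shaped the moment there's a consistent causal path from its outputs to the world:

```python
# Minimal sketch: an "oracle" plus a wrapper that consistently acts on its
# outputs is already an agent-shaped system. All names here are illustrative.

import subprocess

def oracle(prompt: str) -> str:
    """Stand-in for a purely predictive model that only answers questions."""
    # In reality this would be a call to some predictive model's API.
    return "echo 'made some room'"

def run_task(goal: str, steps: int = 3) -> None:
    context = f"Goal: {goal}\n"
    for _ in range(steps):
        # The oracle is only ever asked "what command would help here?"...
        command = oracle(context + "What shell command should be run next?")
        # ...but because its answer is executed automatically, the combined
        # oracle-plus-wrapper now selects actions in pursuit of a goal, i.e.
        # it behaves like an agent even though no component was built as one.
        result = subprocess.run(command, shell=True, capture_output=True, text=True)
        # Even without this line (i.e. with no feedback loop at all), the
        # consistent output-to-world path above already does most of the work.
        context += f"Ran: {command}\nGot: {result.stdout}"

if __name__ == "__main__":
    run_task("free up disk space")
```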
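And for the offline RL / decision transformer point, a toy illustration of the premise (illustrative PyTorch, not any particular published implementation): a model trained only to predict logged agents' actions, conditioned on return-to-go, can be steered into goal-directed action selection at inference time just by conditioning on a high return.

```python
# Toy return-conditioned action predictor. Trained purely by supervised
# prediction of logged agent behaviour, yet conditioning on a high return
# at inference asks "what would a *successful* agent do?", which is already
# agent-shaped. Shapes and hyperparameters are illustrative only.

import torch
import torch.nn as nn

class TinyActionPredictor(nn.Module):
    def __init__(self, state_dim: int = 4, n_actions: int = 2, hidden: int = 32):
        super().__init__()
        # Input is (return-to-go, state); output is a distribution over actions.
        self.net = nn.Sequential(
            nn.Linear(1 + state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, return_to_go: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([return_to_go, state], dim=-1))

# Training is ordinary supervised prediction of actions in offline data...
model = TinyActionPredictor()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
rtg = torch.rand(64, 1)                # returns-to-go observed in the logs
states = torch.randn(64, 4)            # logged states
actions = torch.randint(0, 2, (64,))   # actions the logged agents actually took
loss = nn.functional.cross_entropy(model(rtg, states), actions)
loss.backward(); opt.step()

# ...but at deployment, conditioning on a high return-to-go elicits
# goal-directed action choice from a purely predictive model.
with torch.no_grad():
    chosen_action = model(torch.tensor([[10.0]]), torch.randn(1, 4)).argmax(dim=-1)
```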