I guess one crux for sharing research on LM agents is whether there are viable alternative paths to powerful AI systems. If LM agents are clearly the easiest path, there’s less reason to share research on them; if a less-safe path looks similarly easy, we should differentially advance LM agents.
I’m not aware of alternative paths that look anywhere near as easy as LM-agents. Or: I don’t know what viable alternative paths LM-agents are supposed to be safer than. (Edit: some alignment researcher friends mention old-fashioned RL agents as a possible path to powerful AI that’s less safe than LM-agents but say that path looks substantially harder than LM-agents, such that we don’t need to boost LM-agents more.)
Maybe rather than ‘different paths’ Paul just means that capabilities can come from more-powerful-LMs or more-sophisticated-agent-scaffolding. He says:
“at a fixed level of capability, I think the more we are relying on LM agents (rather than larger LMs) the safer we are.”
I buy something like this, at least. But (I weakly intuit) we’ll almost exclusively be relying on LM agents rather than mere next-token-predictors by default; there’s no need to boost LM agents. And even if that’s good, that doesn’t mean that marginal improvements in LM agents’ sophistication/complexity are safer than marginal improvements in underlying-LM-capability. (I don’t have a take on this—just flagging it as a crux.)
My guess is that if you hold capability fixed and make a marginal move in the direction of (better LM agents) + (smaller LMs) then you will make the world safer. It straightforwardly decreases the risk of deceptive alignment, makes oversight easier, and decreases the potential advantages of optimizing on outcomes.