I don’t really understand why people expect to be able to extract dramatically more safety research from AIs. It looks like it’s based on a naive extrapolation that doesn’t account for misalignment (setting aside AI-boxing plans). I’m just saying that an AI will have goals, imprecisely specified, by around the same time it’s general enough to do research. So I expect it to be a pain to cajole this thing into doing as vaguely specified a task as “solve alignment properly”.
It seems to me that, until now at least, it’s been relatively easy to extract AI research out of LM agents (like in https://sakana.ai/ai-scientist/), with researchers (publicly, at least) only barely starting to try to extract research out of such systems. For now, not much cajoling seems to have been necessary, AFAICT; e.g. looking at their prompts, they seem pretty intuitive. Current outputs of https://sakana.ai/ai-scientist/ seem to me to be at least sometimes around workshop-paper level, and I expect even the somewhat obvious improvements (e.g. more experimentation with prompting, more inference-time reflection rounds, more GPU time for experiments, etc.) to significantly increase the quality of the generated papers.
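To make the ‘more reflection rounds’ knob concrete, here’s a minimal, purely hypothetical sketch of a draft-and-reflect loop; `llm` stands in for whatever chat-completion call the scaffold uses, and none of these names come from Sakana’s actual codebase:

```python
# Hypothetical sketch of an iterative draft-and-reflect scaffold.
# `llm` is any callable taking a prompt string and returning text;
# these names are illustrative, not from Sakana's code.

def write_with_reflection(idea, llm, rounds=3):
    """Draft a paper, then alternate critique and revision `rounds` times."""
    draft = llm(f"Write a short workshop paper about: {idea}")
    for _ in range(rounds):
        critique = llm(f"List the main weaknesses of this draft:\n{draft}")
        draft = llm(f"Revise the draft to address these critiques.\n"
                    f"Critiques:\n{critique}\nDraft:\n{draft}")
    return draft
```

With a real model behind `llm`, raising `rounds` trades inference-time compute for (hopefully) higher draft quality, which is the kind of cheap improvement I have in mind.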
I think of how current systems work, at a high level, as something like the framing in Conditioning Predictive Models, so I don’t see any obvious reason to expect much more cajoling to be necessary, at least for e.g. GPT-5 or GPT-6 as the base LLM and for near-term scaffolding improvements. I could see worries about sandbagging potentially changing this, but I expect this not to be a concern in the human-level safety research regime, since we have data for imitation learning (at least to bootstrap).
I am more uncertain about more agent-foundations-style (vs. prosaic alignment) research, where the feedback signals are comparatively worse; though even here, I’d expect automated reviewing + some human-in-the-loop feedback to go pretty far. And one can also ask the automated researchers to ground the conceptual research in more (automatically) verifiable artifacts, e.g. experiments in toy environments or math proofs, the way we often ask this of human researchers too.
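As a toy illustration (my own, not from any of the linked work) of what a machine-checkable artifact attached to a conceptual claim could look like: take the claim ‘if two reward models both prefer outcome a to outcome b, then so does their average’, and check it by random search for counterexamples, the way a toy-environment experiment would:

```python
import random

# Toy, automatically verifiable artifact for a conceptual claim:
# "if two reward models both prefer outcome a to outcome b, then so
# does their average". We search randomly for a counterexample.

def average_preserves_agreement(trials=10_000, seed=0):
    rng = random.Random(seed)
    for _ in range(trials):
        r1a, r1b = rng.uniform(-1, 1), rng.uniform(-1, 1)  # reward model 1
        r2a, r2b = rng.uniform(-1, 1), rng.uniform(-1, 1)  # reward model 2
        if r1a > r1b and r2a > r2b:
            # The averaged model must agree with the unanimous preference.
            if not (r1a + r2a) / 2 > (r1b + r2b) / 2:
                return False
    return True
```

A human (or automated) reviewer can then check the artifact mechanically instead of adjudicating the prose.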
One thing that might change my mind would be if we had models at exactly human-researcher level for >24 months, without further capability improvement. With that much time, maybe sufficient experimentation with cajoling would get us something. But at human level, I don’t really expect it to be much more useful than all MATS graduates spending a year after being told to “solve alignment properly”. If that’s the research quality we get, then everyone will say “we need to make it smarter, it’s not making enough progress”. And then they’ll do that.
I also suspect we probably don’t need above-human-level research quality, because we can probably get huge amounts of automated human-level research, building on top of other automated human-level research. E.g. from What will GPT-2030 look like?: ‘GPT2030 can be copied arbitrarily and run in parallel. The organization that trains GPT2030 would have enough compute to run many parallel copies: I estimate enough to perform 1.8 million years of work when adjusted to human working speeds [range: 0.4M-10M years] (Section 3). Given the 5x speed-up in the previous point, this work could be done in 2.4 months.’ For example, I’m pretty optimistic about 1.8 million years of MATS-graduate-level work building on top of other MATS-graduate-level work. And I don’t think it’s obvious that we couldn’t also safely get that same amount of Paul-Christiano-level work (or pick your favourite alignment researcher, or plausibly any upper bound of human-level alignment research work from the training data, etc.).
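As a back-of-envelope sanity check on the quoted figures (note: the 1.8M-copies number is backed out from Steinhardt’s estimates here, not stated in the quote):

```python
# Back-of-envelope check of the quoted GPT-2030 figures.
# The number of parallel copies is implied by the quote, not stated in it.

human_years = 1.8e6   # total work, adjusted to human working speed
speedup = 5           # each copy runs ~5x faster than a human
copies = 1.8e6        # implied number of parallel copies

wall_clock_years = human_years / (copies * speedup)
wall_clock_months = wall_clock_years * 12  # ~2.4 months, matching the quote
```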
I think your comment illustrates my point. You’re describing current systems and their properties, then implying that these properties will stay the same as we push up the level of goal-directedness to human-level. But you’ve not made any comment about why the goal-directedness doesn’t affect all the nice tool-like properties.
don’t see any obvious reason to expect much more cajoling to be necessary
It’s the difference in levels of goal-directedness. That’s the reason.
For example, I’m pretty optimistic about 1.8 million years MATS-graduate-level work building on top of other MATS-graduate-level work
I’m not completely sure what happens when you try this. But there seem to be two main options. Either you’ve got a small civilization of goal-directed human-level agents, who have their own goals and need to be convinced to solve someone else’s problems. And then, to solve those problems, they need to be given freedom and time to learn and experiment, gaining sixty thousand lifetimes’ worth of skills along the way.
Or, you’ve got a large collection of not-quite-agents that aren’t really capable of directing research but will often complete a well-scoped task if given it by someone who understands its limitations. Now your bottleneck is human research leads (presumably doing agent foundations). That’s a rather small resource. So your speedup isn’t massive, it’s only moderate, and you’re on a time limit and didn’t put much effort into getting a head start.
You’re describing current systems and their properties, then implying that these properties will stay the same as we push up the level of goal-directedness to human-level. But you’ve not made any comment about why the goal-directedness doesn’t affect all the nice tool-like properties.
I think the goal-directedness framing has been unhelpful when it comes to predicting AI progress (especially LLM progress), and will probably keep being so at least in the near term; and plausibly net-negative when it comes to alignment research progress. E.g. where exactly would you place the goal-directedness in Sakana’s AI agent? If I really had to pick, I’d probably say something like ‘the system prompt’. But system prompts are pretty transparent, so as long as this is the case, it seems like we’ll be in ‘pretty easy’ worlds w.r.t. alignment. I still think control and other safety / alignment measures are important, but the fact that currently-shaped scaffolds are pretty transparent seems to me like a very important and often neglected point.
I’m not completely sure what happens when you try this. But there seem to be two main options. Either you’ve got a small civilization of goal-directed human-level agents, who have their own goals and need to be convinced to solve someone else’s problems. And then, to solve those problems, they need to be given freedom and time to learn and experiment, gaining sixty thousand lifetimes’ worth of skills along the way.
If by goal-directed you mean something like ‘context-independent goal-directedness’ (e.g. changing the system prompt doesn’t affect the behavior much), then this isn’t what I expect SOTA systems to look like, at least in the next 5 years.
Or, you’ve got a large collection of not-quite-agents that aren’t really capable of directing research but will often complete a well-scoped task if given it by someone who understands its limitations. Now your bottleneck is human research leads (presumably doing agent foundations). That’s a rather small resource. So your speedup isn’t massive, it’s only moderate, and you’re on a time limit and didn’t put much effort into getting a head start.
I am indeed at least somewhat worried about the humans in the loop being a potential bottleneck. But I expect their role to often look more like (AI-assisted) reviewing, rather than necessarily setting (detailed) research directions. Well-scoped tasks seem great, whenever they’re feasible, and indeed I expect this to be a factor in which tasks get automated differentially soon (together with, e.g. short task horizons or tasks requiring less compute—so that solutions can be iterated on more cheaply).
The Sakana AI stuff seems basically like vaporware: https://x.com/jimmykoppel/status/1828077203956850756
Strongly doubt the vaporware claim, having read the main text of the paper and some of the appendices. For some responses, see e.g. https://x.com/labenz/status/1828618276764541081 and https://x.com/RobertTLange/status/1829104906961093107.
I have done the same. I think Robert isn’t really responding to any critiques in the linked tweet thread, and I don’t think Nathan has thought that much about it. I could totally give you LW posts and papers of the Sakana AI quality, and they would be absolutely useless (which I know because I’ve spent like the last month working on getting intellectual labor out of AI systems).
I encourage you to try. I don’t think you would get any value out of running Sakana AI, and neither do you know anyone who would.
To be clear, I don’t expect the current Sakana AI to produce anything revolutionary, and even if it somehow did, it would probably be hard to separate it from all the other less-good stuff it would produce. But I was surprised that it’s even as good as this, even having seen many of the workflow components in other papers previously (I would have guessed that it would take better base models to reliably string together all the components). And I think it might already e.g. plausibly come up with some decent preference learning variants, like some previous Sakana research (though it wasn’t automating the entire research workflow).

So, given that I expect fast progress in the size of the base models (on top of the obvious possible improvements to the AI scientist, including by bringing in more stuff from other papers—e.g. following citation trails for ideas / novelty checks), improvements seem very likely.

Also, coding and math seem like the most relevant proxy abilities for automated ML research (and probably also for automated prosaic alignment), and, crucially, in these domains it’s much easier to generate (including superhuman-level) verifiable, synthetic training data—so that it’s hard to be confident models won’t get superhuman in these domains soon. So I expect the most important components of ML and prosaic alignment research workflows to probably be (broadly speaking, and especially on tasks with relatively good, cheap proxy feedback) at least human-level in the next 3 years, in line with e.g. some Metaculus/Manifold predictions on IMO or IOI performance.
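To illustrate the point about verifiable synthetic data in code / math with a deliberately trivial example of my own: the ground truth is computed rather than hand-labeled, so arbitrary amounts of automatically gradable training data can be generated:

```python
import random

# Deliberately trivial illustration of verifiable synthetic data for
# math: the ground truth is computed, not hand-labeled, so any model
# answer can be graded automatically, at arbitrary scale.

def make_problem(rng):
    a, b = rng.randint(2, 99), rng.randint(2, 99)
    return f"What is {a} * {b}?", a * b

def grade(model_answer, ground_truth):
    return model_answer == ground_truth
```

The same pattern (compute the answer, grade mechanically) is what makes coding and math unusually easy domains for generating superhuman-level training signal.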
Taking all the above into account, I expect many parts of prosaic alignment research, and of ML research (especially those with relatively short task horizons, requiring relatively little compute, and having decent proxies to measure performance), to be automatable soon (<= 3 years). I expect most of the work on improving Sakana-like systems to happen by default, performed by capabilities researchers, but it would be nice to have safety-motivated researchers start experimenting, or at least thinking about how (e.g. on which tasks) to use such systems. I’ve done some thinking already (around which safety tasks / subdomains might be most suitable) and hope to publish some of it soon; I might also start playing around with Sakana’s system.
I do expect things to be messier for generating more agent-foundations-type research (which I suspect might be closer to what you mean by ‘LW posts and papers’), because it seems harder to get reliable feedback on the quality of the research. But even there, I expect at the very least quite strong human augmentation to be possible (e.g. >= 5x acceleration), especially given that the automated reviewing part already seems pretty close to human-level, at least for ML papers.
Also, coding and math seem like the most relevant proxy abilities for automated ML research (and probably also for automated prosaic alignment), and, crucially, in these domains it’s much easier to generate (including superhuman-level) verifiable, synthetic training data—so that it’s hard to be confident models won’t get superhuman in these domains soon.

I think o1 is significant evidence in favor of this view.