I think your current view and the one reflected in ‘Without fundamental advances...’ are probably ‘wrong-er’ than your previous view.
Useful superhuman capabilities involve goal-directedness, in the sense that the algorithm must have some model of why certain actions lead to certain future outcomes. It must be choosing actions for a reason that is algorithmically downstream of the intended outcome. This is the only way to handle new obstacles and still succeed.
I suspect this framing (which, maybe uncharitably, seems to me very much like a typical MIRI-agent-foundations-meme) is either wrong or, in any case, not very useful. At least in some sense, what happens at the superhuman level seems somewhat irrelevant if we can e.g. extract a lot of human-level safety research work safely (e.g. enough to obsolete previous human-produced efforts). And I suspect we probably can, given scaffolds like https://sakana.ai/ai-scientist/ and their likely improvements (especially if done carefully, e.g. integrating something like Redwood’s control agenda, etc.). I’d be curious where you’d disagree (since I expect you probably would) - e.g. do you expect the AI scientists to become x-risky before they’re (roughly) human-level at safety research, or that they never scale to human-level, etc.?
The same argument does apply to human-level generality.
if we can e.g. extract a lot of human-level safety research work safely (e.g. enough to obsolete previous human-produced efforts).
This is the part I think is unlikely. I don’t really understand why people expect to be able to extract dramatically more safety research from AIs. It looks like it’s based on a naive extrapolation that doesn’t account for misalignment (setting aside AI-boxing plans). This doesn’t necessarily mean it becomes x-risky before human-level safety research. I’m just saying “should have goals, imprecisely specified” around the same time as it’s general enough to do research. So I expect it to be a pain to cajole this thing into doing as vaguely specified a task as “solve alignment properly”. There’s also the risk of escape & foom, but that’s secondary.
One thing that might change my mind would be if we had exactly human researcher level models for >24 months, without capability improvement. With this much time, maybe sufficient experimentation with cajoling would get us something. But, at human level, I don’t really expect it to be much more useful than all MATS graduates spending a year after being told to “solve alignment properly”. If this is what the research quality is, then everyone will say “we need to make it smarter, it’s not making enough progress”. Then they’ll do that.
I don’t really understand why people expect to be able to extract dramatically more safety research from AIs. It looks like it’s based on a naive extrapolation that doesn’t account for misalignment (setting aside AI-boxing plans). I’m just saying “should have goals, imprecisely specified” around the same time as it’s general enough to do research. So I expect it to be a pain to cajole this thing into doing as vaguely specified a task as “solve alignment properly”.
It seems to me that, until now at least, it’s been relatively easy to extract AI research out of LM agents (like in https://sakana.ai/ai-scientist/), with researchers (publicly, at least) only barely starting to try to extract research out of such systems. For now, not much cajoling seems to have been necessary, AFAICT—e.g. looking at their prompts, they seem pretty intuitive. Current outputs of https://sakana.ai/ai-scientist/ seem to me at least sometimes around workshop-paper level, and I expect even somewhat obvious improvements (e.g. more playing around with prompting, more inference time / reflection rounds, more GPU-time for experiments, etc.) to likely significantly increase the quality of the generated papers.
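To make concrete what I mean by a scaffold here, below is a minimal sketch of an AI-Scientist-style loop. The stage names, prompts, and the call_llm / run_experiments stubs are my own illustrative placeholders (not Sakana’s actual code or API); the point is just that each stage is an ordinary prompted LLM call plus some glue.

```python
# Minimal sketch of an AI-Scientist-style research scaffold (illustrative only).
# `call_llm` and `run_experiments` are hypothetical stubs standing in for a real
# LLM API and a real experiment harness.

def call_llm(prompt: str) -> str:
    """Placeholder for a call to a base LLM."""
    return f"<model output for: {prompt[:40]}...>"

def run_experiments(experiment_code: str) -> str:
    """Placeholder for executing generated experiment code and collecting logs."""
    return "<experiment logs and results>"

def ai_scientist_iteration(topic: str, prior_work: list[str]) -> dict:
    # 1. Ideation: propose candidate ideas conditioned on prior work.
    ideas = call_llm(f"Propose research ideas on {topic}, given prior work: {prior_work}")
    # 2. Novelty filtering: compare ideas against retrieved literature
    #    (the real system reportedly queries a paper-search API for this step).
    idea = call_llm(f"Pick the most novel idea relative to {prior_work}:\n{ideas}")
    # 3. Experimentation: draft code, run it, and reflect on the results.
    code = call_llm(f"Write experiment code to test: {idea}")
    results = run_experiments(code)
    notes = call_llm(f"Given results {results}, what did we learn? Revise if needed.")
    # 4. Write-up and automated review.
    paper = call_llm(f"Write a short paper for {idea}, results {results}, notes {notes}")
    review = call_llm(f"Review this paper as a strict conference reviewer: {paper}")
    return {"paper": paper, "review": review}
```

The “somewhat obvious improvements” above all slot into this loop without changing its structure: more reflection rounds in step 3, more GPU-time behind run_experiments, better retrieval in step 2, and so on.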
I think of how current systems work, at a high level, as something like the framing in Conditioning Predictive Models, so I don’t see any obvious reason to expect much more cajoling to be necessary, at least for e.g. GPT-5 or GPT-6 as the base LLM and for near-term scaffolding improvements. I could see worries about sandbagging potentially changing this, but I expect this not to be a concern in the human-level safety research regime, since we have data for imitation learning (at least to bootstrap).
I am more uncertain about the comparatively worse feedback signals for more agent-foundations (vs. prosaic alignment) kinds of research, though even here, I’d expect automated reviewing + some human-in-the-loop feedback to go pretty far. And one can also ask the automated researchers to ground the conceptual research in more (automatically) verifiable artifacts, e.g. experiments in toy environments or math proofs, the way we often ask this of human researchers too.
One thing that might change my mind would be if we had exactly human researcher level models for >24 months, without capability improvement. With this much time, maybe sufficient experimentation with cajoling would get us something. But, at human level, I don’t really expect it to be much more useful than all MATS graduates spending a year after being told to “solve alignment properly”. If this is what the research quality is, then everyone will say “we need to make it smarter, it’s not making enough progress”. Then they’ll do that.
I also suspect we probably don’t need above-human-level research quality, because we can probably get huge amounts of automated human-level research, building on top of other automated human-level research. E.g. from What will GPT-2030 look like?: ‘GPT2030 can be copied arbitrarily and run in parallel. The organization that trains GPT2030 would have enough compute to run many parallel copies: I estimate enough to perform 1.8 million years of work when adjusted to human working speeds [range: 0.4M-10M years] (Section 3). Given the 5x speed-up in the previous point, this work could be done in 2.4 months.’ For example, I’m pretty optimistic about 1.8 million years of MATS-graduate-level work building on top of other MATS-graduate-level work. And I don’t think it’s obvious that it wouldn’t be possible to also safely get that same amount of Paul-Christiano-level work (or pick your favourite alignment researcher, or plausibly any upper bound of human-level alignment research work from the training data, etc.).
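As a back-of-envelope check on the quoted numbers (my own arithmetic, not something from the post): at a 5x speed-up over 2.4 months of wall-clock time, each copy contributes roughly one human-year of work, so the estimate implicitly assumes on the order of 1.8 million parallel copies.

```python
# Back-of-envelope check of the GPT-2030 quote (my arithmetic, not the post's).
total_human_years = 1.8e6    # quoted estimate of human-speed-adjusted work
speedup = 5.0                # quoted speed-up relative to human working speed
wall_clock_years = 2.4 / 12  # 2.4 months of calendar time

human_years_per_copy = speedup * wall_clock_years            # = 1.0
implied_parallel_copies = total_human_years / human_years_per_copy
print(f"Implied parallel copies: {implied_parallel_copies:,.0f}")  # ~1,800,000
```

On these numbers, the claim is roughly that you get a couple of million parallel MATS-graduate-level workers for a few months, rather than any single superhuman researcher.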
I think your comment illustrates my point. You’re describing current systems and their properties, then implying that these properties will stay the same as we push up the level of goal-directedness to human-level. But you’ve not made any comment about why the goal-directedness doesn’t affect all the nice tool-like properties.
don’t see any obvious reason to expect much more cajoling to be necessary
It’s the difference in levels of goal-directedness. That’s the reason.
For example, I’m pretty optimistic about 1.8 million years of MATS-graduate-level work building on top of other MATS-graduate-level work
I’m not completely sure what happens when you try this. But there seem to be two main options. Either you’ve got a small civilization of goal-directed human-level agents, who have their own goals and need to be convinced to solve someone else’s problems. And then, to solve those problems, they need to be given freedom and time to learn and experiment, gaining sixty thousand lifetimes’ worth of skills along the way.
Or, you’ve got a large collection of not-quite-agents that aren’t really capable of directing research but will often complete a well-scoped task if given it by someone who understands its limitations. Now your bottleneck is human research leads (presumably doing agent foundations). That’s a rather small resource. So your speedup isn’t massive, it’s only moderate, and you’re on a time limit and didn’t put much effort into getting a head start.
You’re describing current systems and their properties, then implying that these properties will stay the same as we push up the level of goal-directedness to human-level. But you’ve not made any comment about why the goal-directedness doesn’t affect all the nice tool-like properties.
I think the goal-directedness framing has been unhelpful when it comes to predicting AI progress (especially LLM progress), and will probably keep being so at least in the near-term; and plausibly net-negative, when it comes to alignment research progress. E.g. where exactly would you place the goal-directedness in Sakana’s AI agent? If I really had to pick, I’d probably say something like ‘the system prompt’ - but those are pretty transparent, so as long as this is the case, it seems like we’ll be in ‘pretty easy’ worlds w.r.t. alignment. I still think something like control and other safety / alignment measures are important, but currently-shaped scaffolds being pretty transparent seems to me like a very important and often neglected point.
I’m not completely sure what happens when you try this. But there seem to be two main options. Either you’ve got a small civilization of goal-directed human-level agents, who have their own goals and need to be convinced to solve someone else’s problems. And then, to solve those problems, they need to be given freedom and time to learn and experiment, gaining sixty thousand lifetimes’ worth of skills along the way.
If by goal-directed you mean something like ‘context-independent goal-directedness’ (e.g. changing the system prompt doesn’t affect the behavior much), then this isn’t what I expect SOTA systems to look like, at least in the next 5 years.
Or, you’ve got a large collection of not-quite-agents that aren’t really capable of directing research but will often complete a well-scoped task if given it by someone who understands its limitations. Now your bottleneck is human research leads (presumably doing agent foundations). That’s a rather small resource. So your speedup isn’t massive, it’s only moderate, and you’re on a time limit and didn’t put much effort into getting a head start.
I am indeed at least somewhat worried about the humans in the loop being a potential bottleneck. But I expect their role to often look more like (AI-assisted) reviewing, rather than necessarily setting (detailed) research directions. Well-scoped tasks seem great, whenever they’re feasible, and indeed I expect this to be a factor in which tasks get automated differentially soon (together with, e.g. short task horizons or tasks requiring less compute—so that solutions can be iterated on more cheaply).
The Sakana AI stuff seems basically like vaporware: https://x.com/jimmykoppel/status/1828077203956850756
Strongly doubt the vaporware claim, having read the main text of the paper and some of the appendices. For some responses, see e.g. https://x.com/labenz/status/1828618276764541081 and https://x.com/RobertTLange/status/1829104906961093107.
I have done the same. I think Robert isn’t really responding to any critiques in the linked tweet thread, and I don’t think Nathan has thought that much about it. I could totally give you LW posts and papers of the Sakana AI quality, and they would be absolutely useless (which I know because I’ve spent like the last month working on getting intellectual labor out of AI systems).
I encourage you to try. I don’t think you would get any value out of running Sakana AI, and neither do you know anyone who would.
To be clear, I don’t expect the current Sakana AI to produce anything revolutionary, and even if it somehow did, it would probably be hard to separate it from all the other less-good stuff it would produce. But I was surprised that it’s even as good as this, even having seen many of the workflow components in other papers previously (I would have guessed that it would take better base models to reliably string together all the components). And I think it might already e.g. plausibly come up with some decent preference learning variants, like some previous Sakana research (though it wasn’t automating the entire research workflow). So, given that I expect fast progress in the size of the base models (on top of the obvious possible improvements to the AI scientist, including by bringing in more stuff from other papers—e.g. following citation trails for ideas / novelty checks), improvements seem very likely. Also, coding and math seem like the most relevant proxy abilities for automated ML research (and probably also for automated prosaic alignment), and, crucially, in these domains it’s much easier to generate (including superhuman-level) verifiable, synthetic training data—so that it’s hard to be confident models won’t get superhuman in these domains soon. So I expect the most important components of ML and prosaic alignment research workflows to probably be (broadly speaking, and especially on tasks with relatively good, cheap proxy feedback) at least human-level in the next 3 years, in line with e.g. some Metaculus/Manifold predictions on IMO or IOI performance.
Taking all the above into account, I expect many parts of prosaic alignment research, and of ML research (especially those with relatively short task horizons, requiring relatively little compute, and having decent proxies to measure performance), to be automatable soon (<= 3 years). I expect most of the work on improving Sakana-like systems to happen by default and be performed by capabilities researchers, but it would be nice to have safety-motivated researchers start experimenting, or at least thinking about how (e.g. on which tasks) to use such systems. I’ve done some thinking already (around which safety tasks/subdomains might be most suitable) and hope to publish some of it soon—and I might also start playing around with Sakana’s system.
I do expect things to be messier for generating more agent-foundations-type research (which I suspect might be closer to what you mean by ‘LW posts and papers’), because it seems harder to get reliable feedback on the quality of the research. But even there, I expect at the very least quite strong human augmentation to be possible (e.g. >= 5x acceleration), especially given that the automated reviewing part already seems pretty close to human-level, at least for ML papers.
Also, coding and math seem like the most relevant proxy abilities for automated ML research (and probably also for automated prosaic alignment), and, crucially, in these domains it’s much easier to generate (including superhuman-level) verifiable, synthetic training data—so that it’s hard to be confident models won’t get superhuman in these domains soon.
I think o1 is significant evidence in favor of this view.
And I suspect we probably can, given scaffolds like https://sakana.ai/ai-scientist/ and their likely improvements (especially if done carefully, e.g. integrating something like Redwood’s control agenda, etc.). I’d be curious where you’d disagree (since I expect you probably would) - e.g. do you expect the AI scientists to become x-risky before they’re (roughly) human-level at safety research, or that they never scale to human-level, etc.?
Jeremy’s response looks to me like it mostly addresses the first branch of your disjunction (AI becomes x-risky before reaching human-level capabilities), so let me address the second:
I am unimpressed by the output of the AI scientist. (To be clear, this is not the same thing as being unimpressed by the work put into it by its developers; it looks to me like they did a great job.) Mostly, however, the output looks to me basically like what I would have predicted, on my prior model of how scaffolding interacts with base models, which goes something like this:
A given model has some base distribution on the cognitive quality of its outputs, which is why resampling can sometimes produce better or worse responses to inputs. What scaffolding does is to essentially act as a more sophisticated form of sampling based on redundancy: having the model check its own output, respond to that output, etc. This can be very crudely viewed as an error correction process that drives down the probability that a “mistake” at some early token ends up propagating throughout the entirety of the scaffolding process and unduly influencing the output, which biases the quality distribution of outputs away from the lower tail and towards the upper tail.
The key moving piece on my model, however, is that all of this is still a function of the base distribution—a rough analogy here would be to best-of-n sampling. And the problem with best-of-n sampling, which looks to me like it carries over to more complicated scaffolding, is that as n increases, the mean of the resulting distribution increases as a sublinear (actually, logarithmic) function of n, while the variance decreases at a similar rate (but even this is misleading, since the resulting distribution will have negative skew, meaning variance decreases more rapidly in the upper tail than in the lower tail).
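As a toy illustration of the best-of-n point (a sketch under the simplifying assumption of a Gaussian base quality distribution; exactly how fast the mean grows depends on the tail shape):

```python
# Toy illustration: best-of-n sampling from a fixed base "quality" distribution.
# The mean of the best sample grows only sublinearly in n (for a standard normal,
# roughly like sqrt(2 * ln n)), while the spread shrinks.
import numpy as np

rng = np.random.default_rng(0)

def best_of_n_stats(n: int, trials: int = 5_000):
    # Draw `trials` batches of n samples and keep the best sample from each batch.
    best = rng.normal(size=(trials, n)).max(axis=1)
    return best.mean(), best.std()

for n in [1, 4, 16, 64, 256, 1024]:
    mean, std = best_of_n_stats(n)
    print(f"n={n:4d}  mean={mean:5.2f}  std={std:4.2f}")
# Quadrupling n buys roughly a constant increment in mean quality,
# not a proportional one -- the base distribution still sets the ceiling.
```

This is only an analogy for richer scaffolding, as the comment says, but it shows the qualitative behaviour: resampling-style tricks reshape the base distribution rather than escaping it.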
Anyway, the upshot of all of this is that scaffolding cannot elicit capabilities that were not already present (in some strong sense) in the base model—meaning, if the base models in question are strongly subhuman at something like scientific research (which it presently looks to me like they still are), scaffolding will not bridge that gap for them. The only thing that can close that gap without unreasonably large amounts of scaffolding, where “unreasonable” here means something a complexity theorist would consider unreasonable, is a shifted base distribution. And that corresponds to the kind of “useful [superhuman] capabilities” Jeremy is worried about.
Anyway, the upshot of all of this is that scaffolding cannot elicit capabilities that were not already present (in some strong sense) in the base model—meaning, if the base models in question are strongly subhuman at something like scientific research (which it presently looks to me like they still are), scaffolding will not bridge that gap for them. The only thing that can close that gap without unreasonably large amounts of scaffolding, where “unreasonable” here means something a complexity theorist would consider unreasonable, is a shifted base distribution. And that corresponds to the kind of “useful [superhuman] capabilities” Jeremy is worried about.
Strictly speaking, this seems very unlikely, since we know that e.g. CoT increases the expressive power of Transformers. And also intuitively, I expect, for example, that Sakana’s agent would be quite a bit worse without access to Semantic search for comparing idea novelty; and that it would probably be quite a bit better if it could e.g. retrieve embeddings of full paragraphs from papers, etc.
Ah, yeah, I can see how I might’ve been unclear there. I was implicitly taking CoT into account when I talked about the “base distribution” of the model’s outputs, as it’s essentially ubiquitous across these kinds of scaffolding projects. I agree that if you take a non-recurrent model’s O(1) output and equip it with a form of recurrent state that you permit to continue for O(n) iterations, that will produce a qualitatively different distribution of outputs than the O(1) distribution.
In that sense, I readily admit CoT into the class of improvements I earlier characterized as “shifted distribution”. I just don’t think this gets you very far in terms of the overarching problem, since the recurrent O(n) distribution is the one whose output I find unimpressive, and the method that was used to obtain it from the (even less impressive) O(1) distribution is a one-time trick.[1]
And also intuitively, I expect, for example, that Sakana’s agent would be quite a bit worse without access to Semantic search for comparing idea novelty; and that it would probably be quite a bit better if it could e.g. retrieve embeddings of full paragraphs from papers, etc.
I also agree that another way to obtain a higher quality output distribution is to load relevant context from elsewhere. This once more seems to me like something of a red herring when it comes to the overarching question of how to get an LLM to produce human- or superhuman-level research; you can load its context with research humans have already done, but this is again a one-time trick, and not one that seems like it would enable novel research built atop the human-written research unless the base model possesses a baseline level of creativity and insight, etc.[2]
If you don’t already share (or at least understand) a good chunk of my intuitions here, the above probably sounds at least a little like I’m carving out special exceptions: conceding each point individually, while maintaining that they bear little on my core thesis. To address that, let me attempt to put a finger on some of the core intuitions I’m bringing to the table:
On my model of (good) scientific research de novo, a lot of key cognitive work occurs during what you might call “generation” and “synthesis”, where “generation” involves coming up with hypotheses that merit testing, picking the most promising of those, and designing a robust experiment that sheds insight; “synthesis” then consists of interpreting the experimental results so as to figure out the right takeaway (which very rarely ought to look like “we confirmed/disconfirmed the starting hypothesis”).
Neither of these steps is easily transmissible, since they hinge very tightly on a given individual’s research ability and intellectual “taste”; and neither of them tends to end up very well described in the writeups and papers that are released afterwards. This is hard stuff even for very bright humans, which implies to me that it requires a very high quality of thought to manage consistently. And it’s these steps that I don’t think scaffolding can help much with; I think the model has to be smart enough, at baseline, that its landscape of cognitive reachability contains these kinds of insights, before they can be elicited via an external method like scaffolding.[3]
I’m not sure whether you could theoretically obtain greater benefits from allowing more than O(n) iterations, but either way you’d start to bump up against context window limitations fairly quickly.
Consider the extreme case where we prompt the model with (among other things) a fully fleshed out solution to the AI alignment problem, before asking it to propose a workable solution to the AI alignment problem; it seems clear enough that in this case, almost all of the relevant cognitive work happened before the model even received its prompt.
I’m uncertain-leaning-yes on the question of whether you can get to a sufficiently “smart” base model via mere continued scaling of parameter count and data size; but that connects back to the original topic of whether said “smart” model would need to be capable of goal-directed thinking, on which I think I agree with Jeremy that it would; much of my model of good de novo research, described above, seems to me to draw on the same capabilities that characterize general-purpose goal-direction.
I suspect we probably have quite differing intuitions about what research processes/workflows tend to look like.
In my view, almost all research looks roughly like iterative improvement on top of existing literature(s), or like literature-based discovery: combining already-existing concepts, often in pretty obvious ways (at least in retrospect). This probably applies even more to ML research, and quite significantly to prosaic safety research too. Even the more innovative kind of research, I think, often tends to look like combining existing concepts, just at a higher level of abstraction, or from more distant/less-obviously-related fields. Almost zero research is properly de novo (not based on any existing—including multidisciplinary—literatures). (I might be biased, though, by my own research experience and taste, which draw very heavily on existing literatures.)
If this view is right, then LM agents might soon have an advantage even in the ideation stage, since they can do massive (e.g. semantic) retrieval at scale, much cheaper and faster than humans; and they might already have much longer short-term-memory equivalents (context windows). I suspect this might compensate a lot for them likely being worse at research taste (e.g. I’d suspect they’d still be worse if they could only test a very small number of ideas), especially when there are decent proxy signals, the iteration time is short, and they can make a lot of tries cheaply; and I’d argue that a lot of prosaic safety research does seem to fall into this category. Even when it comes to the base models themselves, I’m unsure how much worse they are at this point (though I do think they are worse than the best researchers, at least). I often find Claude-3.5 to be very decent at (though maybe somewhat vaguely) combining a couple of different ideas from 2 or 3 papers, as long as they’re all in its context; while being very unlikely to be x-risky, since it’s sub-ASL-3, and very unlikely to be scheming, because it’s bad at prerequisites like situational awareness, etc.