After finishing MATS 2, I believed this:
“RL optimization is what makes an LLM potentially dangerous. LLMs by themselves are just simulators, and therefore are not likely to become a misaligned intelligent agent. Therefore, a reasonable alignment strategy is to use LLMs (and maybe small amounts of finetuning) to build useful superhuman helpers. Small quantities of finetuning won’t shift the LLM very far from being a simulator (which is safe), so it’ll probably still be safe.”
I even said this in a public presentation to people learning about alignment. Ahh.
I now think this was wrong. I was misled by ambient memes of the time and also made the mistake of trying too hard to update on the current top-performing technology. More recently, after understanding why this was wrong, I cowrote this post in the hope that it would be a good reference for why this belief was wrong, but I think it ended up trying to do too many other things. So here’s a briefer explanation:
The main problem was that I didn’t correctly condition on useful superhuman capability. Useful superhuman capabilities involve goal-directedness, in the sense that the algorithm must have some model of why certain actions lead to certain future outcomes. It must be choosing actions for a reason algorithmically downstream of the intended outcome. This is the only way to handle new obstacles and still succeed.
My reasoning was that since LLMs don’t seem to contain this sort of algorithm, and yet are still useful, we can leverage that usefulness without danger of misalignment. This was pretty much correct. Without goals, there are no goals to be misaligned. It’s what we do today. The mistake was that I thought this would keep being true for future-LLMs-hypothetically-capable-of-research. I didn’t grok that goal-directedness at some level was necessary to cross the gap between LLM capabilities and research capability.
My second mistake was thinking that danger was related to the quantity of RL finetuning. I muddled up agency/goal-directedness with danger, and was also wrong that RL is more likely to produce agency/goal-directedness, conditioned on high capability. It’s a natural mistake, since stereotypical RL training is designed to incentivize goal-directedness. But if we condition on high capability, it wipes out that connection, because we already know the algorithm has to contain some goal-directedness.
I was also wrong that LLMs should be thought of as simulators (although it’s a useful frame sometimes). There was a correct grain of truth in the idea that simulators would be safe. It would be great if we could train actual people-simulators. If we could build a real algorithm-level simulator of a person, this would of course be aligned (it would have the goals of the person simulated). But the way current LLMs are being built, and the way future systems will be built, they aren’t even vaguely trying to make extremely-robustly-generalizing people-simulators.[1] And they won’t, because it would involve a massive tradeoff with competence.
[1] And the level of OOD generalization required to remain a faithful simulation during online learning and reflection is intuitively quite high.
Distinguish two notions of “goal-directedness”:
The system has a fixed goal that it capably works towards across all contexts.
The system is able to capably work towards goals, but which goals it pursues, if any, may depend on the context.
My sense is that a high level of capability implies (2) but not (1). And that (1) is way more obviously dangerous. Do you disagree?
My sense is that a high level of capability implies (2) but not (1).
Sure, kinda. But (2) is an unstable state. There’s at least some pressure toward (1) both during training and during online activity. This makes (1) very likely eventually, although it’s less clear exactly when.
A human that gets distracted and pursues ice cream whenever they see ice cream is less competent at other things, and will notice this and attempt to correct it within themselves if possible. A person that doesn’t pick up free money on Tuesdays because Tuesday is I-don’t-care-about-money-day will be annoyed about this on Wednesday, and attempt to correct it in future.
Competent research requires at least some long-term goals. These will provide an incentive for any context-dependent goals to combine or be removed. (Although the strength of this incentive is of course different for different cases of inconsistency, and the difficulty of removing inconsistency is unclear to me; it seems to depend a lot on the specifics.)
And that (1) is way more obviously dangerous
This seems true to me overall, but the only reason is that (1) is more capable of competently pursuing long-term plans. Since we’re conditioning on that capability anyway, I would expect everything on the spectrum between (1) and (2) to be potentially dangerous.
If we all die because an AI put super-human amounts of optimization pressure into some goal incompatible with human survival (i.e., almost any goal if the optimization pressure is high enough), it does not matter whether the AI would have had some other goal in some other context.
But superhuman capability doesn’t seem to imply “applies all the optimisation pressure it can towards a goal”.
Like, being crazily good at research projects may require the ability to do goal-directed cognition. It doesn’t seem to require the habit of monomaniacally optimising the universe towards a goal.
I think whether or not a crazy good research AI is a monomaniacal universe optimiser probably depends on what kind of AI it is.
The whole approach is pretty hopeless IMHO: I mean the approach of “well, the AI will be wicked smart, but we’ll just make it so that it doesn’t want anything particularly badly or so that what it wants tomorrow will be different from what it wants today”.
It seems fairly certain to me that having a superhuman ability to do things that humans want to be done entails applying strong optimization pressure onto reality—pressure that persists as long as the AI is able to make it persist—forever, ideally, from the point of view of the AI. The two are not separate things like you hope they are. Either the AI is wicked good at steering reality towards a goal or not. If it is wicked good, then either its goal is compatible with continued human survival or not, and if not, we are all dead. If it is not wicked good at steering reality, then no one is going to be able to figure out how to use it to align an AI such that it stays aligned once it is much smarter than us.
I subscribe to MIRI’s current position that most of the hope for continued human survival comes from the (slim) hope that no one builds super-humanly smart AI until there are AI researchers that are significantly smarter and wiser than the current generation of AI designers (which will probably take centuries unless it proves much easier to employ technology to improve human cognition than most people think it is).
But what hope I have for alignment research done by currently-living people comes mostly from the hope that someone will figure out how to make an ASI that genuinely wants the same things that we want—like Eliezer has been saying since 2006 or so.
An entity could have the ability to apply such strong optimization pressures onto reality, yet decide not to.
Such an entity would be useless to us IMHO.
Surely there exists a non-useless and non-world-destroying amount of optimization pressure?
By “non-world-destroying”, I assume you mean “non-humanity ending”.
Well, yeah, if there were a way to keep AI models to roughly human capabilities that would be great because they would be unlikely to end humanity and because we could use them to do useful work with less expense (particularly, less energy expense and less CO2 emissions) than the expense of employing people.
But do you know of a safe way of making sure that, e.g., OpenAI’s next major training run will result in a model that is at most roughly human-level in every capability that can be used to end humanity or to put and to keep humanity in a situation that humanity would not want? I sure don’t—even if OpenAI were completely honest and cooperative with us.
The qualifier “safe” is present in the above paragraph / sentence because giving the model access to the internet (or to gullible people or to a compute farm where it can run any program it wants) then seeing what happens is only safe if we assume the thing to be proved, namely, that the model is not capable enough to impose its will on humanity.
But yeah, it is a source of hope (which I didn’t mention when I wrote, “what hope I have . . . comes mostly from the hope that someone will figure out how to make an ASI that genuinely wants the same things that we want”) that someone will develop a method to keep AI capabilities to roughly human level and all labs actually use the method and focus on making the human-level AIs more efficient in resource consumption even during a great-powers war or an arms race between great powers.
I’d be more hopeful if I had ever seen a paper or a blog post by a researcher trying to devise such a method.
For completeness’s sake, let’s also point out that we could ban large training runs worldwide now; then the labs could concentrate on running the models they have now more efficiently, and that would be safe (not completely safe, but much much safer than any future timeline we can realistically hope for) and would allow us to derive some of the benefits of the technology.
I do not know of such a way. I find it unlikely that OpenAI’s next training run will result in a model that could end humanity, but I can provide no guarantees about that.
You seem to be assuming that all models above a certain threshold of capabilities will either exercise strong optimization pressure on the world in pursuit of goals, or will be useless. Put another way, you seem to be conflating capabilities with actually exerted world-optimization pressures.
While I agree that given a wide enough deployment it is likely that a given model will end up exercising its capabilities pretty much to their fullest extent, I hold that it is in principle possible to construct a mind that desires to help and is able to do so, yet also deliberately refrains from applying too much pressure.
I have encountered many people with the (according to me) mistaken model that you describe in your self-quote, and am glad to see this writeup. Indeed, I think the simulators frame frustratingly causes people to make this kind of update, which I think then causes people to get pretty confused about RL (and also to imagine some cartesian difference between next-token-prediction reward and long-term-agency reward, when the difference is actually purely a matter of degree of myopia).
Why wouldn’t myopic bias make it more likely to simulate than predict? And doesn’t empirical evidence about LLMs support the simulators frame? Like, what observations persuaded you that we are not living in a world where LLMs are simulators?
I don’t think there is any reason to assume the system is likely to choose “simulation” over “prediction”? And I don’t think we’ve observed any evidence of that.
The thing that is true, which I do think matters, is that if you train your AI system on only doing short single forward-passes, then it is less likely to get good at performing long chains of thought, since you never directly train it to do that (instead hoping that the single-step training generalizes to long chains of thought). That is the central safety property we currently rely on and pushes things to be a bit more simulator-like.
And I don’t think we’ve observed any evidence of that.
What about any time a system generalizes favourably, instead of predicting errors? You can say it’s just a failure of prediction, but it’s not like these failures are random.
That is the central safety property we currently rely on and pushes things to be a bit more simulator-like.
And what is the evidence that this property, rather than, for example, the inherent bias of NNs, is the central one? Why wouldn’t a predictor exhibit more malign goal-directedness even for short-term goals?
I can see that this whole story about modeling LLMs as predictors, and goal-directedness, and fundamental laws of cognition is logically coherent. But where is the connection to reality?
What about any time a system generalizes favourably, instead of predicting errors? You can say it’s just a failure of prediction, but it’s not like these failures are random.
I don’t understand, how is “not predicting errors” either a thing we have observed, or something that has anything to do with simulation?
Yeah, I really don’t know what you are saying here. Like, if you prompt a completion model with badly written text, it will predict badly written text. But also, if you train a completion model on data where a very weak hash is followed by its pre-image, it will probably have learned to undo the hash, even though the source generation process never performed that (potentially much more complicated than the hashing function itself) operation, which means it’s not really a simulator.
But also, if you train a completion model on data where a very weak hash is followed by its pre-image, it will probably have learned to undo the hash, even though the source generation process never performed that (potentially much more complicated than the hashing function itself) operation, which means it’s not really a simulator.
I’m saying that this won’t work with current systems, at least for a strong hash, because it’s hard, and instead of learning to undo it, the model will learn to simulate, because it’s easier. And then you can vary the strength of the hash to measure the degree of predictorness/simulatorness and compare it with what you expect. Or do a similar thing with something other than a hash, that also distinguishes the two frames.
The point is that without experiments like these, how have you come to believe in the predictor frame?
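Concretely, a minimal sketch of the kind of setup I have in mind (the “weak hash” construction, names, and parameters below are all illustrative assumptions, not an existing benchmark):

```python
# Toy sketch of the proposed experiment: build "hash -> pre-image" training strings
# with a tunable hash strength, so one can check whether a completion model learns
# to invert the weak hash (predictor-like) or only to imitate the surface
# distribution of such strings (simulator-like). Everything here is illustrative.
import hashlib
import random

def weak_hash(s: str, strength_bits: int) -> str:
    """Truncate SHA-256 to `strength_bits` bits; fewer bits = weaker, easier to invert."""
    digest = int.from_bytes(hashlib.sha256(s.encode()).digest(), "big")
    return format(digest >> (256 - strength_bits), f"0{(strength_bits + 3) // 4}x")

def make_examples(n: int, strength_bits: int, seed: int = 0) -> list[str]:
    rng = random.Random(seed)
    examples = []
    for _ in range(n):
        pre_image = "".join(rng.choices("abcdefgh", k=6))
        examples.append(f"{weak_hash(pre_image, strength_bits)} -> {pre_image}")
    return examples

# Vary strength_bits (e.g. 8, 16, 32, ...) and measure how well a model trained on
# such strings predicts the pre-image from an unseen hash.
print("\n".join(make_examples(3, strength_bits=16)))
```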
I don’t understand, how is “not predicting errors” either a thing we have observed, or something that has anything to do with simulation?
I guess it is less about simulation being the right frame and more about prediction being the wrong one. But I think we have definitely observed LLMs mispredicting things we wouldn’t want them to predict. Or is this actually a crux and you haven’t seen any evidence at all against the predictor frame?
You can’t learn to simulate an undo of a hash, or at least I have no idea what you are “simulating” and why that would be “easier”. You are certainly not simulating the generation of the hash; going token by token forwards, you don’t have access to the pre-image at that point.
Of course the reason why sometimes hashes are followed by their pre-image in the training set is because they were generated in the opposite order and then simply pasted in hash->pre-image order.
I’ve seen LLMs generating text backwards. Theoretically, an LLM can keep the pre-image in its activations, calculate the hash, and then output them in the order hash, pre-image.
I think your current view and the one reflected in ‘Without fundamental advances...’ are probably ‘wrong-er’ than your previous view.
Useful superhuman capabilities involve goal-directedness, in the sense that the algorithm must have some model of why certain actions lead to certain future outcomes. It must be choosing actions for a reason algorithmically downstream of the intended outcome. This is the only way to handle new obstacles and still succeed.
I suspect this framing (which, maybe uncharitably, seems to me very much like a typical MIRI-agent-foundations-meme) is either wrong or, in any case, not very useful. At least in some sense, what happens at the superhuman level seems somewhat irrelevant if we can e.g. extract a lot of human-level safety research work safely (e.g. enough to obsolete previous human-produced efforts). And I suspect we probably can, given scaffolds like https://sakana.ai/ai-scientist/ and its likely improvements (especially if done carefully, e.g. integrating something like Redwood’s control agenda, etc.). I’d be curious where you’d disagree (since I expect you probably would) - e.g. do you expect the AI scientists become x-risky before they’re (roughly) human-level at safety research, or they never scale to human-level, etc.?
The same argument does apply to human-level generality.
if we can e.g. extract a lot of human-level safety research work safely (e.g. enough to obsolete previous human-produced efforts).
This is the part I think is unlikely. I don’t really understand why people expect to be able to extract dramatically more safety research from AIs. It looks like it’s based on a naive extrapolation that doesn’t account for misalignment (setting aside AI-boxing plans). This doesn’t necessarily imply x-risky before human-level safety research. I’m just saying “should have goals, imprecisely specified” around the same time as it’s general enough to do research. So I expect it to be a pain to cajole this thing into doing as vaguely specified a task as “solve alignment properly”. There is also the risk of escape & foom, but that’s secondary.
One thing that might change my mind would be if we had exactly human researcher level models for >24 months, without capability improvement. With this much time, maybe sufficient experimentation with cajoling would get us something. But, at human level, I don’t really expect it to be much more useful than all MATS graduates spending a year after being told to “solve alignment properly”. If this is what the research quality is, then everyone will say “we need to make it smarter, it’s not making enough progress”. Then they’ll do that.
I don’t really understand why people expect to be able to extract dramatically more safety research from AIs. It looks like it’s based on a naive extrapolation that doesn’t account for misalignment (setting aside AI-boxing plans). I’m just saying “should have goals, imprecisely specified” around the same time as it’s general enough to do research. So I expect it to be a pain to cajole this thing into doing as vaguely specified a task as “solve alignment properly”.
It seems to me that, until now at least, it’s been relatively easy to extract AI research out of LM agents (like in https://sakana.ai/ai-scientist/), with researchers (publicly, at least) only barely starting to try to extract research out of such systems. For now, not much cajoling seems to have been necessary, AFAICT—e.g. looking at their prompts, they seem pretty intuitive. Current outputs of https://sakana.ai/ai-scientist/ seem to me at least sometimes around workshop-paper level, and I expect even somewhat obvious improvements (e.g. more playing around with prompting, more inference time / reflection rounds, more GPU-time for experiments, etc.) to likely significantly increase the quality of the generated papers.
I think of how current systems work, at a high level, as something like the framing in Conditioning Predictive Models, so don’t see any obvious reason to expect much more cajoling to be necessary, at least for e.g. GPT-5 or GPT-6 as the base LLM and for near-term scaffolding improvements. I could see worries about sandbagging potentially changing this, but I expect this not to be a concern in the human-level safety research regime, since we have data for imitation learning (at least to bootstrap).
I am more uncertain about the comparatively worse feedback signals for more agent-foundations (vs. prosaic alignment) kinds of research, though even here, I’d expect automated reviewing + some human-in-the-loop feedback to go pretty far. And one can also ask the automated researchers to ground the conceptual research in more (automatically) verifiable artifacts, e.g. experiments in toy environments or math proofs, the way we often ask this of human researchers too.
One thing that might change my mind would be if we had exactly human researcher level models for >24 months, without capability improvement. With this much time, maybe sufficient experimentation with cajoling would get us something. But, at human level, I don’t really expect it to be much more useful than all MATS graduates spending a year after being told to “solve alignment properly”. If this is what the research quality is, then everyone will say “we need to make it smarter, it’s not making enough progress”. Then they’ll do that.
I also suspect we probably don’t need above-human-level in terms of research quality, because we can probably get huge amounts of automated human-level research, building on top of other automated human-level research. E.g. from What will GPT-2030 look like?: ‘GPT2030 can be copied arbitrarily and run in parallel. The organization that trains GPT2030 would have enough compute to run many parallel copies: I estimate enough to perform 1.8 million years of work when adjusted to human working speeds [range: 0.4M-10M years] (Section 3). Given the 5x speed-up in the previous point, this work could be done in 2.4 months.’ For example, I’m pretty optimistic about 1.8 million years of MATS-graduate-level work building on top of other MATS-graduate-level work. And also, I don’t think it’s obvious that it wouldn’t be possible to also safely get that same amount of Paul-Christiano-level work (or pick your favourite alignment researcher, or plausibly any upper bound of human-level alignment research work from the training data, etc.).
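As a rough consistency check of those quoted numbers (the 1.8M-years and 5x figures are from the quoted post; the implied number of parallel copies is my own back-of-envelope inference):

```python
# Back-of-envelope check of the figures quoted from "What will GPT-2030 look like?".
human_years_of_work = 1.8e6   # quoted estimate (quoted range: 0.4M-10M years)
speedup = 5                   # quoted speed-up over human working speed
wall_clock_months = 2.4       # quoted wall-clock time

# Each copy running for 2.4 wall-clock months at 5x human speed does
# 2.4 / 12 * 5 = 1.0 human-years of work, so ~1.8 million copies are implied.
human_years_per_copy = wall_clock_months / 12 * speedup
copies_needed = human_years_of_work / human_years_per_copy
print(f"Implied number of parallel copies: {copies_needed:,.0f}")
```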
I think your comment illustrates my point. You’re describing current systems and their properties, then implying that these properties will stay the same as we push up the level of goal-directedness to human-level. But you’ve not made any comment about why the goal-directedness doesn’t affect all the nice tool-like properties.
don’t see any obvious reason to expect much more cajoling to be necessary
It’s the difference in levels of goal-directedness. That’s the reason.
For example, I’m pretty optimistic about 1.8 million years of MATS-graduate-level work building on top of other MATS-graduate-level work
I’m not completely sure what happens when you try this. But there seem to be two main options. Either you’ve got a small civilization of goal-directed human-level agents, who have their own goals and need to be convinced to solve someone else’s problems. And then, to solve those problems, they need to be given freedom and time to learn and experiment, gaining sixty thousand lifetimes’ worth of skills along the way.
Or, you’ve got a large collection of not-quite-agents that aren’t really capable of directing research but will often complete a well-scoped task if given it by someone who understands its limitations. Now your bottleneck is human research leads (presumably doing agent foundations). That’s a rather small resource. So your speedup isn’t massive, it’s only moderate, and you’re on a time limit and didn’t put much effort into getting a head start.
You’re describing current systems and their properties, then implying that these properties will stay the same as we push up the level of goal-directedness to human-level. But you’ve not made any comment about why the goal-directedness doesn’t affect all the nice tool-like properties.
I think the goal-directedness framing has been unhelpful when it comes to predicting AI progress (especially LLM progress), and will probably keep being so at least in the near-term; and plausibly net-negative, when it comes to alignment research progress. E.g. where exactly would you place the goal-directedness in Sakana’s AI agent? If I really had to pick, I’d probably say something like ‘the system prompt’ - but those are pretty transparent, so as long as this is the case, it seems like we’ll be in ‘pretty easy’ worlds w.r.t. alignment. I still think something like control and other safety / alignment measures are important, but currently-shaped scaffolds being pretty transparent seems to me like a very important and often neglected point.
I’m not completely sure what happens when you try this. But there seem to be two main options. Either you’ve got a small civilization of goal-directed human-level agents, who have their own goals and need to be convinced to solve someone else’s problems. And then, to solve those problems, they need to be given freedom and time to learn and experiment, gaining sixty thousand lifetimes’ worth of skills along the way.
If by goal-directed you mean something like ‘context-independent goal-directedness’ (e.g. changing the system prompt doesn’t affect the behavior much), then this isn’t what I expect SOTA systems to look like, at least in the next 5 years.
Or, you’ve got a large collection of not-quite-agents that aren’t really capable of directing research but will often complete a well-scoped task if given it by someone who understands its limitations. Now your bottleneck is human research leads (presumably doing agent foundations). That’s a rather small resource. So your speedup isn’t massive, it’s only moderate, and you’re on a time limit and didn’t put much effort into getting a head start.
I am indeed at least somewhat worried about the humans in the loop being a potential bottleneck. But I expect their role to often look more like (AI-assisted) reviewing, rather than necessarily setting (detailed) research directions. Well-scoped tasks seem great, whenever they’re feasible, and indeed I expect this to be a factor in which tasks get automated differentially soon (together with, e.g. short task horizons or tasks requiring less compute—so that solutions can be iterated on more cheaply).
I have done the same. I think Robert isn’t really responding to any critiques in the linked tweet thread, and I don’t think Nathan has thought that much about it. I could totally give you LW posts and papers of the Sakana AI quality, and they would be absolutely useless (which I know because I’ve spent like the last month working on getting intellectual labor out of AI systems).
I encourage you to try. I don’t think you would get any value out of running Sakana AI, and neither do you know anyone who would.
To be clear, I don’t expect the current Sakana AI to produce anything revolutionary, and even if it somehow did, it would probably be hard to separate it from all the other less-good stuff it would produce. But I was surprised that it’s even as good as this, even having seen many of the workflow components in other papers previously (I would have guessed that it would take better base models to reliably string together all the components). And I think it might already e.g. plausibly come up with some decent preference learning variants, like some previous Sakana research (though it wasn’t automating the entire research workflow). So, given that I expect fast progress in the size of the base models (on top of the obvious possible improvements to the AI scientist, including by bringing in more stuff from other papers—e.g. following citation trails for ideas / novelty checks), improvements seem very likely. Also, coding and math seem like the most relevant proxy abilities for automated ML research (and probably also for automated prosaic alignment), and, crucially, in these domains it’s much easier to generate (including superhuman-level) verifiable, synthetic training data—so that it’s hard to be confident models won’t get superhuman in these domains soon. So I expect the most important components of ML and prosaic alignment research workflows to probably be (broadly speaking, and especially on tasks with relatively good, cheap proxy feedback) at least human-level in the next 3 years, in line with e.g. some Metaculus/Manifold predictions on IMO or IOI performance.
Taking all the above into account, I expect many parts of prosaic alignment research—and of ML research - (especially those with relatively short task horizons, requiring relatively little compute, and having decent proxies to measure performance) to be automatable soon (<= 3 years). I expect most of the work on improving Sakana-like systems to happen by default and be performed by capabilities researchers, but it would be nice to have safety-motivated researchers start experimenting, or at least thinking about how (e.g. on which tasks) to use such systems. I’ve done some thinking already (around which safety tasks/subdomains might be most suitable) and hope to publish some of it soon—and I might also start playing around with Sakana’s system.
I do expect things to be messier for generating more agent-foundations-type research (which I suspect might be closer to what you mean by ‘LW posts and papers’) - because it seems harder to get reliable feedback on the quality of the research, but even there, I expect at the very least quite strong human augmentation to be possible (e.g. >= 5x acceleration) - especially given that the automated reviewing part seems already pretty close to human-level, at least for ML papers.
Also, coding and math seem like the most relevant proxy abilities for automated ML research (and probably also for automated prosaic alignment), and, crucially, in these domains it’s much easier to generate (including superhuman-level) verifiable, synthetic training data—so that it’s hard to be confident models won’t get superhuman in these domains soon.
I think o1 is significant evidence in favor of this view.
And I suspect we probably can, given scaffolds like https://sakana.ai/ai-scientist/ and its likely improvements (especially if done carefully, e.g. integrating something like Redwood’s control agenda, etc.). I’d be curious where you’d disagree (since I expect you probably would) - e.g. do you expect the AI scientists become x-risky before they’re (roughly) human-level at safety research, or they never scale to human-level, etc.?
Jeremy’s response looks to me like it mostly addresses the first branch of your disjunction (AI becomes x-risky before reaching human-level capabilities), so let me address the second:
I am unimpressed by the output of the AI scientist. (To be clear, this is not the same thing as being unimpressed by the work put into it by its developers; it looks to me like they did a great job.) Mostly, however, the output looks to me basically like what I would have predicted, on my prior model of how scaffolding interacts with base models, which goes something like this:
A given model has some base distribution on the cognitive quality of its outputs, which is why resampling can sometimes produce better or worse responses to inputs. What scaffolding does is to essentially act as a more sophisticated form of sampling based on redundancy: having the model check its own output, respond to that output, etc. This can be very crudely viewed as an error correction process that drives down the probability that a “mistake” at some early token ends up propagating throughout the entirety of the scaffolding process and unduly influencing the output, which biases the quality distribution of outputs away from the lower tail and towards the upper tail.
The key moving piece on my model, however, is that all of this is still a function of the base distribution—a rough analogy here would be to best-of-n sampling. And the problem with best-of-n sampling, which looks to me like it carries over to more complicated scaffolding, is that as n increases, the mean of the resulting distribution increases as a sublinear (actually, logarithmic) function of n, while the variance decreases at a similar rate (but even this is misleading, since the resulting distribution will have negative skew, meaning variance decreases more rapidly in the upper tail than in the lower tail).
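A toy simulation of this scaling (the Gaussian “quality” distribution is an arbitrary assumption, chosen only to show the qualitative effect; the exact growth rate depends on the tail of the base distribution):

```python
# Best-of-n sampling from a fixed "quality" distribution: the mean of the best
# sample grows slowly (sublinearly) in n, while its spread shrinks.
import numpy as np

rng = np.random.default_rng(0)

def best_of_n(n: int, trials: int = 20_000):
    """Mean and std of the best of n draws from an assumed standard-normal quality score."""
    best = rng.normal(size=(trials, n)).max(axis=1)
    return best.mean(), best.std()

for n in [1, 2, 4, 8, 16, 64, 256, 1024]:
    mean, std = best_of_n(n)
    print(f"n={n:5d}  best-of-n mean={mean:4.2f}  std={std:4.2f}")
```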
Anyway, the upshot of all of this is that scaffolding cannot elicit capabilities that were not already present (in some strong sense) in the base model—meaning, if the base models in question are strongly subhuman at something like scientific research (which it presently looks to me like they still are), scaffolding will not bridge that gap for them. The only thing that can close that gap without unreasonably large amounts of scaffolding, where “unreasonable” here means something a complexity theorist would consider unreasonable, is a shifted base distribution. And that corresponds to the kind of “useful [superhuman] capabilities” Jeremy is worried about.
Anyway, the upshot of all of this is that scaffolding cannot elicit capabilities that were not already present (in some strong sense) in the base model—meaning, if the base models in question are strongly subhuman at something like scientific research (which it presently looks to me like they still are), scaffolding will not bridge that gap for them. The only thing that can close that gap without unreasonably large amounts of scaffolding, where “unreasonable” here means something a complexity theorist would consider unreasonable, is a shifted base distribution. And that corresponds to the kind of “useful [superhuman] capabilities” Jeremy is worried about.
Strictly speaking, this seems very unlikely, since we know that e.g. CoT increases the expressive power of Transformers. And also intuitively, I expect, for example, that Sakana’s agent would be quite a bit worse without access to Semantic search for comparing idea novelty; and that it would probably be quite a bit better if it could e.g. retrieve embeddings of full paragraphs from papers, etc.
Ah, yeah, I can see how I might’ve been unclear there. I was implicitly taking CoT into account when I talked about the “base distribution” of the model’s outputs, as it’s essentially ubiquitous across these kinds of scaffolding projects. I agree that if you take a non-recurrent model’s O(1) output and equip it with a form of recurrent state that you permit to continue for O(n) iterations, that will produce a qualitatively different distribution of outputs than the O(1) distribution.
In that sense, I readily admit CoT into the class of improvements I earlier characterized as “shifted distribution”. I just don’t think this gets you very far in terms of the overarching problem, since the recurrent O(n) distribution is the one whose output I find unimpressive, and the method that was used to obtain it from the (even less impressive) O(1) distribution is a one-time trick.[1]
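As a toy, non-transformer illustration of that O(1) vs O(n) distinction (purely schematic: the “forward pass” below is just a fixed constant-size function whose output is fed back in as state, the way a chain of thought lets a model carry state forward):

```python
# A single fixed-size step cannot fold an arbitrarily long input on its own, but
# chaining the same step n times, carrying its previous output as state, can
# compute things like the running parity of a long bit-string.
bits = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]

def step(state: int, bit: int) -> int:
    """One constant-time 'pass': fold a single bit into the carried state."""
    return state ^ bit

state = 0
for b in bits:          # O(n) chained applications of the same O(1) step
    state = step(state, b)
print("parity via n chained steps:", state)
```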
And also intuitively, I expect, for example, that Sakana’s agent would be quite a bit worse without access to Semantic search for comparing idea novelty; and that it would probably be quite a bit better if it could e.g. retrieve embeddings of full paragraphs from papers, etc.
I also agree that another way to obtain a higher quality output distribution is to load relevant context from elsewhere. This once more seems to me like something of a red herring when it comes to the overarching question of how to get an LLM to produce human- or superhuman-level research; you can load its context with research humans have already done, but this is again a one-time trick, and not one that seems like it would enable novel research built atop the human-written research unless the base model possesses a baseline level of creativity and insight, etc.[2]
If you don’t already share (or at least understand) a good chunk of my intuitions here, the above probably sounds at least a little like I’m carving out special exceptions: conceding each point individually, while maintaining that they bear little on my core thesis. To address that, let me attempt to put a finger on some of the core intuitions I’m bringing to the table:
On my model of (good) scientific research de novo, a lot of key cognitive work occurs during what you might call “generation” and “synthesis”, where “generation” involves coming up with hypotheses that merit testing, picking the most promising of those, and designing a robust experiment that sheds insight; “synthesis” then consists of interpreting the experimental results so as to figure out the right takeaway (which very rarely ought to look like “we confirmed/disconfirmed the starting hypothesis”).
Neither of these steps are easily transmissible, since they hinge very tightly on a given individual’s research ability and intellectual “taste”; and neither of them tend to end up very well described in the writeups and papers that are released afterwards. This is hard stuff even for very bright humans, which implies to me that it requires a very high quality of thought to manage consistently. And it’s these steps that I don’t think scaffolding can help much with; I think the model has to be smart enough, at baseline, that its landscape of cognitive reachability contains these kinds of insights, before they can be elicited via an external method like scaffolding.[3]
[1] I’m not sure whether you could theoretically obtain greater benefits from allowing more than O(n) iterations, but either way you’d start to bump up against context window limitations fairly quickly.
[2] Consider the extreme case where we prompt the model with (among other things) a fully fleshed out solution to the AI alignment problem, before asking it to propose a workable solution to the AI alignment problem; it seems clear enough that in this case, almost all of the relevant cognitive work happened before the model even received its prompt.
[3] I’m uncertain-leaning-yes on the question of whether you can get to a sufficiently “smart” base model via mere continued scaling of parameter count and data size; but that connects back to the original topic of whether said “smart” model would need to be capable of goal-directed thinking, on which I think I agree with Jeremy that it would; much of my model of good de novo research, described above, seems to me to draw on the same capabilities that characterize general-purpose goal-direction.
I suspect we probably have quite differing intuitions about what research processes/workflows tend to look like.
In my view, almost all research looks quite a lot (roughly) like iterative improvements on top of existing literature(s) or like literature-based discovery, combining already-existing concepts, often in pretty obvious ways (at least in retrospect). This probably applies even more to ML research, and quite significantly to prosaic safety research too. Even the more innovative kind of research, I think, often tends to look like combining existing concepts, just at a higher level of abstraction, or from more distanced/less-obviously-related fields. Almost zero research is properly de novo (not based on any existing—including multidisciplinary—literatures). (I might be biased though by my own research experience and taste, which draw very heavily on existing literatures.)
If this view is right, then LM agents might soon have an advantage even in the ideation stage, since they can do massive (e.g. semantic) retrieval at scale and much cheaper / faster than humans; + they might already have much longer short-term-memory equivalents (context windows). I suspect this might compensate a lot for them likely being worse at research taste (e.g. I’d suspect they’d still be worse if they could only test a very small number of ideas), especially when there are decent proxy signals and the iteration time is short and they can make a lot of tries cheaply; and I’d argue that a lot of prosaic safety research does seem to fall into this category. Even when it comes to the base models themselves, I’m unsure how much worse they are at this point (though I do think they are worse than the best researchers, at least). I often find Claude-3.5 to be very decent at (though maybe somewhat vaguely) combining a couple of different ideas from 2 or 3 papers, as long as they’re all in its context; while being very unlikely to be x-risky, since sub-ASL-3, very unlikely to be scheming because bad at prerequisites like situational awareness, etc.
After finishing MATS 2, I believed this:
“RL optimization is what makes an LLM potentially dangerous. LLMs by themselves are just simulators, and therefore are not likely to become a misaligned intelligent agent. Therefore, a reasonable alignment strategy is to use LLMs, (and maybe small amounts of finetuning) to build useful superhuman helpers. Small quantities of finetuning won’t shift the LLM very far from being a simulator (which is safe), so it’ll probably still be safe.”
I even said this in a public presentation to people learning about alignment. Ahh.
I now think this was wrong. I was misled by ambient memes of the time and also made the mistake of trying too hard to update on the current top-performing technology. More recently, after understanding why this was wrong, I cowrote this post in the hope that it would be a good reference for why this belief was wrong, but I think it ended up trying to do too many other things. So here’s a briefer explanation:
The main problem was that I didn’t correctly condition on useful superhuman capability. Useful superhuman capabilities involve goal-directedness, in the sense that the algorithm must have some model of why certain actions lead to certain future outcomes. It must be choosing actions for a reason algorithmically downstream of with the intended outcome. This is the only way handle new obstacles and still succeed.
My reasoning was that since LLMs don’t seem to contain this sort of algorithm, and yet are still useful, then we can leverage that usefulness without danger of misalignment. This was pretty much correct. Without goals, there are no goals to be misaligned. It’s what we do today. The mistake was that I thought this would keep being true for future-LLMs-hypothetically-capable-of-research. I didn’t grok that goal-directedness at some level was necessary to cross the gap between LLM capabilities and research capability.
My second mistake was thinking that danger was related to the quantity of RL finetuning. I muddled up agency/goal-directedness with danger, and was also wrong that RL is more likely to produce agency/goal-directedness, conditioned on high capability. It’s a natural mistake, since stereotypical RL training is designed to incentivize goal-directedness. But if we condition on high capability, it wipes out that connection, because we already know the algorithm has to contain some goal-directedness.
I was also wrong that LLMs should be thought of as simulators (although it’s a useful frame sometimes). There was a correct grain of truth in the idea that simulators would be safe. It would be great if we could train actual people-simulators. If we could build a real algorithm-level simulator of a person, this would of course be aligned (it would have the goals of the person simulated). But the way current LLMs are being built, and the way future systems will be built, they aren’t even vaguely trying to make extremely-robustly-generalizing people-simulators.[1] And they won’t, because it would involve a massive tradeoff with competence.
And the level of OOD generalization required to remain a faithful simulation during online learning and reflection is intuitively quite high.
Distinguish two notions of “goal-directedness”:
The system has a fixed goal that it capably works towards across all contexts.
The system is able to capably work towards goals, but which it does, if any, may depend on the context.
My sense is that a high level of capability implies (2) but not (1). And that (1) is way more obviously dangerous. Do you disagree?
Sure, kinda. But (2) is an unstable state. There’s at least some pressure toward (1) both during training and during online activity. This makes (1) very likely eventually, although it’s less clear exactly when.
A human that gets distracted and pursues icecream whenever they see icecream is less competent at other things, and will notice this and attempt to correct it within themselves if possible. A person that doesn’t pick up free money on tuesdays because tuesday is I-don’t-care-about-money-day will be annoyed about this on wednesday, and attempt to correct it in future.
Competent research requires at least some long-term goals. These will provide an incentive for any context-dependent goals to combine or be removed. (although the strength of this incentive is of course different for different cases of inconsistency, and the difficulty of removing inconsistency is unclear to me. Seems to depend a lot on the specifics).
This seems true to me overall, but the only reason is because (1) is more capable of competently pursuing long-term plans. Since we’re conditioning on that capability anyway, I would expect everything on the spectrum between (1) and (2) to be potentially dangerous.
If we all die because an AI put super-human amounts of optimization pressure into some goal incompatible with human survival (i.e., almost any goal if the optimization pressure is high enough) it does not matter whether the AI would have had some other goal in some other context.
But superhuman capabilities doesn’t seem to imply “applies all the optimisation pressure it can towards a goal”.
Like, being crazily good at research projects may require the ability to do goal-directed cognition. It doesn’t seem to require the habit of monomaniacally optimising the universe towards a goal.
I think whether or not a crazy good research AI is a monomaniacal universe optimiser probably depends on what kind of AI it is.
The whole approach is pretty hopeless IMHO: I mean the approach of “well, the AI will be wicked smart, but we’ll just make it so that it doesn’t want anything particularly badly or so that what it wants tomorrow will be different from what it wants today”.
It seems fairly certain to me that having a superhuman ability to do things that humans want to be done entails applying strong optimization pressure onto reality—pressure that persists as long as the AI is able to make it persist—forever, ideally, from the point of view of the AI. The two are not separate things like you hope they are. Either the AI is wicked good at steering reality towards a goal or not. If it is wicked good, then either its goal is compatible with continued human survival or not, and if not, we are all dead. If it is not wicked good at steering reality, then no one is going to be able to figure out how to use it to align an AI such that it stays aligned once it is much smarter than us.
I subscribe to MIRI’s current position that most of the hope for continued human survival comes from the (slim) hope that no one builds super-humanly smart AI until there are AI researchers that are significantly smarter and wiser than the current generation of AI designers (which will probably take centuries unless it proves much easier to employ technology to improve human cognition than most people think it is).
But what hope I have for alignment research done by currently-living people comes mostly from the hope that someone will figure out how to make an ASI that genuinely wants the same things that we want—like Eliezer has been saying since 2006 or so.
An entity could have the ability to apply such strong optimization pressures onto reality, yet decide not to.
Such an entity would be useless to us IMHO.
Surely there exists a non-useless and non-world-destroying amount of optimization pressure?
By “non-world-destroying”, I assume you mean, “non-humanity ending”.
Well, yeah, if there were a way to keep AI models to roughly human capabilities that would be great because they would be unlikely to end humanity and because we could use them to do useful work with less expense (particularly, less energy expense and less CO2 emissions) than the expense of employing people.
But do you know of a safe way of making sure that, e.g., OpenAI’s next major training run will result in a model that is at most roughly human-level in every capability that can be used to end humanity or to put and to keep humanity in a situation that humanity would not want? I sure don’t—even if OpenAI were completely honest and cooperative with us.
The qualifier “safe” is present in the above paragraph / sentence because giving the model access to the internet (or to gullible people or to a compute farm where it can run any program it wants) then seeing what happens is only safe if we assume the thing to be proved, namely, that the model is not capable enough to impose its will on humanity.
But yeah, it is a source of hope (which I didn’t mention when I wrote, “what hope I have . . . comes mostly from the hope that someone will figure out how to make an ASI that genuinely wants the same things that we want”) that someone will develop a method to keep AI capabilities to roughly human level and all labs actually use the method and focus on making the human-level AIs more efficient in resource consumption even during a great-powers war or an arms race between great powers.
I’d be more hopeful if I had ever seen a paper or a blog post by a researcher trying to devise such a method.
For completeness’s sake, let’s also point out that we could ban large training runs now worldwide, then the labs could concentrate on running the models they have now more efficiently and that would be safe (not completely safe, but much much safer than any future timeline we can realistically hope for) and would allow us to derive some of the benefits of the technology.
I do not know of such a way. I find it unlikely that OpenAI’s next training run wil result in a model that could end humanity, but I can provide no guarantees about that.
You seem to be assuming that all models above a certain threshold of capabilites will either exercise strong optimization pressure on the world in pursuit of goals, or will be useless. Put another way, you seem to be conflating capabilities with actually exerted world-optimization pressures.
While I agree that given a wide enough deployment it is likely that a given model will end up exercising its capabilities pretty much to their fullest extent, I hold that it is in principle possible to construct a mind that desires to help and is able to do so, yet also deliberately refrains from applying too much pressure.
I have encountered many people with the (according to me) mistaken model that you describe in your self-quote, and am glad to see this writeup. Indeed, I think the simulators frame frustratingly causes people to make this kind of update, which I think then causes people to get pretty confused about RL (and also to imagine some cartesian difference between next-token-prediction reward and long-term-agency reward, when the difference is actually purely a matter of degree of myopia).
Why wouldn’t myopic bias make it more likely to simulate than predict? And does’t empirical evidence about LLMs support the simulators frame? Like, what observations persuaded you, that we are not living in the world, where LLMs are simulators?
I don’t think there is any reason to assume the system is likely to choose “simulation” over “prediction”? And I don’t think we’ve observed any evidence of that.
The thing that is true, which I do think matters, is that if you train your AI system on only doing short single forward-passes, then it is less likely to get good at performing long chains of thought, since you never directly train it to do that (instead hoping that the single-step training generalizes to long chains of thought). That is the central safety property we currently rely on and pushes things to be a bit more simulator-like.
What about any time a system generalizes favourably, instead of predicting errors? You can say it’s just a failure of prediction, but it’s not like these failures are random.
And the evidence for this property, instead of, for example, the inherent bias of NNs, being central is what? Why wouldn’t predictor exhibit more malign goal-directedness even for short term goals?
I can see that this whole story about modeling LLMs as predictors, and goal-directedness, and fundamental laws of cognition is logically coherent. But where is the connection to reality?
I don’t understand, how is “not predicting errors” either a thing we have observed, or something that has anything to do with simulation?
Yeah, I really don’t know what you are saying here. Like, if you prompt a completion model with badly written text, it will predict badly written text. But also, if you predict a completion model where a very weak hash is followed by its pre-image, it will probably have learned to undo the hash, even though the source generation process never performed that (potentially much more complicated than the hashing function itself) operation, which means it’s not really a simulator.
I’m saying that this won’t work with current systems at least for strong hash, because it’s hard, and instead of learning to undo, the model will learn to simulate, because it’s easier. And then you can vary the strength of hash to measure the degree of predictorness/simulatorness and compare it with what you expect. Or do a similar thing with something other than hash, that also distinguishes the two frames.
The point is that without experiments like these, how have you come to believe in the predictor frame?
I guess it is less about simulation being the right frame and more about prediction being the wrong one. But I think we have definitely observed LLMs mispredicting things we wouldn’t want them to predict. Or is this actually a crux and you haven’t seen any evidence at all against the predictor frame?
You can’t learn to simulate an undo of a hash, or at least I have no idea what you are “simulating” and why that would be “easier”. You are certainly not simulating the generation of the hash, going token by token forwards you don’t have access to a pre-image at that point.
Of course the reason why sometimes hashes are followed by their pre-image in the training set is because they were generated in the opposite order and then simply pasted in hash->pre-image order.
I’ve seen LLMs generating text backwards. Theoretically, LLM can keep pre-image in activations, calculate hash and then output in order hash, pre-image.
I think your current view and the one reflected in ‘Without fundamental advances...’ are probably ‘wrong-er’ than your previous view.
I suspect this framing (which, maybe uncharitably, seems to me very much like a typical MIRI-agent-foundations-meme) is either wrong or, in any case, not very useful. At least it some sense, what happens at the superhuman level seems somewhat irrelevant if we can e.g. extract a lot of human-level safety research work safely (e.g. enough to obsolete previous human-produced efforts). And I suspect we probably can, given scaffolds like https://sakana.ai/ai-scientist/ and its likely improvements (especially if done carefully, e.g. integrating something like Redwood’s control agenda, etc.). I’d be curious where you’d disagree (since I expect you probably would) - e.g. do you expect the AI scientists become x-risky before they’re (roughly) human-level at safety research, or they never scale to human-level, etc.?
The same argument does apply to human-level generality.
This is the part I think is unlikely. I don’t really understand why people expect to be able to extract dramatically more safety research from AIs. It looks like it’s based on a naive extrapolation that doesn’t account for misalignment (setting aside AI-boxing plans). This doesn’t necessarily imply x-risky before human-level safety research. I’m just saying “should have goals, imprecisely specified” around the same time as it’s general enough to do research. So I expect it to be a pain to cajole this thing into doing as vaguely specified a task as “solve alignment properly”. There’s is also the risk of escape&foom, but that’s secondary.
One thing that might change my mind would be if we had exactly human researcher level models for >24 months, without capability improvement. With this much time, maybe sufficient experimentation with cajoling would get us something. But, at human level, I don’t really expect it to be much more useful than all MATS graduates spending a year after being told to “solve alignment properly”. If this is what the research quality is, then everyone will say “we need to make it smarter, it’s not making enough progress”. Then they’ll do that.
It seems to me that, until now at least, it’s been relatively easy to extract AI research out of LM agents (like in https://sakana.ai/ai-scientist/), with researchers (publicly, at least) only barely starting to try to extract research out of such systems. For now, not much cajoling seems to have been necessary, AFAICT—e.g. looking at their prompts, they seem pretty intuitive. Current outputs of https://sakana.ai/ai-scientist/ seem to me at least sometimes around workshop-paper level, and I expect even somewhat obvious improvements (e.g. more playing around with prompting, more inference time / reflection rounds, more GPU-time for experiments, etc.) to likely significantly increase the quality of the generated papers.
I think of how current systems work, at a high level, as something like the framing in Conditioning Predictive Models, so I don’t see any obvious reason to expect much more cajoling to be necessary, at least for e.g. GPT-5 or GPT-6 as the base LLM and for near-term scaffolding improvements. I could see worries about sandbagging potentially changing this, but I expect this not to be a concern in the human-level safety research regime, since we have data for imitation learning (at least to bootstrap).
I am more uncertain about the comparatively worse feedback signals for more agent-foundations (vs. prosaic alignment) kinds of research, though even here, I’d expect automated reviewing + some human-in-the-loop feedback to go pretty far. And one can also ask the automated researchers to ground the conceptual research in more (automatically) verifiable artifacts, e.g. experiments in toy environments or math proofs, the way we often ask this of human researchers too.
I also suspect we probably don’t need above-human-level in terms of research quality, because we can probably get huge amounts of automated human-level research, building on top of other automated human-level research. E.g. from What will GPT-2030 look like?: ‘GPT2030 can be copied arbitrarily and run in parallel. The organization that trains GPT2030 would have enough compute to run many parallel copies: I estimate enough to perform 1.8 million years of work when adjusted to human working speeds [range: 0.4M-10M years] (Section 3). Given the 5x speed-up in the previous point, this work could be done in 2.4 months.’ For example, I’m pretty optimistic about 1.8 million years of MATS-graduate-level work building on top of other MATS-graduate-level work. And also, I don’t think it’s obvious that it wouldn’t be possible to also safely get that same amount of Paul-Christiano-level work (or pick your favourite alignment researcher, or plausibly any upper bound of human-level alignment research work from the training data, etc.).
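For what it’s worth, a back-of-the-envelope restatement of the quoted numbers (my own arithmetic, not from the original post; the 1.8M-year figure, the 5x speed-up, and the 2.4-month window are the quote’s):

```python
# Rough check of the quoted GPT-2030 figures (my arithmetic, not the post's).
work_human_years = 1.8e6      # total human-equivalent years of work (quoted)
speedup = 5                   # each copy assumed to run at 5x human speed (quoted)
wall_clock_years = 2.4 / 12   # 2.4 months (quoted)

copies_needed = work_human_years / (speedup * wall_clock_years)
print(f"parallel copies implied: {copies_needed:,.0f}")  # ~1,800,000
```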
I think your comment illustrates my point. You’re describing current systems and their properties, then implying that these properties will stay the same as we push up the level of goal-directedness to human-level. But you’ve not made any comment about why the goal-directedness doesn’t affect all the nice tool-like properties.
It’s the difference in levels of goal-directedness. That’s the reason.
I’m not completely sure what happens when you try this, but there seem to be two main options. Either you’ve got a small civilization of goal-directed human-level agents, who have their own goals and need to be convinced to solve someone else’s problems. And then, to solve those problems, they need to be given the freedom and time to learn and experiment, gaining sixty thousand lifetimes’ worth of skills along the way.
Or, you’ve got a large collection of not-quite-agents that aren’t really capable of directing research but will often complete a well-scoped task if given it by someone who understands their limitations. Now your bottleneck is human research leads (presumably doing agent foundations). That’s a rather small resource. So your speedup isn’t massive, it’s only moderate, and you’re on a time limit and didn’t put much effort into getting a head start.
I think the goal-directedness framing has been unhelpful when it comes to predicting AI progress (especially LLM progress), and will probably keep being so at least in the near-term; and plausibly net-negative, when it comes to alignment research progress. E.g. where exactly would you place the goal-directedness in Sakana’s AI agent? If I really had to pick, I’d probably say something like ‘the system prompt’ - but those are pretty transparent, so as long as this is the case, it seems like we’ll be in ‘pretty easy’ worlds w.r.t. alignment. I still think something like control and other safety / alignment measures are important, but currently-shaped scaffolds being pretty transparent seems to me like a very important and often neglected point.
If by goal-directed you mean something like ‘context-independent goal-directedness’ (e.g. changing the system prompt doesn’t affect the behavior much), then this isn’t what I expect SOTA systems to look like, at least in the next 5 years.
I am indeed at least somewhat worried about the humans in the loop being a potential bottleneck. But I expect their role to often look more like (AI-assisted) reviewing, rather than necessarily setting (detailed) research directions. Well-scoped tasks seem great, whenever they’re feasible, and indeed I expect this to be a factor in which tasks get automated differentially soon (together with, e.g. short task horizons or tasks requiring less compute—so that solutions can be iterated on more cheaply).
The Sakana AI stuff seems basically like vaporware: https://x.com/jimmykoppel/status/1828077203956850756
Strongly doubt the vaporware claim, having read the main text of the paper and some of the appendices. For some responses, see e.g. https://x.com/labenz/status/1828618276764541081 and https://x.com/RobertTLange/status/1829104906961093107.
I have done the same. I think Robert isn’t really responding to any critiques in the linked tweet thread, and I don’t think Nathan has thought that much about it. I could totally give you LW posts and papers of the Sakana AI quality, and they would be absolutely useless (which I know because I’ve spent like the last month working on getting intellectual labor out of AI systems).
I encourage you to try. I don’t think you would get any value out of running Sakana AI, and neither do you know anyone who would.
To be clear, I don’t expect the current Sakana AI to produce anything revolutionary, and even if it somehow did, it would probably be hard to separate it from all the other less-good stuff it would produce. But I was surprised that it’s even as good as this, even having seen many of the workflow components in other papers previously (I would have guessed that it would take better base models to reliably string together all the components). And I think it might already e.g. plausibly come up with some decent preference learning variants, like some previous Sakana research (though it wasn’t automating the entire research workflow). So, given that I expect fast progress in the size of the base models (on top of the obvious possible improvements to the AI scientist, including by bringing in more stuff from other papers—e.g. following citation trails for ideas / novelty checks), improvements seem very likely. Also, coding and math seem like the most relevant proxy abilities for automated ML research (and probably also for automated prosaic alignment), and, crucially, in these domains it’s much easier to generate (including superhuman-level) verifiable, synthetic training data—so that it’s hard to be confident models won’t get superhuman in these domains soon. So I expect the most important components of ML and prosaic alignment research workflows to probably be (broadly speaking, and especially on tasks with relatively good, cheap proxy feedback) at least human-level in the next 3 years, in line with e.g. some Metaculus/Manifold predictions on IMO or IOI performance.
Taking all the above into account, I expect many parts of prosaic alignment research, and of ML research (especially those with relatively short task horizons, requiring relatively little compute, and having decent proxies to measure performance), to be automatable soon (<= 3 years). I expect most of the work on improving Sakana-like systems to happen by default and be performed by capabilities researchers, but it would be nice to have safety-motivated researchers start experimenting, or at least thinking about how (e.g. on which tasks) to use such systems. I’ve done some thinking already (around which safety tasks/subdomains might be most suitable) and hope to publish some of it soon, and I might also start playing around with Sakana’s system.
I do expect things to be messier for generating more agent-foundations-type research (which I suspect might be closer to what you mean by ‘LW posts and papers’) - because it seems harder to get reliable feedback on the quality of the research, but even there, I expect at the very least quite strong human augmentation to be possible (e.g. >= 5x acceleration) - especially given that the automated reviewing part seems already pretty close to human-level, at least for ML papers.
I think o1 is significant evidence in favor of this view.
Jeremy’s response looks to me like it mostly addresses the first branch of your disjunction (AI becomes x-risky before reaching human-level capabilities), so let me address the second:
I am unimpressed by the output of the AI scientist. (To be clear, this is not the same thing as being unimpressed by the work put into it by its developers; it looks to me like they did a great job.) Mostly, however, the output looks to me basically like what I would have predicted, on my prior model of how scaffolding interacts with base models, which goes something like this:
A given model has some base distribution on the cognitive quality of its outputs, which is why resampling can sometimes produce better or worse responses to inputs. What scaffolding does is to essentially act as a more sophisticated form of sampling based on redundancy: having the model check its own output, respond to that output, etc. This can be very crudely viewed as an error correction process that drives down the probability that a “mistake” at some early token ends up propagating throughout the entirety of the scaffolding process and unduly influencing the output, which biases the quality distribution of outputs away from the lower tail and towards the upper tail.
The key moving piece on my model, however, is that all of this is still a function of the base distribution—a rough analogy here would be to best-of-n sampling. And the problem with best-of-n sampling, which looks to me like it carries over to more complicated scaffolding, is that as n increases, the mean of the resulting distribution increases as a sublinear (actually, logarithmic) function of n, while the variance decreases at a similar rate (but even this is misleading, since the resulting distribution will have negative skew, meaning variance decreases more rapidly in the upper tail than in the lower tail).
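A quick toy simulation of the best-of-n point (my sketch; the Gaussian per-sample "quality" distribution is purely an illustrative assumption, and the exact variance/skew behaviour depends on what base distribution you assume):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy best-of-n: draw n samples of per-output "quality" and keep the best.
# The mean of the best sample grows only slowly (sublinearly) with n and its
# spread shrinks -- but no amount of resampling moves mass beyond what the
# base distribution could already produce.
for n in [1, 4, 16, 64, 256, 1024]:
    best = rng.standard_normal((100_000, n)).max(axis=1)
    print(f"n={n:5d}  mean={best.mean():.2f}  std={best.std():.2f}")
```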
Anyway, the upshot of all of this is that scaffolding cannot elicit capabilities that were not already present (in some strong sense) in the base model—meaning, if the base models in question are strongly subhuman at something like scientific research (which it presently looks to me like they still are), scaffolding will not bridge that gap for them. The only thing that can close that gap without unreasonably large amounts of scaffolding, where “unreasonable” here means something a complexity theorist would consider unreasonable, is a shifted base distribution. And that corresponds to the kind of “useful [superhuman] capabilities” Jeremy is worried about.
Strictly speaking, this seems very unlikely, since we know that e.g. CoT increases the expressive power of Transformers. And also intuitively, I expect, for example, that Sakana’s agent would be quite a bit worse without access to Semantic search for comparing idea novelty; and that it would probably be quite a bit better if it could e.g. retrieve embeddings of full paragraphs from papers, etc.
Ah, yeah, I can see how I might’ve been unclear there. I was implicitly taking CoT into account when I talked about the “base distribution” of the model’s outputs, as it’s essentially ubiquitous across these kinds of scaffolding projects. I agree that if you take a non-recurrent model’s O(1) output and equip it with a form of recurrent state that you permit to continue for O(n) iterations, that will produce a qualitatively different distribution of outputs than the O(1) distribution.
In that sense, I readily admit CoT into the class of improvements I earlier characterized as “shifted distribution”. I just don’t think this gets you very far in terms of the overarching problem, since the recurrent O(n) distribution is the one whose output I find unimpressive, and the method that was used to obtain it from the (even less impressive) O(1) distribution is a one-time trick.[1]
I also agree that another way to obtain a higher quality output distribution is to load relevant context from elsewhere. This once more seems to me like something of a red herring when it comes to the overarching question of how to get an LLM to produce human- or superhuman-level research; you can load its context with research humans have already done, but this is again a one-time trick, and not one that seems like it would enable novel research built atop the human-written research unless the base model possesses a baseline level of creativity and insight, etc.[2]
If you don’t already share (or at least understand) a good chunk of my intuitions here, the above probably sounds at least a little like I’m carving out special exceptions: conceding each point individually, while maintaining that they bear little on my core thesis. To address that, let me attempt to put a finger on some of the core intuitions I’m bringing to the table:
On my model of (good) scientific research de novo, a lot of key cognitive work occurs during what you might call “generation” and “synthesis”, where “generation” involves coming up with hypotheses that merit testing, picking the most promising of those, and designing a robust experiment that sheds insight; “synthesis” then consists of interpreting the experimental results so as to figure out the right takeaway (which very rarely ought to look like “we confirmed/disconfirmed the starting hypothesis”).
Neither of these steps is easily transmissible, since they hinge very tightly on a given individual’s research ability and intellectual “taste”; and neither of them tends to end up very well described in the writeups and papers that are released afterwards. This is hard stuff even for very bright humans, which implies to me that it requires a very high quality of thought to manage consistently. And it’s these steps that I don’t think scaffolding can help much with; I think the model has to be smart enough, at baseline, that its landscape of cognitive reachability contains these kinds of insights, before they can be elicited via an external method like scaffolding.[3]
I’m not sure whether you could theoretically obtain greater benefits from allowing more than O(n) iterations, but either way you’d start to bump up against context window limitations fairly quickly.
Consider the extreme case where we prompt the model with (among other things) a fully fleshed out solution to the AI alignment problem, before asking it to propose a workable solution to the AI alignment problem; it seems clear enough that in this case, almost all of the relevant cognitive work happened before the model even received its prompt.
I’m uncertain-leaning-yes on the question of whether you can get to a sufficiently “smart” base model via mere continued scaling of parameter count and data size; but that connects back to the original topic of whether said “smart” model would need to be capable of goal-directed thinking, on which I think I agree with Jeremy that it would; much of my model of good de novo research, described above, seems to me to draw on the same capabilities that characterize general-purpose goal-direction.
I suspect we probably have quite differing intuitions about what research processes/workflows tend to look like.
In my view, almost all research looks roughly like iterative improvement on top of existing literature(s), or like literature-based discovery: combining already-existing concepts, often in pretty obvious ways (at least in retrospect). This probably applies even more to ML research, and quite significantly to prosaic safety research too. Even the more innovative kind of research, I think, often tends to look like combining existing concepts, just at a higher level of abstraction, or from more distant/less-obviously-related fields. Almost zero research is properly de novo (not based on any existing—including multidisciplinary—literatures). (I might be biased, though, by my own research experience and taste, which draw very heavily on existing literatures.)
If this view is right, then LM agents might soon have an advantage even in the ideation stage, since they can do massive (e.g. semantic) retrieval at scale and much cheaper / faster than humans; plus, they might already have much longer short-term-memory equivalents (context windows). I suspect this might compensate a lot for them likely being worse at research taste (e.g. I’d suspect they’d still be worse if they could only test a very small number of ideas), especially when there are decent proxy signals, the iteration time is short, and they can make a lot of tries cheaply; and I’d argue that a lot of prosaic safety research does seem to fall into this category. Even when it comes to the base models themselves, I’m unsure how much worse they are at this point (though I do think they are worse than the best researchers, at least). I often find Claude-3.5 to be very decent at (though maybe somewhat vaguely) combining a couple of different ideas from 2 or 3 papers, as long as they’re all in its context; while being very unlikely to be x-risky (since it’s sub-ASL-3) and very unlikely to be scheming (since it’s bad at prerequisites like situational awareness), etc.
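To make the retrieval point concrete, here is a minimal sketch of the kind of embedding-based novelty check being described (everything here is hypothetical illustration: a real system would call an actual embedding model or a literature-search service rather than the placeholder `embed` below):

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedding; a real system would call an embedding model."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(256)
    return v / np.linalg.norm(v)

def most_similar(idea: str, literature: list[str], k: int = 3) -> list[tuple[float, str]]:
    """Rank existing abstracts by cosine similarity to a proposed idea."""
    q = embed(idea)
    scored = [(float(q @ embed(doc)), doc) for doc in literature]
    return sorted(scored, reverse=True)[:k]

# Usage: flag a proposed idea as insufficiently novel if it is too close to prior work.
prior_abstracts = ["abstract of paper A ...", "abstract of paper B ..."]
print(most_similar("a new preference-learning variant ...", prior_abstracts))
```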