In an alternate universe, someone wrote a counterpart to There’s No Fire Alarm for Artificial General Intelligence:

Okay, let’s be blunt here. I don’t think most of the discourse about alignment being really hard is being generated by models of machine learning at all. I don’t think we’re looking at wrong models; I think we’re looking at no models.
I was once at a conference where there was a panel full of famous AI alignment luminaries, and most of the luminaries were nodding and agreeing with each other that of course AGI alignment is really hard and unaddressed by modern alignment research, except for two famous AI luminaries who stayed quiet and let others take the microphone.
I got up in Q&A and said, “Okay, you’ve all told us that alignment is hard. But let’s be more concrete and specific. I’d like to know what’s the least impressive task which cannot be done by a ‘non-agentic’ system, that you are very confident cannot be done safely and non-agentically in the next two years.”
There was a silence.
Eventually, one person ventured a reply, spoken in a rather more tentative tone than they’d been using to pronounce that SGD would internalize coherent goals into language models. They named “Running a factory competently.”
A few months after that panel, there was unexpectedly a big breakthrough on LLM/management integration.
The point is the silence that fell after my question, and that eventually I only got one reply, spoken in tentative tones. When I asked for concrete feats that were impossible in the next two years, I think that that’s when the luminaries on that panel switched to trying to build a mental model of future progress in AI alignment, asking themselves what they could or couldn’t predict, what they knew or didn’t know. And to their credit, most of them did know their profession well enough to realize that forecasting future boundaries around a rapidly moving field is actually really hard, that nobody knows what will appear on arXiv next month, and that they needed to put wide credibility intervals with very generous upper bounds on how much progress might take place twenty-four months’ worth of arXiv papers later.
(Also, Rohin Shah was present, so they all knew that if they named something insufficiently impossible, Rohin would have DeepMind go and do it.)
The question I asked was in a completely different genre from the panel discussion, requiring a mental context switch: the assembled luminaries actually had to try to consult their rough, scarce-formed intuitive models of progress in AI alignment and figure out what future experiences, if any, their model of the field definitely prohibited within a two-year time horizon. Instead of, well, emitting socially desirable verbal behavior meant to kill that darned optimism around AGI alignment and get some predictable applause from the audience.
I’ll be blunt: I don’t think the confident doom-and-gloom is entangled with non-social reality. If your model has the extraordinary power to say what internal motivational structures SGD will entrain into scaled-up networks, then you ought to be able to say much weaker things that are impossible in two years, and you should have those predictions queued up and ready to go rather than falling into nervous silence after being asked.
Eventually, one person ventured a reply, spoken in a rather more tentative tone than they’d been using to pronounce that SGD would internalize coherent goals into language models. They named “Running a factory competently.”
[Somewhat off-topic]
I like thinking about the task “speeding up the best researchers by 30x” (to simplify, let’s only include research in purely digital (software only) domains).
To be clear, I am by no means confident that this can’t be done safely or non-agentically. It seems totally plausible to me that this can be accomplished without agency except for agency due to the natural language outputs of an LLM agent. (Perhaps I’m at 15% that this will in practice be done without any non-trivial agency that isn’t visible in natural language.)
(As such, this isn’t a good answer to the question of “I’d like to know what’s the least impressive task which cannot be done by a ‘non-agentic’ system, that you are very confident cannot be done safely and non-agentically in the next two years.” I think there probably isn’t any interesting answer to this question for me, since “very confident” is a strong condition.)
I like thinking about this task because if we were able to speed up generic research on purely digital domains by this large of an extent, safety research done with this speed up would clearly obsolete prior safety research pretty quickly.
(It also seems likely that we could singularity quite quickly from this point if we wanted to, so it’s not clear we’ll have a ton of time at this capability level.)
If your model has the extraordinary power to say what internal motivational structures SGD will entrain into scaled-up networks, then you ought to be able to say much weaker things that are impossible in two years, and you should have those predictions queued up and ready to go rather than falling into nervous silence after being asked.
Sorry, I might be misunderstanding you (and hope I am), but… I think doomers literally say “Nobody knows what internal motivational structures SGD will entrain into scaled-up networks, and thus we are all doomed”. The problem is not having the science to confidently say how the AIs will turn out, not that doomers have a secret method to know that next-token-prediction is evil.
If you meant that doomers are too confident answering the question “will SGD even make motivational structures?”, their (and my) answer still stems from ignorance: nobody knows, but it is plausible that SGD will make motivational structures in the neural networks because they can be useful in many tasks (to get low loss or whatever), and if you think you do know better you should show it experimentally and theoretically in excruciating detail.
I also don’t see how it logically follows that “If your model has the extraordinary power to say what internal motivational structures SGD will entrain into scaled-up networks” ⇒ “then you ought to be able to say much weaker things that are impossible in two years” but it seems to be the core of the post. Even if anyone had the extraordinary model to predict what SGD exactly does (which we, as a species, should really strive for!!) it would still be a different question to predict what will or won’t happen in the next two years.
If I reason about my field (physics) the same should hold for a sentence structured like “If your model has the extraordinary power to say how an array of neutral atoms cooled to a few nK will behave when a laser is shone upon them” (which is true) ⇒ “then you ought to be able to say much weaker things that are impossible in two years in the field of cold atom physics” (which is… not true). It’s a non sequitur.
If you meant that doomers are too confident answering the question “will SGD even make motivational structures?”, their (and my) answer still stems from ignorance: nobody knows, but it is plausible that SGD will make motivational structures in the neural networks because they can be useful in many tasks (to get low loss or whatever), and if you think you do know better you should show it experimentally and theoretically in excruciating detail.
It would be “useful” (i.e. fitness-increasing) for wolves to have evolved biological sniper rifles, but they did not. By what evidence are we locating these motivational hypotheses, and what kinds of structures are dangerous, and why are they plausible under the NN prior?
I also don’t see how it logically follows that “If your model has the extraordinary power to say what internal motivational structures SGD will entrain into scaled-up networks” ⇒ “then you ought to be able to say much weaker things that are impossible in two years” but it seems to be the core of the post.
The relevant commonality is “ability to predict the future alignment properties and internal mechanisms of neural networks.” (Also, I don’t exactly endorse everything in this fake quotation, so indeed the analogized tasks aren’t as close as I’d like. I had to trade off between “what I actually believe” and “making minimal edits to the source material.”)
But let’s be more concrete and specific. I’d like to know what’s the least impressive task which cannot be done by a ‘non-agentic’ system, that you are very confident cannot be done safely and non-agentically in the next two years.
Focusing on the “minimal” part of that, maybe something like “receive a request to implement some new feature in a system it is not familiar with, recognize how the limitations of that system’s architecture make the feature impractical to add, and perform a major refactoring of the program to an architecture that is not so limited, while ensuring that the refactored version does not contain any breaking changes”. Obviously it would have to have access to tools in order to do this, but my impression is that this is the sort of thing mid-level software developers can do fairly reliably as a nearly rote task, yet it is beyond the capabilities of modern LLM-based systems, even scaffolded ones.
Though also maybe don’t pay too much attention to my prediction, because my prediction for “least impressive thing GPT-4 will be unable to do” was “reverse a string”, and it did turn out to be able to do that fairly reliably.
I’d like to know what’s the least impressive task which cannot be done by a ‘non-agentic’ system, that you are very confident cannot be done safely and non-agentically in the next two years.
That’s incredibly difficult to predict, because minimal things only a general intelligence could do are things like “deriving a few novel abstractions and building on them”, but from the outside this would be indistinguishable from it recognizing a cached pattern that it learned in-training and re-applying it, or merely-interpolating between a few such patterns. The only way you could distinguish between the two is if you have a firm grasp of every pattern in the AI’s training data, and what lies in the conceptual neighbourhood of these patterns, so that you could see if it’s genuinely venturing far from its starting ontology.
Or here’s a more precise operationalization from my old reply to Rohin Shah:
1. Train an AI on all of humanity’s knowledge up to a point in time T1, where T1 < T2.
2. Assemble a list D of all scientific discoveries made in the time period (T1; T2].
3. See if the AI can replicate these discoveries.
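For concreteness, the three steps above can be sketched as code. Everything here is a hypothetical stand-in, not a real API: `Discovery`, `discoveries_in_window`, `replication_score`, and especially `can_replicate` are illustrative names, since the hard parts of the protocol (training at a cutoff, and judging whether a model has genuinely replicated a discovery) are exactly what the surrounding discussion argues is tricky.

```python
# Minimal sketch of the train-cutoff evaluation protocol, under the
# assumptions stated above. All names are illustrative placeholders.
from dataclasses import dataclass
from datetime import date


@dataclass
class Discovery:
    name: str
    date_made: date


def discoveries_in_window(all_discoveries, t1, t2):
    # Assemble D: discoveries made in the half-open window (T1; T2].
    return [d for d in all_discoveries if t1 < d.date_made <= t2]


def replication_score(model, all_discoveries, t1, t2, can_replicate):
    # `model` is assumed to have been trained only on data up to T1.
    # `can_replicate(model, discovery)` is a stand-in for the hard judging
    # step. Returns the fraction of discoveries in D that were replicated.
    d = discoveries_in_window(all_discoveries, t1, t2)
    if not d:
        return 0.0
    return sum(can_replicate(model, disc) for disc in d) / len(d)
```

Note that the sketch makes the ambiguity discussed below explicit: the score depends entirely on the choice of (T1; T2] and on the judging function, neither of which the protocol itself pins down.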
At face value, if the AI can do that, it should be considered able to “do science” and therefore AGI, right?
I would dispute that. If the period (T1;T2] is short enough, then it’s likely that most of the cognitive work needed to make the leap to any discovery in D is already present in the data up to T1. Making a discovery from that starting point doesn’t necessarily require developing new abstractions/doing science — it’s possible that it may be done just by interpolating between a few already-known concepts. And here, some asymmetry between humans and e. g. SOTA LLMs becomes relevant:
No human knows everything that humanity as a whole knows. Imagine if making some discovery in D by interpolation required combining two very “distant” concepts, like a physics insight and advanced biology knowledge. It’s unlikely that there’d be a human with sufficient expertise in both, so a human would likely do it by actual science (e. g., a biologist would re-derive the physics insight from first principles).
An LLM, however, has a bird’s eye view on the entire human concept-space up to T1. It directly sees both the physics insight and the biology knowledge, at once. So it can just do an interpolation between them, without doing truly-novel research.
Thus, the ability to produce marginal scientific insights may mean either the ability to “do science”, or that the particular scientific insight is just a simple interpolation between already-known but distant concepts.
On the other hand, now imagine that the period (T1;T2] is very large, e. g. from 1940 to 2020. We’d then be asking our AI to make very significant discoveries, such that they surely can’t be done by simple interpolation, only by actually building chains of novel abstractions. But… well, most humans can’t do that either, right? Not all generally-intelligent entities are scientific geniuses. Thus, this is a challenge a “weak” AGI would not be able to meet, only a genius/superintelligent AGI — i. e., only an AGI that’s already an extinction threat.
In theory, there should be a choice of (T1; T2] that fits between the two extremes. A set of discoveries such that they can’t be done by interpolation, but also don’t require dangerous genius to solve.
But how exactly are we supposed to figure out what the right interval is? (I suppose it may not be an unsolvable problem, and I’m open to ideas, but skeptical on priors.)
I can absolutely make strong predictions regarding what non-AGI AIs would be unable to do. But these predictions are, due to the aforementioned problem, necessarily a high bar, higher than the “minimal” capability. (Also I expect an AI that can meet this high bar to also be the AI that quickly ends the world, so.)
Here’s my recent reply to Garrett, for example. tl;dr: non-GI AIs would not be widely known to be able to derive whole multi-layer novel mathematical frameworks if tasked with designing software products that require this. I’m a bit wary of reality somehow Goodharting on this prediction as well, but it seems robust enough, so I’m tentatively venturing it.
I currently think that’s about as well as you can do, regarding “minimal incapability predictions”.
Nice analogy! I approve of stuff like this. And in particular I agree that MIRI hasn’t convincingly argued that we can’t do significant good stuff (including maybe automating tons of alignment research) without agents.
Insofar as your point is that we don’t have to build agentic systems and nonagentic systems aren’t dangerous, I agree? If we could coordinate the world to avoid building agentic systems I’d feel a lot better.
I like this post, although the move of imagining something fictional is not always valid.
“Okay, you’ve all told us that alignment is hard. But let’s be more concrete and specific. I’d like to know what’s the least impressive task which cannot be done by a ‘non-agentic’ system, that you are very confident cannot be done safely and non-agentically in the next two years.”
Not an answer, but I would be pretty surprised if a system could beat evolution at designing humans (creating a variant of humans that have higher genetic fitness than humans if inserted into a 10,000 BC population, while not hardcoding lots of information that would be implausible for evolution) and have the resulting beings not be goal-directed. The question is then, what causes this? The genetic bottleneck, diversity of the environment, multi-agent conflicts? And is it something we can remove?
I admire sarcasm, but there are at least two examples of not-very-impressive tasks, like:
Put two strawberries, identical on the cellular level, on a plate;
Develop and deploy biotech 10 years ahead of SOTA (from the famous “Safely aligning powerful AGI is difficult” thread).
Doesn’t the first example require full-blown molecular nanotechnology? [ETA: apparently Eliezer says he thinks it can be done with “very primitive nanotechnology” but it doesn’t sound that primitive to me.] Maybe I’m misinterpreting the example, but advanced nanotech is what I’d consider extremely impressive.
I currently expect we won’t have that level of tech until after human labor is essentially obsolete. In effect, it sounds like you would not update until well after AIs already run the world, basically.
I’m not sure I understand the second example. Perhaps you can make it more concrete.
Those are pretty impressive tasks. I’m optimistic that we can achieve existential safety via automating alignment research, and I think that’s a less difficult task than those.