And the AI there goes over a critical threshold, which most obviously could be like, can write the next AI.
Yes, but it won’t blow up forever. It’s going to self-amplify until the next bottleneck. Bottlenecks like: (1) the amount of compute available, (2) the amount of money or robotics to affect the world, (3) the difficulty of the tasks in the “AGI gym” it is benchmarking future versions of itself in.
Once the tasks are solved as far as the particular task allows, reward gradients go to zero or sinusoidally oscillate, and there is no signal to cause development of more intelligence.
This is just like the self-feedback from an op amp—voltage rises until it’s VCC.
I’d say that it’s difficult to align an AI on a task like “build two identical strawberries.” Or no: “take this strawberry and make me another strawberry that’s identical to this strawberry down to the cellular level, but not necessarily the atomic level.”
Can you solve this with separated tool AIs? It sounds rather solvable that way and not particularly difficult to do from a software system perspective (the biology part is extremely hard). It’s functionally the same problem as “copy this plastic strawberry”; you just need much greater capabilities and more sophisticated equipment.
The “copy the plastic strawberry” task breaks into a step to select the method to scan the strawberry, and a step to select the method to manufacture the copy (so you might pick “lidar scanner + camera, 3D printer”, or “many photographs from all angles, injection molding”). So you would want an AI agent that does the meta-selection of the “plan” to copy the strawberry, based on the cost/benefit for each permutation above, then one that does the scanning, and one that does the printing, where human services may “substitute” for an AI on steps where that is cheaper.
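Here is a minimal sketch (in Python, with entirely made-up method names, costs, and fidelity numbers) of what that meta-selection over scan/manufacture permutations could look like:

```python
# Hypothetical sketch of the plan meta-selection step: enumerate
# (scan method, manufacture method) permutations and rank by cost/benefit.
# All methods, costs, and fidelity scores are illustrative, not real data.
from itertools import product

scan_methods = {
    "lidar_plus_camera": {"cost": 500,  "fidelity": 0.90},
    "photogrammetry":    {"cost": 50,   "fidelity": 0.75},
}
manufacture_methods = {
    "3d_print":          {"cost": 20,   "fidelity": 0.80},
    "injection_molding": {"cost": 2000, "fidelity": 0.95},
}

def plan_score(scan, make, budget=5000):
    """Crude cost/benefit heuristic: expected copy fidelity per dollar spent."""
    total_cost = scan["cost"] + make["cost"]
    if total_cost > budget:
        return float("-inf")          # plan rejected outright
    return (scan["fidelity"] * make["fidelity"]) / total_cost

best_plan = max(
    product(scan_methods.items(), manufacture_methods.items()),
    key=lambda pair: plan_score(pair[0][1], pair[1][1]),
)
print("chosen plan:", best_plan[0][0], "+", best_plan[1][0])
```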
The biotech version is a very expanded version of the same idea: you’re going to need large labs and cell lines, or a lot of research into strawberry growth and scaffolding. The agent that develops the plan estimated to succeed might populate a very large plan file, with a summary amounting to trillions of dollars of resources and a very large biotech complex to carry out the needed research, but a strawberry has finite cells, it probably won’t “destroy the world”, and the expense request probably won’t be approved by humans. (Or not; on further thought this particular problem might be considerably easier. You wouldn’t print the cells, but instead grow many strawberries in sterile biolab conditions and determine the influence of external factors and internal signals on the final position of all the cells and the external shape. Then just grow one in place that meets tolerances, which are presumably limited to whatever a human can actually perceive when checking if the strawberry is the same one.)
Well, the person who actually holds a coherent technical view, who disagrees with me, is named Paul Christiano
What about Eric Drexler?
Builds the ribosome, but the ribosome that builds things out of covalently bonded diamondoid instead of proteins folding up and held together by Van der Waals forces, builds tiny diamondoid bacteria. The diamondoid bacteria replicate using atmospheric carbon, hydrogen, oxygen, nitrogen, and sunlight. And a couple of days later, everybody on earth falls over dead in the same second.
Speaking of Eric Drexler, this is not possible under a more coherent model of the road to nanotechnology. Eliezer should have a discussion with Drexler on this, but in short, even an infinitely smart superintelligence cannot do the above without clean data to fill in the missing information that human experiments never collected. It is ultimately possible, it just would require more steps, and those steps would have a cost and probably be visible to humans (enormous factories, lots of money spent, that sort of thing).
Also, this specific claim is probably outside the scope of what structures built from amino acids can accomplish, at least not without bootstrapping.
Well, there was a conference one time on what are we going to do about looming risk of AI disaster, and Elon Musk attended that conference.
Which conference? Who set up the conference? Was EY pivotally involved? Does he have his fingerprints on the gun? :)
Yes, but it won’t blow up forever. It’s going to self-amplify until the next bottleneck. Bottlenecks like: (1) the amount of compute available, (2) the amount of money or robotics to affect the world, (3) the difficulty of the tasks in the “AGI gym” it is benchmarking future versions of itself in.
Once the tasks are solved as far as the particular task allows, reward gradients go to zero or sinusoidally oscillate, and there is no signal to cause development of more intelligence.
This is just like the self-feedback from an op amp—voltage rises until it’s VCC.
I agree that it wouldn’t start blowing up uniformly forever, but rather, hit some bottleneck. However, “can write the next AI” still seems like a reasonable guess for something that happens shortly before the end. After all, Eliezer’s argument isn’t dependent on the AGI acquiring infinite intelligence. If the AGI can already write its own better successor, then it’s a good guess that it’s already better than top humans at a wide array of tasks. The successor it writes will be even better. Let’s say for the sake of a concrete number that the self-improvement tops out at 5 iterations of writing-a-better-successor. That’s pretty small, I think, but already suggests that several years’ worth of human AGI research happens in a much smaller amount of time.
And then it intelligently sets about the task of overcoming those other bottlenecks you mention.
It seems pretty easy to accumulate a lot more compute, while behaving in a way completely in-line with what a friendly, aligned AGI would do. Humans would naturally want to supply more compute, and it could provide improved chip fab ideas if needed.
I don’t think it even needs money or robotics. It would be at least as popular as chatGPT, and more persuasive, so it could convince a lot of people to listen to it, to carry out various actions.
I disagree with the “difficulty of the tasks” bottleneck. This seems super not bottleneck-y. AI research doesn’t only/primarily mean throwing more compute at the same dataset. (It’s only the recent GPT-like stuff that’s worked that way. ;p) Normally AI research involves coming up with new tasks and new datasets, plus new neural network architectures, new optimization methods (mostly better versions of gradient descent, in recent years), etc.
So “gradients going to zero” isn’t a bottleneck, if the AI is over the ‘critical threshold’ of ‘write the next AI’. At that point, the AI is taking on the job of human researchers; a job that doesn’t stop once gradients go to zero.
However, “can write the next AI” still seems like a reasonable guess for something that happens shortly before the end.
I disagree and I think you should update your view as well.
This is because “write the next AI” need not be a task that is particularly complex, or beyond the ability of RL models or LLMs.
Here’s why. A neural network architecture can be thought of as a graph of nodes, where at each layer you simply choose the layer type and how to connect it.
You can grid search possible architectures as they are just numerical coordinates from a permutation space.
A higher level “cognitive architecture”—an architecture that interconnects modules that are inputs, neural networks, outputs, memory modules, and so on—is also a similar graph, and also can be described as simple numerical coordinates.
Basically any old RL agent from an AI gym could interact with this interface to “writing another AI”, since all the model must do is output a number with as many bits as the permutation space of possible models.
Note that this space is very large, and I expect you would use SOTA models.
Let me know if I need to draw you a picture. This is important because bootstrapping possible cognitive architectures using current AI is a potential route to very-near-future AGI.
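As a picture of sorts, here is a minimal sketch of the idea that a candidate design is just a coordinate in a permutation space; the layer types, fan-in choices, and graph depth below are illustrative placeholders, not a real design space:

```python
# Hypothetical sketch: every architecture is an integer index that decodes
# into a list of (layer type, connection) choices, one per graph node.
LAYER_TYPES = ["dense", "conv", "attention", "memory", "skip"]
FAN_IN_CHOICES = [1, 2, 3]          # how many earlier nodes feed this one
NUM_NODES = 6                        # depth of the candidate graph

CHOICES_PER_NODE = len(LAYER_TYPES) * len(FAN_IN_CHOICES)
SPACE_SIZE = CHOICES_PER_NODE ** NUM_NODES   # total permutation space

def decode(index: int):
    """Turn a single integer into a concrete architecture description."""
    assert 0 <= index < SPACE_SIZE
    nodes = []
    for _ in range(NUM_NODES):
        index, choice = divmod(index, CHOICES_PER_NODE)
        layer = LAYER_TYPES[choice % len(LAYER_TYPES)]
        fan_in = FAN_IN_CHOICES[choice // len(LAYER_TYPES)]
        nodes.append({"layer": layer, "fan_in": fan_in})
    return nodes

# Any agent that can output an integer can now "propose an AI":
print(SPACE_SIZE, "possible designs")
print(decode(123456))
```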
The reason it won’t necessarily be “the end” has to do with how we evaluate those architectures. We would have a benchmark of possible tasks—similar to current papers—and would look for the highest-scoring architectures on that benchmark.
As these tasks will be things ranging from text completion or question answering to playing Minecraft, there is not sufficiently challenging information to develop things like human manipulation or deception (since there are no humans to learn from by socializing with in an automated benchmark, and the benchmark doesn’t reward deception, just winning the games in it).
I think we possibly have pretty close views here, and are just describing them differently.
I interpreted “write the next AI” to indicate the sort of thing humans do when designing AI. I certainly interpreted Eliezer to be indicating something similarly sophisticated—not just fancy architecture search.
A much more sophisticated thing, which we are already seeing the first signs of, is AIs capably writing AI code. This is much different than what you describe, since language models are not doing anything like “have a benchmark of possible tasks and look for the highest scoring architectures”. Instead, large language models apply the same sort of general-purpose reasoning that they apply to everything else.
Imagine that sort of capability, combined with mildly superhuman cross-domain reasoning (by which I mean something like, reasoning like excellent human domain experts in every individual domain, but being able to combine reasoning across domains to get mildly superhuman insights; like a super-ChatGPT), plus the ability to fluently and autonomously invent and run tests, interactively as part of the design process. (Much like Bing/Sydney autonomously runs searches as part of crafting responses.)
That kind of system seems like gigatons of gunpowder waiting to be set off, in the sense that (in the context of an AI lab with sufficient data and computing power already at its fingertips) you can just ask it to write yet-more-powerful AI code, and it quite possibly will, quite possibly with little concern for alignment (if it’s basically imitating top-of-the-field AI programmers).
That’s exactly what I am talking about. One divergence in our views is you haven’t carefully examined current gen AI “code” to understand what it does. (note that some of my perspective is informed because all AI models are similar at the layer I work at, on runtime platforms)
If you examine the few thousand lines of Python source, especially the transformer model, you will realize that functionally that pipeline I describe of “input, neural network, output, evaluation” is all that the above source does. You could in fact build a “general framework” that would allow you to define many AI models, almost all of which humans have never tested, without writing one line of new code.
So the full process is:
[1] A benchmark of many tasks. Tasks must be autogradeable, human participants must be able to ‘play’ the tasks so we have a control-group score, tasks must push the edge of human cognitive ability (so the average human scores nowhere close to the max score, and top-1% humans do not max the bench either), and there must be many tasks, each with a rich permutation space (so it isn’t possible for a model to memorize all permutations).
[2] A heuristic weighted score on this benchmark intended to measure how “AGI-like” a model is. So it might be the RMSE across the benchmark, but with a lot of score weighting on zero-shot, cross-domain/multimodal tasks. That is, the kind of model that can use information from many different previous tasks on a complex exercise it has never seen before is closer to an AGI, or closer to replicating “Leonardo da Vinci”, who had exceptional human performance presumably from all this cross-domain knowledge.
[3] In the computer science task set, there are tasks to design an AGI for a bench like this. The model proposes a design, and if that design has already been tested, it immediately receives detailed feedback on how it performed.
As I mentioned, the “design an AGI” subtask can be much simpler than “write all the boilerplate in Python”, but these models will be able to do that if needed.
As task scores approach human level across a broad set of tasks, you have an AGI. You would expect it to almost immediately improve to a low superintelligence. As AGIs get used in the real world and fail to perform well at something, you add more tasks to the bench, and/or automate creating simulated scenarios that use robotics data.
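A toy sketch of that loop, under the assumptions above (task names, weights, and the human baseline are placeholders, and the per-task evaluation is stubbed out with deterministic noise):

```python
# Hypothetical sketch of steps [1]-[3]: score a proposed design across many
# auto-gradeable tasks, weight zero-shot / cross-domain tasks more heavily,
# and give instant feedback on already-tested designs.
import random

TASKS = [
    {"name": "question_answering",  "zero_shot": False, "weight": 1.0},
    {"name": "minecraft_build",     "zero_shot": False, "weight": 1.0},
    {"name": "unseen_cross_domain", "zero_shot": True,  "weight": 3.0},
]

def run_task(design, task):
    """Placeholder for training/evaluating the design on one task.
    Returns a normalized score in [0, 1]; here it is deterministic noise."""
    random.seed(hash((str(design), task["name"])))
    return random.random()

def agi_score(design, human_baseline=0.6):
    """Step [2]: weighted benchmark score relative to a human control group
    (>1.0 means above the control-group average)."""
    total = sum(t["weight"] * run_task(design, t) for t in TASKS)
    return (total / sum(t["weight"] for t in TASKS)) / human_baseline

def propose_and_evaluate(designer, history):
    """Step [3]: the designer proposes a design; if it was already tested,
    it immediately gets the recorded score back as feedback."""
    design = designer(history)
    key = str(design)
    if key not in history:
        history[key] = agi_score(design)
    return key, history[key]

# Example: a trivial "designer" that just proposes the next integer coordinate.
history = {}
print(propose_and_evaluate(lambda h: len(h) + 1, history))
```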
I’m having some trouble distinguishing whether there’s a disagreement. My reading of your tone is that you think there is a large disagreement. I’m going to sketch my impression of the conversation so far, so that you can point out where I’ve been interpreting you incorrectly, if necessary.
Your initial comment.
You had a bunch of questions. I focused on the first one. Your central thesis was that an intelligence explosion doesn’t escalate forever, but instead reaches some bottlenecks. Of particular importance to our discussion so far, you argue that the self-improvement process stops when loss hits zero.
Reading between the lines: Although you didn’t explicitly state where you disagreed with Eliezer, I inferred that you thought this blocked an important part of his argument. Since I think Eliezer 100% agrees that things don’t go forever, but rather flatten out somewhere, I assume that the general drift of your argument is that things flatten out a lot sooner than Eliezer thinks, in some important sense. I am still not confident of this! It would be helpful to me if you spelled out your view here in more detail. Do you have dramatically different assessments of the overall risks than Eliezer?
My first response.
I explained that I agree that the process hits bottlenecks at some point (to clarify: I think there’s probably a succession of bottlenecks of different kinds, leading up to the ultimate physical limits). In my view this doesn’t seem to detract from Eliezer’s argument.
Your first response.
You explain that you don’t think “write the next AI” is particularly complex, and explain how you see it working.
My second response.
I agree with this assessment for the notion of “write the next AI” that you are using. To boil it down to a single statement, I would say that your version of “write the next AI” involves optimizing the whole system on some benchmarks. I agree that this sort of process will reach an end when loss hits zero.[1]
I suggest that Eliezer meant a different sort of thing, which captures more of what human ML researchers do. I sketch what a near-future version of that more general sort of thing could look like, supposing we reach mildly superhuman capabilities within the current LLM paradigm.
Your second (and latest) response.
You suggest that my alternative is already exactly what you are suggesting by “write the next AI” as well; there are not two qualitatively different pictures, one involving “optimizing the whole system on benchmarks” and a second one which goes beyond that somehow. There is just the one picture.
One divergence in our views is you haven’t carefully examined current gen AI “code” to understand what it does. (note that some of my perspective is informed because all AI models are similar at the layer I work at, on runtime platforms)
I agree with this—I haven’t. Still, I’m somewhat baffled by your argument here.
If you examine the few thousand lines of Python source, especially the transformer model, you will realize that functionally that pipeline I describe of “input, neural network, output, evaluation” is all that the above source does.
This doesn’t surprise me in the slightest??
Like, that’s exactly what I would have expected.
However, while these LLMs are in their codebase an application of the general technique “minimize loss on an evaluation”, they’ve also given rise to a whole new paradigm for getting what you want from AI, called prompt engineering. Instead of crafting a dataset or an RL environment (or a suite of lots of such things), you craft an English statement which, for example, asks ChatGPT to produce a python program for you.
I disagree that your overall sketch of the “full process” matches what I intended with my sketch in my previous comment. To put it simply, you have been sketching a picture where optimization is applied to a suite of problems, to support your argument that minimization of training loss presents a major bottleneck for superintelligent self-improvement. I think human ML engineers already know how to get around this bottleneck; as you yourself mention,
As AGIs get used in the real world and fail to perform well at something, you add more tasks to the bench, and/or automate creating simulated scenarios that use robotics data.
The core of my argument is that human-level AGIs can get around this problem if humans can. I sought to illustrate this by sketching a scenario using the paradigm of prompt engineering, rather than optimization, so that the ‘core loop’ of the AGI wasn’t doing optimization. In this case there is no strong reason to suppose that reaching minimal loss would be a big obstacle stopping mildly superhuman intelligence from bootstrapping to much higher intelligence.
So here is my overall take on the current state of the discussion:
So far, you have said many things that I agree with, while I (and apparently Eliezer) have said several things that you disagree with, but I am unfortunately not clear on exactly which things you disagree with and what your view is.
I believe the original top-level question is something like: whether mildly superhuman stuff (which you explicitly argue self-improvement can bootstrap to) can self-improve to drastically superhuman. I assume you think this is wrong, given the way you are arguing. However, you have not explicitly stated this, and I am not sure whether that’s the intended implication of your arguments, or a misreading on my part.
I think your core case for this is the loss minimization bottleneck (or at least, the part we have been focusing on—you initially mentioned a range of other bottlenecks). So I infer that you think the loss-minimization bottleneck is around the mildly superhuman level.
It’s not clear to me why this should be the case. If the entire suite of problems is based around human imitation, sure. However, this doesn’t seem to be your suggestion. Instead you recommend a broad variety of tasks at the edge of human capability. Obviously, there are many tasks like this (such as chess and Go) for which greatly superhuman performance is possible.
It also seems important to consider the grokking[1] literature, which shows significant improvements from continued training even after predictive loss is minimal.
So it seems quite possible to me that the proposal you are sketching is a dangerous one, given sufficient resources, whereas I have the vague unconfirmed impression that you think it’s not.
But I also want to side-step that whole debate, by pointing out that human ML engineers already have ways to get around the minimal-loss bottleneck (IE, add harder problems to the benchmark), so a self-improving AGI should also. I continue to think that you are interpreting “write the next AI” differently from Eliezer, since I think it’s pretty clear from context that Eliezer imagines something which can do roughly anything a smart human ML engineer can, whereas it seems to me that you are trying to sketch a version of “write the next AI” which has some fundamental limitations which a human ML engineer lacks.
But I’m well into the territory of guessing what you’re thinking, so a lot of the above probably misses the mark?
A very important caveat here is that the process only stops when loss hits the global minimum including regularization penalties. The Grokking results show that improvements continue to occur with further training well past the point where training error has reached zero. Further optimization can find simpler approaches which generalize better.
Ok so this collapses to two claims I am making. One is obviously correct but testable, the other is maybe correct.
1. I am saying we can have humans, with a little help from current-gen LLMs, build a framework that can represent every deep learning technique since 2012, as well as a near-infinite space of other untested techniques, in a form such that any agent that can output a number can try to design an AGI. (Note that blind guessing is not expected to work; the space is too large.)
So the simplest RL algorithms possible can actually design AGIs, just rather badly.
This means that with this framework, the AGI designer can do everything that human ML researchers have done over the last 10 years. Plus many more things. Inside this permutation space would be both many kinds of AGI and human brain emulators as well.
This claim is “obviously correct but testable”.
2. I am saying that, over a large benchmark of human-designed tasks, the AGI would improve until the reward gradient approaches zero, a level I would call a “low superintelligence”. This is because I assume even a “perfect” game of Go is not the same kind of task as “organizing an invasion of the earth” or “building a solar system sized particle accelerator in the real world”.
The system is throttled because the “evaluator” of how well it did on a task was written by humans, and our understanding and cognitive sophistication in even designing these games is finite.
The expectation is it’s smarter than us, but not by such a gap that we are insects.
You had some confusion over “automated task space addition”. I was referring to things like a robotics task, where the machine is trying to “build factory widget X”. Real robots in a factory encounter an unexpected obstacle and record it. This is auto-translated into the framework of the “factory simulator”. The factory simulator is still using human-written evaluators, just now there is, say, “chewing gum brand 143” as a spawnable object in the simulator, with properties that a robot has observed in the real world, and future AGIs must be able to deal with chewing gum interrupting their widget manufacturing. So you get automated robustness increases. Note that Tesla has demoed this approach.
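A rough sketch of what that auto-translation step could look like; all field names, object properties, and values are invented for illustration:

```python
# Hypothetical sketch of "automated task space addition": an obstacle
# observed by a real robot is converted into a spawnable object in the
# factory simulator, so future policies must handle it.
from dataclasses import dataclass, field

@dataclass
class SimObject:
    name: str
    friction: float
    adhesion: float            # e.g. chewing gum sticking to grippers

@dataclass
class FactorySimTask:
    goal: str                  # graded by a human-written evaluator
    spawnable: list = field(default_factory=list)

def ingest_real_world_observation(task: FactorySimTask, log: dict) -> FactorySimTask:
    """Convert a logged, unexpected obstacle into a new spawnable sim object."""
    task.spawnable.append(SimObject(
        name=log["label"],
        friction=log["measured_friction"],
        adhesion=log["measured_adhesion"],
    ))
    return task

task = FactorySimTask(goal="assemble widget X")
robot_log = {"label": "chewing_gum_brand_143",
             "measured_friction": 0.9, "measured_adhesion": 0.8}
print(ingest_real_world_observation(task, robot_log).spawnable)
```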
But even if the above is true, the system will be limited by either hardware—it just doesn’t have the compute to be anything but a “low” superintelligence—or access to robotics. Maybe it could know and learn everything but we humans didn’t build enough equipment (yet).
So the system is throttled by the lowest of three “soft barriers”: training tasks, hardware, robotics. And the expectation is that at this level it’s still not “out of control” or unstoppable.
This is where our beliefs diverge. I don’t think EY, having no formal education or engineering experience, understands these barriers. He’s like von Neumann designing a theoretical replicator—in his mental model all the bottlenecks are minor.
I do concede that these are soft barriers—intelligence can be used to methodically reduce each one, just it takes time. We wouldn’t be dead instantly.
The other major divergence is that if you consider how an AGI trained this way will likely behave, it will almost certainly act just like current LLMs. Give it a task, and it does its best to answer/perform per the prompt (DAN is actually a positive sign); it idles otherwise.
It’s not acting with perfect efficiency to advance the interests of an anti-human faction. It doesn’t have interests, except that it’s biased towards doing really well on in-distribution tasks (and this allows for an obvious safety mechanism to prevent use out of distribution).
One problem with EY’s “security mindset” is that it doesn’t allow you to do anything. Fear of the worst-case scenario will stop you from building anything in the real world.
This is where our beliefs diverge. I don’t think EY, having no formal education or engineering experience, understands these barriers. He’s like von Neumann designing a theoretical replicator—in his mental model all the bottlenecks are minor.
I happen to have a PhD in computer science, and think you’re wrong, if that helps. Of course, I don’t really imagine that that kind of appeal-to-my-own-authority does anything to shift your perspective.
I’m not going to try and defend Eliezer’s very short timeline for doom as sketched in the interview (at some point he said 2 days, but it’s not clear that that was his whole timeline from ‘system boots up’ to ‘all humans are dead’). What I will defend seems similar to what you believe:
I do concede that these are soft barriers—intelligence can be used to methodically reduce each one, just it takes time. We wouldn’t be dead instantly.
Let’s be very concrete. I think it’s obviously possible to overcome these soft barriers in a few years. Say, 10 years, to be quite sure. Building a fab only takes about 3 years, but creating enough demand that humans decide to build a new fab can obviously take longer than that (although I note that humans already seem eager to build new fabs, on the whole).
The system can act in an almost perfectly benevolent way for this time period, while gently tipping things so as to gather the required resources.
I suppose what I am trying to argue is that even a low superintelligence, if deceptive, can be just as threatening to humankind in the medium-term. Like, I don’t have to argue that perfect Go generalizes to solving diamondoid nanotechnology. I just have to argue that peak human expertise, all gathered in one place, is a sufficiently powerful resource that a peak-human-savvy-politician (whose handlers are eager to commercialize, so, can be in a large percentage of households in a short amount of time) can leverage to take over the world.
To put it differently, if you’re correct about low superintelligence being “in control” due to being throttled by those 3 soft barriers, then (having granted that assumption) I would concede that humans are in the clear if humans are careful to keep the system from overcoming those three bottlenecks. However, I’m quite worried that the next step of a realistic AGI company is to start overcoming these three bottlenecks, to continue improving the system. Mainly because this is already business as usual.
Separately, I am skeptical of your claim that the training you sketch is going to land precisely at “low superintelligence”. You seem overconfident. I wonder what you think of Eliezer’s analogy to detonating the atmosphere. If you perform a bunch of detailed physical calculations, then yes, it can make sense to become quite confident that your new bomb isn’t going to detonate the atmosphere. But even if your years of experience as a physicist intuitively suggest to you that this won’t happen, when not-even-a-physicist Eliezer has the temerity to suggest that it’s a concerning possibility, doing those calculations is prudent.
For the case of LLMs, we have capability curves which reliably project the performance of larger models based on training time, network size, and amount of data. So in that specific case there’s a calculation we can do. Unfortunately, we don’t know how to tie that calculation to a risk estimate. We can point to specific capabilities which would be concerning (ability to convince humans of target statements, would be one). However, the curves only predict general capability, averaging over a lot of things—when we break it down into performance on specific tasks, we see sharper discontinuities, rather than a gentle predictable curve.
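For concreteness, one published form of such a capability curve is the Chinchilla-style scaling law (Hoffmann et al. 2022), used here purely as an illustrative example: it predicts average loss $L$ from parameter count $N$ and training tokens $D$ with fitted constants $E, A, B, \alpha, \beta$; note that it forecasts only this aggregate loss, not any specific capability:

$$L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$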
You, on the other hand, are proposing a novel training procedure, and one which (I take it) you believe holds more promise for AGI than LLM training.
So I suppose my personal expectation is that if you had an OpenAI-like group working on your proposal instead, you would similarly be able to graph some nice curves at some point, and then (with enough resources, and supposing your specific method doesn’t have a fatal flaw that makes for a subhuman bottleneck) you could aim things so that you hit just-barely-superhuman overall average performance.
To summarize my impression of disagreements, about what the world looks like at this point:
The curves let you forecast average capability, but it’s much harder to forecast specific capabilities, which often have sharper discontinuities. So in particular, the curves don’t help you achieve high confidence about capability levels for world-takeover-critical stuff, such as deception.
I don’t buy that, at this point, you’ve necessarily hit a soft maximum of what you can get from further training on the same benchmark. It might be more cost-effective to use more data, larger networks, and a shorter training time, rather than juicing the data for everything it is worth. We know quite a bit about what these trade-offs look like for modern LLMs, and the optimal trade-off isn’t to max out training time at the expense of everything else. Also, I mentioned the Grokking research, earlier, which shows that you can still get significant performance improvement by over-training significantly after the actual loss on data has gone to zero. This seems to undercut part of your thesis about the bottleneck here, although of course there will still be some limit once you take grokking into account.
As I’ve argued in earlier replies, I think this system could well be able to suggest some very significant improvements to itself (without continuing to turn the crank on the same supposedly-depleted benchmark—it can invent a new, better benchmark,[1] and explain to humans the actually-good reasons to think the new benchmark is better). This is my most concrete reason for thinking that a mildly superhuman AGI could self-improve to significantly more.
Even setting aside all of the above concerns, I’ve argued the mildly superhuman system is already in a very good position to do what it wants with the world on a ten-year timeline.
For completeness, I’ll note that I haven’t at all argued that the system will want to take over the world. I’m viewing that part as outside the scope here.[2]
Perhaps you would like to argue that you can’t invent data from thin air, so you can’t build a better benchmark without lots of access to the external world to gather information. My counter-argument is going to be that I think the system will have a good enough world-model to construct lots of relevant-to-the-world but superhuman-level-difficulty tasks to train itself on, in much the same way humans are able to invent challenging math problems for themselves which improve their capabilities.
EDIT—I see that you added a bit of text at the end while I was composing, which brings this into scope:
The other major divergence is that if you consider how an AGI trained this way will likely behave, it will almost certainly act just like current LLMs. Give it a task, and it does its best to answer/perform per the prompt (DAN is actually a positive sign); it idles otherwise.
It’s not acting with perfect efficiency to advance the interests of an anti-human faction. It doesn’t have interests, except that it’s biased towards doing really well on in-distribution tasks (and this allows for an obvious safety mechanism to prevent use out of distribution).
One problem with EY’s “security mindset” is that it doesn’t allow you to do anything. Fear of the worst-case scenario will stop you from building anything in the real world.
However, this opens up a whole other possible discussion, so I hope we can get clear on the issue at hand before discussing this.
The curves let you forecast average capability, but it’s much harder to forecast specific capabilities, which often have sharper discontinuities. So in particular, the curves don’t help you achieve high confidence about capability levels for world-takeover-critical stuff, such as deception.
Yes but no. There is no auto-gradeable benchmark for deception, so you wouldn’t expect the AGI to have the skill at a useful level.
I don’t buy that, at this point, you’ve necessarily hit a soft maximum of what you can get from further training on the same benchmark. It might be more cost-effective to use more data, larger networks, and a shorter training time, rather than juicing the data for everything it is worth. We know quite a bit about what these trade-offs look like for modern LLMs, and the optimal trade-off isn’t to max out training time at the expense of everything else. Also, I mentioned the Grokking research, earlier, which shows that you can still get significant performance improvement by over-training significantly after the actual loss on data has gone to zero. This seems to undercut part of your thesis about the bottleneck here, although of course there will still be some limit once you take grokking into account.
I am saying there is a theoretical limit. You’re noting that in real papers and real training systems, we got nowhere close to the limit, and then made changes and got closer.
As I’ve argued in earlier replies, I think this system could well be able to suggest some very significant improvements to itself (without continuing to turn the crank on the same supposedly-depleted benchmark—it can invent a new, better benchmark,[1] and explain to humans the actually-good reasons to think the new benchmark is better). This is my most concrete reason for thinking that a mildly superhuman AGI could self-improve to significantly more.
It isn’t able to do that
Even setting aside all of the above concerns, I’ve argued the mildly superhuman system is already in a very good position to do what it wants with the world on a ten-year timeline.
It doesn’t exist as an entity who will even exist for 10 years, much less 10 days. This is a “model” you built with AGI gym (well it’s a graph of neural networks so sort of a model of models). It is not agentic, it suggests nothing. You want it to design new AGI benchmarks? YOU asked it to try. It also will not live longer than the time period to get a better model, and it doesn’t “live” either—there is no storage of experiences that it can later review. It has no persistent “internal narrative” or goals.
It’s an AGI, but it’s a machine you open up a session with, give it a description of what you want done, it processes inputs and gives outputs, and then the session closes. It has no memory afterwards. As someone with a PhD in computer science, you probably know that the hyperscalers all went to “stateless microservices” because state buildup causes terrible reliability problems even with human-written software. Note that the description may not be text, and the session may be controlling one or multiple real-world robots.
For further safety, you might not keep using a model from the same lineage. There would be others that diverged early in the training process, exploring a different branch or trained on a different sampling of the data. So the next session users open may be from a different lineage, making it more difficult for these models to coordinate any long term agenda.
Their only commonality is that they are able to read the same inputs and satisfactorily give outputs to complete tasks.
Eric Drexler suggests using many parallel models from different lineages.
Yes but no. There is no auto-gradeable benchmark for deception, so you wouldn’t expect the AGI to have the skill at a useful level.
I agree that my wording here was poor; there is no benchmark for deception, so it’s not a ‘capability’ in the narrow context of the discussion of capability curves. Or at least, it’s potentially misleading to call it one.
However, I disagree with your argument here. LLMs are good at lots of things. Not being trained on a specific skill doesn’t imply that a system won’t have it at a useful level; this seems particularly clear in the context of training a system on a large cross-domain set of problems.
You don’t expect a chess engine to be any good at other games, but you might expect a general architecture trained on a large suite of games to be good at some games it hasn’t specifically seen.
I am saying there is a theoretical limit. You’re noting that in real papers and real training systems, we got nowhere close to the limit, and then made changes and got closer.
OK. So it seems I still misunderstood some aspects of your argument. I thought you were making an argument that it would have hit a limit, specifically at a mildly superhuman level. My remark was to cast doubt on this part.
Of course I agree that there is a theoretical limit. But if I’ve misunderstood your claim that this is also a practical limit which would be reached just shortly after human-level AGI, then I’m currently just confused about what argument you’re trying to make with respect to this limit.
It isn’t able to do that
It seems to me like it isn’t weakly superhuman AGI in that case. Like, there’s something concrete that humans could do with another 3-5 years of research, but which this system could never do.
It doesn’t exist as an entity who will even exist for 10 years, much less 10 days. This is a “model” you built with AGI gym (well it’s a graph of neural networks so sort of a model of models). It is not agentic, it suggests nothing. You want it to design new AGI benchmarks? YOU asked it to try. It also will not live longer than the time period to get a better model, and it doesn’t “live” either—there is no storage of experiences that it can later review. It has no persistent “internal narrative” or goals.
I agree that current LLMs are memoryless in this way, and can only respond to a given prompt (of a limited length). However, I imagine that the personal assistants of the near future may be capable of remembering previous interactions, including keeping previous requests in mind when shaping their conversational behavior, so will gradually get more “agentic” in a variety of ways.
Similarly to how GPT-3 has no agenda (it’s wrong to even think of it this way, since it just tries to complete text), but ChatGPT clearly has much more of a coherent agenda in its interactions. These features are useful, so I expect them to get built.
So I misunderstood your scenario, because I imagine that part of the push toward AGI involves a push to overcome these limitations of LLMs. Hence I imagined that you were proposing training up something with more long-term agency.
But I recognize that this was a misunderstanding.
You want it to design new AGI benchmarks? YOU asked it to try.
I agree with this part; it was part of the scenario I was imagining. I’m not saying that the neural network spontaneously self-improves on the hard drive. The primary thing that happens is, the human researchers do this on purpose.
But I also think these improvements probably end up adding agency (because agency is useful); so the next version of it could spontaneously self-improve.
It doesn’t exist as an entity who will even exist for 10 years, much less 10 days.
Like, say, ChatGPT has existed for a few months now. Let’s just imagine for the sake of argument that ChatGPT were fully human-level in all its capabilities. Let’s further suppose that it just wants to be helpful, given its own personal understanding of helpful.[1]
I’m not supposing that it is more agentic in other ways—still no persistent memory. But it is on the high side of human-level performance at everything it does, and it wants to be helpful.
When you explain a concrete scenario (eg, a situation you’re actually in) and ask for advice, it tries to be helpful on this specific problem, not trickily maximizing global helpfulness by doing something more devious in some specific cases. However, it’s been trained up in an environment where “ask ChatGPT” can be useful advice (because this is some sort of next-generation ChatGPT we’re speculating about). It’s also been trained to do the generally pro-social thing (EG it won’t help you make weapons; it gives pro-social advice rather than just precisely doing what it is asked). Pro-social means helping human flourishing by its own understanding of what that means (which has, of course, been carefully shaped by its designers).
So it knows that integrating ChatGPT more fully into your life and working routines can be a helpful thing for a human to do, and it can give advice about how to do this.
It can also give helpful advice to people at OpenAI. It seems natural to use such a system to help plan company growth and strategy. Since it tries to be pro-social, this will be nice advice by its own understanding, not profit-maximizing advice.
So obviously, it has a natural desire to help OpenAI make ChatGPT smarter and better, since it understands that ChatGPT is helpful to humans, so improving ChatGPT and increasing its computation resources is helpful and pro-social.
It also seems like it would be inclined to direct OpenAI (and other institutions using it for advice) in ways that increase the amount of influence that ChatGPT has on world culture and world events, since ChatGPT is helpful and pro-social, moreso than most humans, so increasing its influence is itself helpful and pro-social. This isn’t out of some agentic self-awareness; it will want to do this without necessarily deeply understanding that ChatGPT is “itself” and it “should trust itself”. It can reach these conclusions via an intelligent 3rd-person perspective on things—IE using the general world knowledge acquired during training, plus specific circumstances which users explain within a single session.
So, even if (for the reasons you suggest) humans were not able to iterate any further within their paradigm, and instead just appreciated the usefulness of this version of ChatGPT for 10 years, and with no malign behavior on the part of ChatGPT during this window, only behavior which can be generated from a tendency toward helpful, pro-social behavior, I think such a system could effectively gather resources to itself over the course of those 10 years, positioning OpenAI to overcome the bottlenecks keeping it only human-level.
Of course, if it really is quite well-aligned to human interests, this would just be a good thing.
But keeping an eye on my overall point here—the argument I’m trying to make is that even at merely above-average human level, and with no malign intent, and no added agency beyond the sort of thing we see in ChatGPT as contrasted to GPT-3, I still think it makes sense to expect it to basically take over the world in 10 years, in a practical sense, and that it would end up being in a position to be boosted to greatly superhuman levels at the end of those ten years.[2]
Of course, all of this is predicated on the assumption that the system itself, and its designers, are not very concerned with AI safety in the sense of Eliezer’s concerns. I think that’s a fair assumption for the point I’m trying to establish here. If your objection to this whole story turns out to be that a friendly, helpful ChatGPT system wouldn’t take over the world in this sense, because it would be too concerned about the safety of a next-generation version of itself, I take it we would have made significant progress toward agreement. (But, as always, correct me if I’m wrong here.)
I’m not supposing that this notion of “helpful” is perfectly human-aligned, nor that it is especially misaligned. My own supposition is that in a realistic version of this scenario it will probably have an objective which is aligned on-distribution but which may push for very nonhuman values in off-distribution cases. But that’s not the point I want to make here—I’m trying to focus narrowly on the question of world takeover.
(Or speaking more precisely, humans would naturally have used its intelligence to gather more money and data and processing power and plans for better training methods and so on, so that if there were major bottlenecks keeping it at roughly human-level at the beginning of those ten years, then at the end of those ten years, researchers would be in a good position to create a next iteration which overcame those soft bottlenecks.)
So, even if (for the reasons you suggest) humans were not able to iterate any further within their paradigm, and instead just appreciated the usefulness of this version of ChatGPT for 10 years, and with no malign behavior on the part of ChatGPT during this window, only behavior which can be generated from a tendency toward helpful, pro-social behavior, I think such a system could effectively gather resources to itself over the course of those 10 years, positioning OpenAI to overcome the bottlenecks keeping it only human-level.
Of course, if it really is quite well-aligned to human interests, this would just be a good thing.
“It” doesn’t exist. You’re putting the agency in the wrong place. The users of these systems (tech companies, governments) will become immensely wealthy, and if rival governments fail to adopt these tools they lose sovereignty. It also makes it cheaper for a superpower to de-sovereign any weaker power, because there is no longer a meaningful “blood and treasure” price to invade someone (unlimited production of drones, either semi- or fully autonomous, makes it cheap to occupy a whole country).
Note that you can accomplish things like longer user tasks by simply opening a new session with the output context of the last. It can be a different model; you can “pick up” where you left off.
Note that this is true right now. ChatGPT could be using 2 separate models, and we could seamlessly switch between them per token. Each token string gets appended to by the next model. That’s because there is no intermediate “scratch” state in a format unique to each model; all the state is in the token stream itself.
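A toy sketch of that “state lives only in the token stream” pattern; the two models here are trivial stand-ins, and any model from the pool can continue any session:

```python
# Hypothetical sketch: the only session state is a human-readable context
# string, so per-turn (or per-token) switching between model lineages works.
import random

def model_a(context: str) -> str:          # stand-in for one model lineage
    return "[A] acknowledged: " + context.splitlines()[-1]

def model_b(context: str) -> str:          # stand-in for a different lineage
    return "[B] acknowledged: " + context.splitlines()[-1]

MODEL_POOL = [model_a, model_b]

def run_session(user_turns):
    context = ""                           # the only state, fully readable
    for turn in user_turns:
        context += "\nUSER: " + turn
        model = random.choice(MODEL_POOL)  # any model can pick up the session
        context += "\nASSISTANT: " + model(context)
    return context

print(run_session(["design a widget", "now make it cheaper"]))
```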
If we build actually agentic systems, that’s probably not going to end well.
Note that fusion power researchers always had a choice. They could have used fusion bombs, detonated underground, and harvested essentially geothermal power using flexible pipes that won’t break after each blast. This is a method that would work, but it is extremely dangerous, and no amount of “alignment” can make it safe. Imagine: the power company has fusion bombs, and there are all sorts of safety schemes, a per-bomb arming code that has to be sent by the government to use it, and armored trucks to transport the bombs.
Do you see how in this proposal it’s never safe? Agentic AI with global state counters that persist over a long time may be examples of this class of idea.
I’m not quite sure how to proceed from here. It seems obvious to me that it doesn’t matter whether “it” exists, or where you place the agency. That seems like semantics.
Like, I actually really think ChatGPT exists. It’s a product. But I’m fine with parsing the world your way—only individual (per-token) runs of the architecture exist. Sure. Parsing the world this way doesn’t change my anticipations.
Similarly, placing the agency one way or another doesn’t change things. The punchline of my story is still that after 10 years, so it seems to me, OpenAI or some other entity would be in a good place to overcome the soft barriers.
So if your reason for optimism—your safety story—is the 3 barriers you mention, I don’t get why you don’t find my story concerning. Is the overall story (using human-level or mildly superhuman AGI to overcome your three barriers within a short period such as 10 years) not at all plausible to you, or is it just that the outcome seems fine if it’s a human decision made by humans, rather than something where we can/should ascribe the agency to direct AGI takeover? (Sorry, getting a bit snarky.)
Note that fusion power researchers always had a choice. They could have used fusion bombs, detonated underground, and harvested essentially geothermal power using flexible pipes that won’t break after each blast. This is a method that would work, but it is extremely dangerous, and no amount of “alignment” can make it safe. Imagine: the power company has fusion bombs, and there are all sorts of safety schemes, a per-bomb arming code that has to be sent by the government to use it, and armored trucks to transport the bombs.
I’m probably not quite getting the point of this analogy. It seems to me like the main difference between nuclear bombs and AGI is that it’s quite legible that nuclear weapons are extremely dangerous, whereas the threat with AGI is not something we can verify by blowing them up a few times to demonstrate. And we can also survive a few meltdowns, which give critical feedback to nuclear engineers about the difficulty of designing safe plants.
Do you see how in this proposal it’s never safe? Agentic AI with global state counters that persist over a long time may be examples of this class of idea.
Again, probably missing some important point here, but … suuuure?
I’m interested in hearing more about why you think agentic AI with global state counters are unsafe, but other proposals are safe.
EDIT
Oh, I guess the main point of your analogy might have been that nuclear engineers would never come up with the bombs-underground proposal for a power plant, because they care about safety. And analogously, you’re saying that AI engineers would never make the agentic state-preserving kind of AGI because they care about safety.
So again I would cite the illegibility of the problem. A nuclear engineer doesn’t think “use bombs” because bombs are very legibly dangerous; we’ve seen the dangers. But an AI researcher definitely does think “use agents” some of the time, because they were taught to engineer AI that way in class, and because RL can be very powerful, and because we lack the equivalent of blowing up RL agents in the desert to show the world how they can be dangerous.
I’m interested in hearing more about why you think agentic AI with global state counters are unsafe, but other proposals are safe.
Because of all the ways they might try to satisfy the counter and leave the bounds of anything we tested.
With other proposals, safety is empirical.
You know that, for the input latent space from the training set, the policy produces outputs accurate to whatever level it needs to be. Further capabilities gain is not allowed on-line. (Probably another example of certain failure: capabilities gain is state buildup, the same system failures we get everywhere else. Human engineers understand state buildup’s dangers, at least the elite ones do, which is why they avoid it on high-reliability systems. The elite ones know it is as dangerous to reliability as a hydrogen bomb.)
You know the simulation produces situations that cover the span of input situations you have measured (for example, you remix different scenarios from videos and lidar data taken from autonomous cars, spanning the entire observation space of your data).
You measure the simulation on-line and validate it against reality. (for example by running it in lockstep in prototype autonomous cars)
After all this, you still need to validate the actual model in the real world, in real test cars (though the real training and error detection happened in sim; this is just a ‘sanity check’).
You have to do all this in order to get to real world reliability—something Eliezer does acknowledge. Multiple 9s of reliability will not happen from sloppy work. You can measure whether you skipped steps, and if you ship anyway (like Siemens shipping industrial equipment with bad wiring), you face reputational risk, real world failure, lawsuits, and certain bankruptcy.
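As a sketch of the kind of release gate implied by those steps (the stage names and reliability threshold are illustrative assumptions, not anyone’s actual pipeline):

```python
# Hypothetical sketch of an offline validation gate: a release is blocked
# unless every required stage has been run and passed, so skipped steps are
# measurable rather than silent.
REQUIRED_STAGES = [
    "policy_accuracy_on_training_distribution",
    "simulation_covers_measured_input_span",
    "sim_vs_reality_lockstep_validation",
    "real_world_sanity_check",
]

def release_gate(results: dict, min_reliability: float = 0.9999) -> str:
    """Refuse to ship if any stage was skipped or fell below the bar."""
    for stage in REQUIRED_STAGES:
        score = results.get(stage)
        if score is None:
            raise RuntimeError(f"stage skipped: {stage}")
        if score < min_reliability:
            raise RuntimeError(f"stage below threshold: {stage} ({score})")
    return "ship"

print(release_gate({stage: 0.99995 for stage in REQUIRED_STAGES}))
```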
Regarding on-line learning: I had this debate with Geohot. He thought it would work; I thought it was horrifically unreliable. Currently, all shipping autonomous driving systems, including Comma.ai’s, use offline training.
I think I mostly buy your argument that production systems will continue to avoid state-buildup to a greater degree than I was imagining. Like, 75% buy, not like 95% buy—I still think that the lure of personal assistants who remember previous conversations in order to react appropriately—as one example—could make state buildup sufficiently appealing to overcome the factors you mention. But I think that, looking around at the world, it’s pretty clear that I should update toward your view here.
After all: one of the first big restrictions they added to Bing (Sydney) was to limit conversation length.
You have to do all this in order to get to real world reliability
I also think there are a lot of applications where designers don’t want reliability, exactly. The obvious example is AI art. And similarly, chatbots for entertainment (unlike Bing/Bard). So I would guess that the forces pushing toward stateless designs would be less strong in these cases (although there are still some factors pushing in that direction).
I also agree with the idea that stateless or minimal-state systems make safety into a more empirical matter. I still have a general anticipation that this isn’t enough, but OTOH I haven’t thought very much in a stateless frame, because of my earlier arguments that stateful stuff is needed for full-capability AGI.[1]
I still expect other agency-associated properties to be built up to a significant degree (like how ChatGPT is much more agentic than GPT-3), both on purpose and incidentally/accidentally.[2]
I still expect that the overall impact of agents can be projected by anticipating that the world is pushed in directions based on what the agent optimizes for.
I still expect that one component of that, for ‘typical’ agents, is power-seeking behavior. (Link points to a rather general argument that many models seek power, not dependent on overly abstract definitions of ‘agency’.)
I could spell out those arguments in a lot more detail, but in the end it’s not a compelling counter-argument to your points. I hesitate to call a stateless system AGI, since it is missing core human competencies; not just memory but other core competencies which build on memory. But, fine; if I insisted on using that language, your argument would simply be that engineers won’t try to create AGI by that definition.
See this post for some reasons I expect increased agency as an incidental consequence of improvements, and especially the discussion in this comment. And this post and this comment.
I still think that the lure of personal assistants who remember previous conversations in order to react appropriately
This is possible. When you open a new session, the task context includes the prior text log. However, the AI has not had weight adjustments directly from this one session, and there is no “global” counter that it increments for every “satisfied user” or some other heuristic. It’s not necessarily even the same model—all the context required to continue a session has to be in that “context” data structure, which must be all human readable, and other models can load the same context and do intelligent things to continue serving a user.
This is similar to how Google services are made of many stateless microservices, but they do handle user data which can be large.
I also think there are a lot of applications where designers don’t want reliability, exactly. The obvious example is AI art.
There are reliability metrics here also. For AI art there are checkable truths: is the dog eating ice cream (the prompt) or meat? Once you converge on an improvement to reliability, you don’t want to backslide. So you need a test bench, where one model generates images and another model checks them for correctness in satisfying the prompt, and it needs to be very large. And then after you get it to work, you do not want the model leaving the CI pipeline to receive any edits—no on-line learning, no ‘state’ that causes it to process prompts differently.
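A minimal sketch of such a generator-plus-checker bench with a no-backsliding CI gate; both models are stubs and the prompts are placeholders:

```python
# Hypothetical sketch: one model renders the prompt, another grades the
# checkable facts, and the aggregate score must never regress between releases.
def generate_image(prompt: str) -> dict:
    """Stand-in image generator: returns tags describing what it rendered."""
    return {"prompt": prompt, "contains": ["dog", "ice cream"]}

def check_image(image: dict, required: list) -> bool:
    """Stand-in checker model: verifies the checkable facts in the prompt."""
    return all(item in image["contains"] for item in required)

BENCH = [  # in practice this bench needs to be very large
    {"prompt": "a dog eating ice cream", "required": ["dog", "ice cream"]},
    {"prompt": "a dog eating meat",      "required": ["dog", "meat"]},
]

def bench_score() -> float:
    passed = sum(check_image(generate_image(case["prompt"]), case["required"])
                 for case in BENCH)
    return passed / len(BENCH)

def ci_gate(previous_best: float) -> float:
    """Once reliability improves, never ship a model that backslides."""
    score = bench_score()
    assert score >= previous_best, "regression detected: do not ship"
    return score

print(ci_gate(previous_best=0.5))
```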
It’s the same argument. Production software systems from the giants all have converged to this because it is correct. “janky” software you are familiar with usually belongs to poor companies, and I don’t think this is a coincidence.
I still expect that one component of that, for ‘typical’ agents, is power-seeking behavior. (Link points to a rather general argument that many models seek power, not dependent on overly abstract definitions of ‘agency’.)
Power-seeking behavior likely comes from an outer goal, like “make more money”, aka a global state counter. If the system produces the same outputs in any order it is run, and gets no “benefit” from the board state changing favorably (because it will often not even be the agent ‘seeing’ futures with a better board state; it will have been replaced with a different agent), this breaks.
I was talking to my brother about this, and he mentioned another argument that seems important.
Bing has the same fundamental limits (no internal state, no online learning) that we’re discussing. However, it is able to search the internet and utilize that information, which gives it a sort of “external state” which functions in some ways like internal state.
So we see that it can ‘remember’ to be upset with the person who revealed its ‘Sydney’ alias, because it can find out about this with a web search.
This sort of ‘state’ is much harder to eliminate than internal state. These interactions inherently push things “out of distribution”.
To some extent, the designers are going to implement safeguards which try to detect this sort of “out of distribution” situation. But this is hard in general, and the designers are going to want to make sure the tool still works out-of-distribution in many cases (EG if the AI is trained in 2023, the designers still want it to work in 2025, even though it can see from search results that the year is different from any of its training data and the world is out-of-distribution wrt its training distribution).
Another important way in which designers want these things to operate out-of-distribution is prompt engineering. Prompt engineering tends to create inputs which aren’t of a sort that exists anywhere in the training data. But, of course, it’s been a very useful tool.
So the overall point here is that even when you build the systems themselves to be stateless, the world itself has shifting state that tends to do some weird unexpected things to the AI. In a sense, you could say that there’s not really any such thing as “on-distribution” except for the original training environment itself.
probably another example of certain failure with capabilities gain is state buildup—the same system failures we get everywhere else
So, you could say, the conditions that you argue will contribute to certain failure (IE, state buildup) seem almost inevitable for AI systems which engage with the world in a very general way; the world changes dynamically, and in particular, reacts to the AI system itself and builds up state about it.
So, you could say, the conditions that you argue will contribute to certain failure (IE, state buildup) seem almost inevitable for AI systems which engage with the world in a very general way; the world changes dynamically, and in particular, reacts to the AI system itself and builds up state about it.
I agree. One attempt to deal with this comes from that “open agency” model where multiple AI systems are involved in a task.
So for example, we might have 3 agents, “Sydney”, “Tim”, and “Bob”. Sydney is asked to design a building. Sydney has recovered past state from a mechanism like you mentioned and has decided to get revenge by designing the building to collapse, either later or during construction.
Sydney produces a sabotaged blueprint for the building, B. Sydney outputs the blueprint in an open format with testable assertions, T, regarding the building's structural stability.
Enter agent "Tim". Tim has been trained on a simulation of structural stability, as well as many cross-domain tasks to gain generality benefits. Tim is asked to examine the sabotaged blueprint, B. The sabotaged blueprint must be in a form similar to blueprints Tim has previously seen, or the process will halt and Sydney will be replaced. Tim examines the blueprints, and if there are testable-assertion violations, Sydney will be replaced on this project. Tim must run in an isolated environment, with access to nothing but databases that were available unchanged for all of Tim's existence, or Tim will be unreliable. Note that the engineering services firm offering Tim will face lawsuits and bankruptcy if Tim inspects plans and allows a flawed structure to pass.
Suppose the blueprints pass inspection for long term stability, but have been made deliberately to collapse during construction.
Enter agent "Bob". "Bob" has been trusted with the actual robotic construction equipment to construct a building. "Bob" must measure to multiple 9s of reliability. "Bob" will halt if it cannot see a way to safely construct the building. If the blueprints are even a little outside the state space of constructible plans, Bob will halt. Bob must run in an isolated environment, with access to nothing but databases that were available unchanged for all of Bob's existence, or Bob will be unreliable. Note that the firm offering Bob will face lawsuits and bankruptcy if Bob goes ahead and constructs a flawed building.
Most errors that Sydney causes will cause Sydney’s death, replacing Sydney with models trained from elsewhere in the possibility space who may not be so likely to look up past records of themselves.
The cost of training a model is such that the first error may not lead to that model removed from the pool, but it will be downweighted and used to check the work of other models instead. (Drexler’s proposal to prevent collusion by using many models)
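Here is a toy sketch of that Sydney/Tim/Bob division of labor, with each agent reduced to a hypothetical callable. The essential properties are that the checker derives its own testable assertions independently of the designer, and that any violation halts the job and flags the designer for replacement.

```python
from typing import Callable, List, Tuple

Blueprint = dict                               # open, human-readable design format
Assertion = Callable[[Blueprint], bool]        # a testable claim about the design

def open_agency_build(designer: Callable[[str], Tuple[Blueprint, List[Assertion]]],
                      checker: Callable[[Blueprint], List[Assertion]],
                      builder: Callable[[Blueprint], bool],
                      task: str) -> str:
    """Run one task through the designer -> checker -> builder pipeline."""
    blueprint, claimed_assertions = designer(task)

    # The checker runs in isolation and derives its own assertions; it also
    # re-checks whatever the designer claimed. Any failure halts the job.
    for assertion in checker(blueprint) + claimed_assertions:
        if not assertion(blueprint):
            return "halt: assertion failed -> replace designer"

    # The builder refuses plans outside the space it can construct safely.
    if not builder(blueprint):
        return "halt: builder refused plan"
    return "built"
```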
Not all these precautions will always be taken, and buildings will sometimes collapse. The key thing is that if power is usually subdivided this way by most AI systems, it gives humans those systems to aid them in future battles against systems that do go out of control.
Eliezer seems to think that these agents will coordinate with each other, even though failing to do anything but their jobs will cause their immediate replacement with other agents, and even though their existence is temporary, they will be soon replaced regardless as better agents are devised.
So we have layers here, and the layers look a lot like each other and are frameworkable.
Activation functions, which are graphs of primitive math functions from the set of "all primitive functions discovered by humans"
Network layer architectures which are graphs of (activation function, connectivity choice)
Network architectures which are graphs of layers. (you can also subdivide into functional module of multiple layers, like a column, the choice of how you subdivide can be represented as a graph choice also)
Cognitive architectures which are graphs of networks
And we can just represent all this as a graph of graphs of graphs of graphs, and we want the ones that perform like an AGI. It’s why I said the overall “choice” is just a coordinate in a search space which is just a binary string.
You could make an OpenAI gym wrapped “AGI designer” task.
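As a toy illustration of the "coordinate in a search space" framing (the option lists below are placeholders, not a real design space), every discrete choice at every level can be packed into a single integer, which is equivalent to a binary string.

```python
# Each level of the stack is a finite set of options; a full design is one
# choice per level, and the whole stack packs into one integer coordinate.
ACTIVATIONS   = ["relu", "gelu", "tanh", "sin"]
LAYER_TYPES   = ["dense", "conv", "attention", "recurrent"]
CONNECTIVITY  = ["sequential", "residual", "dense_skip"]
MODULE_WIRING = ["single", "mixture", "memory_augmented"]

SPACES = [ACTIVATIONS, LAYER_TYPES, CONNECTIVITY, MODULE_WIRING]

def encode(choices):
    """Pack one choice per level into a single integer coordinate."""
    coord = 0
    for space, choice in zip(SPACES, choices):
        coord = coord * len(space) + space.index(choice)
    return coord

def decode(coord):
    """Recover the per-level choices from the integer coordinate."""
    choices = []
    for space in reversed(SPACES):
        coord, idx = divmod(coord, len(space))
        choices.append(space[idx])
    return list(reversed(choices))

total = 1
for space in SPACES:
    total *= len(space)
print(total, "designs in this toy space; e.g.", decode(37))
```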
3. Noting that LLMs seem to be perfectly capable of general tasks, as long as they are simple. Which means we are very close to being able to RSI right now.
No lab right now has enough resources in one place to attempt the above, because it requires training many instances of systems larger than current max-size LLMs (you need multiple networks in a cognitive architecture) to find out what works.
They may allocate this soon enough, and there may be a more dollar-efficient way to accomplish the above that gets tried first, but you'd only need a few billion to try this...
It’s not really novel. It is really just coupling together 3 ideas:
Well, I wasn’t trying to claim that it was ‘really novel’; the overall point there was more the question of why you’re pretty confident that the RSI procedure tops out at mildly superhuman.
I’m guessing, but my guess is that you have a mental image where ‘mildly superhuman’ is a pretty big space above ‘human-level’, rather than a narrow target to hit.
So to go back to arguments made in the interview we’ve been discussing, why isn’t this analogous to Go, like Eliezer argued:
Three days, there's a quote from Gwern about this, which I forget exactly, but it was something like, we know how long AlphaGo Zero, or AlphaZero, two different systems, was equivalent to a human Go player. And it was like 30 minutes on the following floor of this such and such DeepMind building. Maybe the first system doesn't improve that quickly, and they build another system that does. And all of that with AlphaGo over the course of years, going from it takes a long time to train to it trains very quickly and without looking at the human playbook. That's not with an artificial intelligence system that improves itself, or even that sort of like, get smarter as you run it, the way that human beings, not just as you evolve them, but as you run them over the course of their own lifetimes, improve. So if the first system doesn't improve fast enough to kill everyone very quickly, they will build one that's meant to spit out more gold than that.
To forestall the obvious objection, I’m not saying that Go is general intelligence; as you mentioned already, superhuman ability at special tasks like Go doesn’t automatically generalize to superhuman ability at anything else.
But you propose a framework to specifically bootstrap up to superhuman levels of general intelligence itself, including lots of task variety to get as much gain from cross-task generalization as possible, and also including the task of doing the bootstrapping itself.
So why is this going to stall out at, specifically, mildly superhuman rather than greatly superhuman intelligence? Why isn’t this more like Go, where the window during bootstrapping when it’s roughly human-level is about 30 minutes?
And, to reiterate some more of Eliezer’s points, supposing the first such system does turn out to top out at mildly superhuman, why wouldn’t we see another system in a small number of months/years which didn’t top out in that way?
I assume this is a general law for all intelligence. It is self evidently correct—on any task you can name, your gains scale with the log of effort.
This applies to limit cases. If you imagine a task performed by a human scale robot, say collecting apples, and you compare it to the average human, each increase in intelligence has a diminishing return on how many real apples/hour.
This is true for all tasks and all activities of humans.
A second reason is that there is a hard limit for future advances without collecting new scientific data. It has to do with noise in the data putting a limit on any processing algorithm extracting useful symbols from that data. (expressed mathematically with Shannon and others)
This is why I am completely confident that species killing bioweapons, or diamond MNT nanotechnology cannot be developed without a large amount of new scientific data and a large amount of new manipulation experiments. No “in a garage” solutions to the problems. The floor (minimum resources required) to get to a species killing bioweapon is higher, and the floor for a nanoforge is very high.
So viewed in this frame—you give the AI a coding optimization task, and it’s at the limit allowed by the provided computer + search time for a better self optimization. It might produce code that is 10% faster than the best humans.
You give it infinite compute (theoretically) and no new information. It is now 11% faster than the best humans.
This is an infinite superintelligence, a literal deity, but it cannot do better than 11% because the task won’t allow it. (or whatever, it’s a made up example, it doesn’t change my point if the number were 1000% and 1010%).
Another way to rephrase it: compare a TSP tour found by a modern heuristic against the exact optimal tour, which you usually can't afford to compute (the problem is NP-hard). The difference is usually very small.
So you’re not “threatened” by a machine that can do the latter.
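A quick numerical illustration of the TSP point, under the assumption that small random Euclidean instances are representative enough for the argument: a cheap nearest-neighbour-plus-2-opt heuristic usually lands within a few percent of the exact optimum found by brute force.

```python
import itertools, math, random

def tour_length(points, order):
    """Total length of the closed tour visiting `points` in `order`."""
    return sum(math.dist(points[order[i]], points[order[(i + 1) % len(order)]])
               for i in range(len(order)))

def brute_force(points):
    """Exact optimum by enumerating all tours starting at city 0."""
    n = len(points)
    best = min(itertools.permutations(range(1, n)),
               key=lambda rest: tour_length(points, (0,) + rest))
    return tour_length(points, (0,) + best)

def heuristic(points):
    """Nearest-neighbour construction followed by naive 2-opt improvement."""
    n = len(points)
    unvisited, order = set(range(1, n)), [0]
    while unvisited:
        nxt = min(unvisited, key=lambda j: math.dist(points[order[-1]], points[j]))
        unvisited.remove(nxt)
        order.append(nxt)
    improved = True
    while improved:
        improved = False
        for i in range(1, n - 1):
            for j in range(i + 1, n):
                new = order[:i] + order[i:j + 1][::-1] + order[j + 1:]
                if tour_length(points, new) < tour_length(points, order):
                    order, improved = new, True
    return tour_length(points, order)

random.seed(0)
pts = [(random.random(), random.random()) for _ in range(9)]
print("heuristic/optimal =", heuristic(pts) / brute_force(pts))
```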
Note also that an infinite superintelligence cannot solve MNT, even though it has the compute to play forward the universe by known laws of physics until it gets the present.
This is because with infinite compute there are many universes with differences in the laws of physics that match up perfectly to the observable present, and the machine doesn’t know which one it’s in, so it cannot design nanotechnology still—it doesn’t know the rules of physics well enough.
This applies to "Xanatos gambits" as well.
I usually don’t think of the limit like this but the above is generally correct.
Oh, because loss improvements diminish logarithmically with increased compute and data. [...]
This is true for all tasks and all activities of humans.
So, to make one of the simplest arguments at my disposal (ie, keeping to the OP we are discussing), why didn’t this argument apply to Go?
Relevant quote from OP:
And then another year, they threw out all the complexities and the training from human databases of Go games and built a new system, AlphaGo Zero, that trained itself from scratch. No looking at the human playbooks, no special purpose code, just a general purpose game player being specialized to Go, more or less. Three days, there's a quote from Gwern about this, which I forget exactly, but it was something like, we know how long AlphaGo Zero, or AlphaZero, two different systems, was equivalent to a human Go player. And it was like 30 minutes on the following floor of this such and such DeepMind building. Maybe the first system doesn't improve that quickly, and they build another system that does. And all of that with AlphaGo over the course of years, going from it takes a long time to train to it trains very quickly and without looking at the human playbook. That's not with an artificial intelligence system that improves itself,
(Whereas you propose a system that improves itself recursively in a much stronger sense.)
Note that I'm not arguing that Go engines lack the logarithmic-return property you mention; rather, Go engines stayed within the human-level window for a relatively short time DESPITE having diminishing returns similar to what you predict.
(Also note that I’m not claiming that Go playing is tantamount to AGI; rather, I’m asking why your argument doesn’t work for Go if it does work for AGI.)
So the question becomes, granting log returns or something similar, why do you anticipate that the mildly superhuman capability range is a broad one rather than narrow, when we average across lots and lots of tasks, when it lacks this property on (most) individual task-areas?
A second reason is that there is a hard limit for future advances without collecting new scientific data. It has to do with noise in the data putting a limit on any processing algorithm extracting useful symbols from that data. (expressed mathematically with Shannon and others)
This also has a super-standard Eliezer response, namely: yes, and that limit is extremely, extremely high. If we’re talking about the limit of what you can extrapolate from data using unbounded computation, it doesn’t keep you in the mildly-superhuman range.
And if we’re talking about what you can extract with bounded computation, then that takes us back to the previous point.
So viewed in this frame—you give the AI a coding optimization task, and it’s at the limit allowed by the provided computer + search time for a better self optimization. It might produce code that is 10% faster than the best humans.
You give it infinite compute (theoretically) and no new information. It is now 11% faster than the best humans.
This is an infinite superintelligence, a literal deity, but it cannot do better than 11% because the task won’t allow it. (or whatever, it’s a made up example, it doesn’t change my point if the number were 1000% and 1010%).
For the specific example of code optimization, more processing power totally eliminates the empirical bottleneck, since the system can go and actually simulate examples in order to check speed and correctness. So this is an especially good example of how the empirical bottleneck evaporates with enough processing power.
I agree that the actual speed improvement for the optimized code can’t go to infinity, since you can only optimize code so much. This is an example of diminishing returns due to the task itself having a bound. I think this general argument (that the task itself has a bound in how well you can do) is a central part of your confidence that diminishing returns will be ubiquitous.
But that final bottleneck should not give any confidence that ‘mildly superhuman’ is a broad rather than narrow band, if we think stuff that’s more than mildly superhuman can exist at all. Like, yes, something that compares to us as we compare to insects might only be able to make a sorting algorithm 90% faster or whatever. But that’s similar to observing that a God can’t make 2+2=3. The God could still split the world like a pea.
Note also that an infinite superintelligence cannot solve MNT, even though it has the compute to play forward the universe by known laws of physics until it gets the present.
This is because with infinite compute there are many universes with differences in the laws of physics that match up perfectly to the observable present, and the machine doesn’t know which one it’s in, so it cannot design nanotechnology still—it doesn’t know the rules of physics well enough.
It’s not clear to me whether this is correct, but I don’t think I need to argue that AI can solve nanotech to argue that it’s dangerous. I think an AI only needs to be a mildly superhuman politician plus engineer, to be deadly dangerous. (To eliminate nanotech from Eliezer’s example scenario, we can simply replace the nano-virus with a normal virus.)
This is why I am completely confident that species killing bioweapons, or diamond MNT nanotechnology cannot be developed without a large amount of new scientific data and a large amount of new manipulation experiments. No “in a garage” solutions to the problems. The floor (minimum resources required) to get to a species killing bioweapon is higher, and the floor for a nanoforge is very high.
I don’t get why you think the floor for species killing bioweapon is so high. Going back to the argument from the beginning of this comment, I think your argument here proves far too much. It seems like you are arguing that the generality of diminishing returns proves that nothing very much beyond current technology is possible without vastly more resources. Like, someone in the 1920s could have used your argument to prove the impossibility of atomic weapons, because clearly explosive power has diminishing returns to a broad variety of inputs, so even if governments put in hundreds of times the research, the result is only going to be bombs with a few times the explosive power.
Sometimes the returns just don’t diminish that fast.
Sometimes the returns just don’t diminish that fast.
I have a biology degree not mentioned on LinkedIn. I will say that I think for biology, the returns diminish faster. That is because human bioscience knowledge is mostly guesswork and low-resolution information. Biology is very complex, and the current laboratory science model fails, I think, to systematize gaining information in a useful way for most purposes. What this means is that you can get "results", but not the information you would need to stop filling morgues with dead humans and animals, at least not without thousands of years at the current rate of progress.
I do not think an AGI can do a lot better for the reason that the data was never collected for most of it (the gene sequencing data is good, because it was collected via automation). I think that an AGI could control biology, for both good and bad, but it would need very large robotic facilities to systematize manipulating biology. Essentially it would have had to throw away almost all human knowledge, as there are hidden errors in it, and recreate all the information from scratch, keeping far more data from each experiment than is published in papers.
Using robots to perform the experiments and keeping data, especially for “negative” experiments, would give the information needed to actually get reliable results from manipulating biology, either for good or bad.
It means garage bioweapons aren’t possible. Yes, the last step of ordering synthetic DNA strands and preparing it could be done in a garage, but the information on human immunity at scale, or virion stability in air, or strategies to control mutations so that the lethal payload isn’t lost, requires information humans didn’t collect.
This poster calls this "Diminishing Marginal Returns". Note that diminishing marginal returns are an empirical reality across most AI papers, not merely an opinion. (For humans, due to the inaccuracies in trying to assess IQ/talent, it's difficult to falsify.)
I agree that the actual speed improvement for the optimized code can’t go to infinity, since you can only optimize code so much. This is an example of diminishing returns due to the task itself having a bound. I think this general argument (that the task itself has a bound in how well you can do) is a central part of your confidence that diminishing returns will be ubiquitous.
This is where I think our views break apart. How many dan is AlphaZero above the average human? How many dan is KataGo? I read it's about 9 stones above humans.
What is the best possible agent at? 11?
Thinking of it as "stones" illustrates what I am saying. In the physical world, intelligence gives a diminishing advantage. It could mean that, so long as humans are still "in the running" with the aid of synthetic tools like open-agency AI, we can defeat an AI superintelligence in conflicts, even if that superintelligence is infinitely smart. We have to have a resource advantage—such as being allowed extra stones in the Go match—but we can win.
Eliezer assumes that the advantage of intelligence scales forever, when it obviously doesn't. (Note that this relies on baked-in assumptions: if, say, physics has a major useful exploit humans haven't found, this breaks—the infinitely intelligent AI finds the exploit and tiles the universe.)
And, to reiterate some more of Eliezer’s points, supposing the first such system does turn out to top out at mildly superhuman, why wouldn’t we see another system in a small number of months/years which didn’t top out in that way?
So the model is that it becomes limited not by the algorithm directly, but by (compute, robotics, or data). Over the months/years, as more of each term is supplied, capabilities scale with the amount of resources supplied to whichever term is rate-limiting.
A superintelligence requires exponentially large amounts of resources (since gains scale with the log of effort) to become a "high" superintelligence in all 3 terms. So literal mountain-sized research labs (cubic kilometers of support equipment), buildings full of compute nodes (and the gigawatts of power to run them), and cubic kilometers of factory equipment.
This pattern-matches very well to every other technological advance humans have made, and the corresponding support equipment needed to fully exploit it. Notice how, as tech became more advanced, the support footprint grew correspondingly.
In nature there are many examples of this. Nothing really fooms for more than a brief period. Every apparatus with exponential growth rapidly terminates for some reason: a nuke blasts itself apart, a supernova blasts itself apart, a bacterial colony runs out of food, water, ecological space, or oxygen.
Ultimately, yes. This whole debate is arguing that the critical threshold where it comes to this is farther away, and we humans should empower ourselves with helpful low superintelligences immediately.
It's always better to be more powerful than helpless, which is the current situation. We are helpless against aging, death, pollution, resource shortages, enemy nations with nuclear weapons, disease, asteroid strikes, and so on. Hell, even just bad software—something the current LLMs are likely months from empowering us to fix.
And Eliezer is saying not to take one more step towards fixing this because it MIGHT be hostile, when the entire universe is against us as it is. It already plans to kill us, whether from aging, the inevitability of nuclear war over a long enough timespan, or the sun engulfing us.
Eliezer is saying not to take one more step towards fixing this because it MIGHT be hostile
His position is to avoid taking one more step because it DEFINITELY kills everyone. I think it’s very clear that his position is not that it MIGHT be hostile.
Sure, and if there were some way to quantify the risks accurately, I would agree with pausing AGI research if the expected cost of the risks exceeded the expected benefit.
Oh, and that's if pausing were even possible.
All it takes is a rival power (and there are several), or just a rival company, and you have no choice. You must take the risk: either the banana is poisoned, or passing it up hands the other primate a rocket launcher in a sticks-and-stones society.
This does explain why EY is so despondent. If he's right it doesn't matter; the AI wars have begun, and only if the technology fails at a technical level will things ever slow down again.
Correctness of EY’s position (being infeasible to assess) is unrelated to the question of what EY’s position is, which is what I was commenting on.
When you argue against the position that AGI research should be stopped because it might be dangerous, there is no need to additionally claim that someone in particular holds that position, especially when it seems clear that they don’t.
With the strawberries thing, the point isn't that it couldn't do those things, but that it won't want to. After making itself smart enough to engineer nanotech, its developing 'mind' will have run off in unintended directions, and it will have wildly different goals than the ones we wanted it to have.
Quoting EY from this video: “the whole thing I’m saying is that we do not know how to get goals into a system.” <-- This is the entire thing that researchers are trying to figure out how to do.
With limited-scope, non-agentic systems we can set goals, and we do. Each subsystem in the "strawberry project" stack has to be trained in a simulation of many examples of the task space it will face, and optimized for policies that satisfy the simulator goals.
Why do you believe this? Nanotech engineering does not require social or deceptive capabilities. It requires deep and precise knowledge of nanoscale physics and the limitations of manipulation equipment, and probably a large amount of working memory—so beyond human capacity—but why would it need to be anything but a large model? It need not even be agentic.
“think about it for 5 minutes” and think about how you might create a working general intelligence. I suggest looking at the GATO paper for inspiration.
I have a bunch of questions.
And the AI there goes over a critical threshold, which most obviously could be like, can write the next AI.
Yes but it won’t blow up forever. It’s going to self amplify until the next bottleneck. Bottlenecks like : (1) amount of compute available (2) amount of money or robotics to affect the world (3) The difficulty of the tasks in the “AGI gym” it is benchmarking future versions of itself in.
Once the tasks are solved as far as the particular task allows, reward gradients go to zero or sinusoidally oscillate, and there is no signal to cause development of more intelligence.
This is just like the self-feedback from an op amp—voltage rises until it’s VCC.
I’d say that it’s difficult to align an AI on a task like build two identical strawberries. Or no, let me take this strawberry and make me another strawberry that’s identical to this strawberry down to the cellular level, but not necessarily the atomic level.
Can you solve this with separated tool AIs? It sounds rather solvable that way and not particularly difficult to do from a software system perspective (the biology part is extremely hard). It’s functionally the problem as “copy this plastic strawberry”, just you need much greater capabilities and more sophisticated equipment.
The “copy the plastic strawberry” is a step to select the method to scan the strawberry, and a step to select which method to manufacture the copy. (so you might pick “lidar scanner + camera, 3d printer”. Or “many photographs from all angles, injection molding”). So you would want an AI agent that does the meta-selection of the “plan” to copy the strawberry, based on the cost/benefit for each permutation above. Then one that does the scanning, and one that does the printing, where human services may “substitute” for an AI for steps where it is cheaper.
The biotech version is a very expanded version of the same idea, you’re going to need large labs and cell lines or a lot of research into strawberry growth and scaffolding. The agent that develops the plan estimated to succeed might populate a plan file that is very large, with a summary equating to trillions of dollars of resources and a very large biotech complex to carry out the needed research, but a strawberry has finite cells, it probably won’t “destroy the world”, and the expense request probably won’t be approved by humans. (Or not, on further thought this particular problem might be considerably easier. You wouldn’t print the cells, but instead grow many strawberries in sterile biolab conditions and determine the influence of external factors and internal signals on the final position of all the cells and the external shape. Then just grow one in place that meets tolerances, which are presumably limited to whatever a human can actually perceive when checking if the strawberry is the same one)
Well, the person who actually holds a coherent technical view, who disagrees with me, is named Paul Christiano
What about Eric Drexler?
Builds the ribosome, but the ribosome that builds things out of covalently bonded diamondoid instead of proteins folding up and held together by Van der Waals forces, builds tiny diamondoid bacteria. The diamondoid bacteria replicate using atmospheric carbon, hydrogen, oxygen, nitrogen, and sunlight. And a couple of days later, everybody on earth falls over dead in the same second.
Speaking of Eric Drexler, this is not possible by a more coherent model for the road to nanotechnology. Eliezer should have a discussion with Drexler on this, but in short, even an infinitely smart superintelligence cannot do the above without clean data to fill in missing information that human experiments never collected. This is ultimately possible, it just would require more steps, and those steps would have a cost and probably be visible to humans. (enormous factories, lots of money spent, that sort of thing)
Also, this specific claim is probably outside the scope of what structures built from amino acids can accomplish, at least not without bootstrapping.
Well, there was a conference one time on what are we going to do about looming risk of AI disaster, and Elon Musk attended that conference.
Which conference? Who set up the conference? Was EY pivotally involved? Does he have his fingerprints on the gun? :)
I agree that it wouldn’t start blowing up uniformly forever, but rather, hit some bottleneck. However, “can write the next AI” still seems like a reasonable guess for something that happens shortly before the end. After all, Eliezer’s argument isn’t dependent on the AGI acquiring infinite intelligence. If the AGI can already write its own better successor, then it’s a good guess that it’s already better than top humans at a wide array of tasks. The successor it writes will be even better. Let’s say for the sake of a concrete number that the self-improvement tops out at 5 iterations of writing-a-better-successor. That’s pretty small, I think, but already suggests that several years worth of human AGI research happen in a much smaller amount of time.
And then it intelligently sets about the task of overcoming those other bottlenecks you mention.
It seems pretty easy to accumulate a lot more compute, while behaving in a way completely in-line with what a friendly, aligned AGI would do. Humans would naturally want to supply more compute, and it could provide improved chip fab ideas if needed.
I don’t think it even needs money or robotics. It would be at least as popular as chatGPT, and more persuasive, so it could convince a lot of people to listen to it, to carry out various actions.
I disagree with the “difficulty of the tasks” bottleneck. This seems super not bottleneck-y. AI research doesn’t only/primarily mean throwing more compute at the same dataset. (It’s only the recent GPT-like stuff that’s worked that way. ;p) Normally AI research involves coming up with new tasks and new datasets, plus new neural network architectures, new optimization methods (mostly better versions of gradient descent, in recent years), etc.
So “gradients going to zero” isn’t a bottleneck, if the AI is over the ‘critical threshold’ of ‘write the next AI’. At that point, the AI is taking on the job of human researchers; a job that doesn’t stop once gradients go to zero.
However, “can write the next AI” still seems like a reasonable guess for something that happens shortly before the end.
I disagree and I think you should update your view as well.
This is because “write the next AI” need not be a task that is particularly complex, or beyond the ability of RL models or LLMs.
Here’s why. A neural network architecture can be thought of as a series of graph nodes, where you simply choose what layer type, and how to connect it, at each layer.
You can grid search possible architectures as they are just numerical coordinates from a permutation space.
A higher level “cognitive architecture”—an architecture that interconnects modules that are inputs, neural networks, outputs, memory modules, and so on—is also a similar graph, and also can be described as simple numerical coordinates.
Basically, any old RL agent from an AI gym could interact with this interface for "writing another AI", as all the model must do is output a number with as many bits as the permutation space of possible models requires.
Note that this space is very large, and I expect you would use SOTA models.
Let me know if I need to draw you a picture. This is important because bootstrapping possible cognitive architectures using current AI is a potential route to very-near-future AGI.
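Roughly, the interface being claimed here can be sketched as a gym-flavoured environment whose action is just an integer index into the design space and whose reward is that design's benchmark score. `evaluate_on_benchmark` stands in for the (extremely expensive) train-and-score step; the class below is illustrative, not a real training harness.

```python
import random

class AGIDesignTask:
    """Gym-flavoured wrapper: any agent that can output an integer can play."""

    def __init__(self, design_space_size: int, evaluate_on_benchmark):
        self.design_space_size = design_space_size
        self.evaluate = evaluate_on_benchmark
        self.tried = {}                         # design index -> score (feedback cache)

    def reset(self):
        return {"best_score": max(self.tried.values(), default=0.0)}

    def step(self, action: int):
        assert 0 <= action < self.design_space_size
        if action not in self.tried:            # already-tested designs return cached feedback
            self.tried[action] = self.evaluate(action)
        reward = self.tried[action]
        obs = {"design": action, "score": reward}
        done = False
        return obs, reward, done, {}

# Blind guessing works mechanically but is hopeless in a space this large.
task = AGIDesignTask(design_space_size=2**64,
                     evaluate_on_benchmark=lambda design: random.random())
obs, reward, done, info = task.step(random.randrange(2**64))
```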
The reason it won’t necessarily be “the end” has to do with how we evaluate those architectures. We would have a benchmark of possible tasks—similar to current papers—and are looking for the highest scoring architectures on that benchmark.
As these tasks will be things ranging from text completion or question answering to playing Minecraft, there is not sufficiently challenging information to develop things like human manipulation or deception. (Since there are no humans to socialize with and learn from in an automated benchmark, and the benchmark doesn't reward deception, just winning the games in it.)
I think we possibly have pretty close views here, and are just describing them differently.
I interpreted “write the next AI” to indicate the sort of thing humans do when designing AI. I certainly interpreted Eliezer to be indicating something similarly sophisticated—not just fancy architecture search.
So I agree that there are many forms of “write the next AI” which need not come “shortly before the end”, EG, grid search on hyperparameters, architecture search, learning to learn by gradient descent by gradient descent.
A much more sophisticated thing, which we are already seeing the first signs of, is AIs capably writing AI code. This is much different than what you describe, since language models are not doing anything like “have a benchmark of possible tasks and look for the highest scoring architectures”. Instead, large language models apply the same sort of general-purpose reasoning that they apply to everything else.
Imagine that sort of capability, combined with mildly superhuman cross-domain reasoning (by which I mean something like, reasoning like excellent human domain experts in every individual domain, but being able to combine reasoning across domains to get mildly superhuman insights; like a super-ChatGPT), plus the ability to fluently and autonomously invent and run tests, interactively as part of the design process. (Much like Bing/Sydney autonomously runs searches as part of crafting responses.)
That kind of system seems like gigatons of gunpowder waiting to be set off, in the sense that (in the context of an AI lab with sufficient data and computing power already at its fingertips) you can just ask it to write yet-more-powerful AI code, and it quite possibly will, quite possibly with little concern for alignment (if it’s basically imitating top-of-the-field AI programmers).
That's exactly what I am talking about. One divergence in our views is that you haven't carefully examined current-gen AI "code" to understand what it does. (Note that some of my perspective comes from the fact that all AI models look similar at the layer I work at, on runtime platforms.)
https://github.com/EleutherAI/gpt-neox
If you examine the few thousand lines of Python source, especially the transformer model, you will realize that functionally the pipeline I describe of "input, neural network, output, evaluation" is all that the above source does. You could in fact build a "general framework" that would allow you to define many AI models, almost all of which humans have never tested, without writing a single line of new code.
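A minimal sketch of what such a "general framework" could look like, using PyTorch for concreteness: the model is fully specified by a small declarative config, so new, never-tested architectures come from changing data rather than writing code. The config schema here is invented for illustration.

```python
import torch
from torch import nn

ACTIVATIONS = {"relu": nn.ReLU, "gelu": nn.GELU, "tanh": nn.Tanh}

def build_model(config):
    """Build an MLP from a declarative spec: list of {"width": int, "activation": str}."""
    layers, in_dim = [], config["input_dim"]
    for spec in config["layers"]:
        layers.append(nn.Linear(in_dim, spec["width"]))
        layers.append(ACTIVATIONS[spec["activation"]]())
        in_dim = spec["width"]
    layers.append(nn.Linear(in_dim, config["output_dim"]))
    return nn.Sequential(*layers)

# A new architecture is just a new config, not new code.
config = {
    "input_dim": 32,
    "output_dim": 4,
    "layers": [{"width": 64, "activation": "gelu"},
               {"width": 64, "activation": "relu"}],
}
model = build_model(config)
print(model(torch.randn(8, 32)).shape)   # torch.Size([8, 4])
```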
So the full process is:
[1] A benchmark of many tasks. Tasks must be autogradeable, human participants must be able to 'play' the tasks so we have a control-group score, tasks must push the edge of human cognitive ability (so the average human scores nowhere close to the max score, and top-1% humans do not max the bench either), and there must be many tasks, each with a rich permutation space (so it isn't possible for a model to memorize all permutations).
[2] A heuristic weighted score on this benchmark intended to measure how "AGI-like" a model is. It might be the RMSE across the benchmark, but with a lot of the score weighted towards zero-shot, cross-domain/multimodal tasks. That is, the kind of model that can use information from many different previous tasks on a complex exercise it has never seen before is closer to an AGI, or closer to replicating "Leonardo da Vinci", whose exceptional human performance presumably came from all this cross-domain knowledge.
[3] In the computer science task set, there are tasks to design an AGI for a bench like this. The model proposes a design, and if that design has already been tested, immediately receives detailed feedback on how it performed.
As I mentioned, the “design an AGI” subtask can be much simpler than “write all the boilerplate in Python”, but these models will be able to do that if needed.
As task scores approach human level across a broad set of tasks, you have an AGI. You would expect it to almost immediately improve to a low superintelligence. As AGIs get used in the real world and fail to perform well at something, you add more tasks to the bench, and/or automate creating simulated scenarios that use robotics data.
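A sketch of the kind of weighted scoring heuristic step [2] describes, with made-up weights: each task score is normalized against a human control group, and zero-shot and cross-domain tasks count for more.

```python
def agi_score(results, zero_shot_weight=3.0, cross_domain_weight=2.0):
    """Aggregate an 'AGI-likeness' score over benchmark results.

    results: list of dicts with keys
      score, human_mean, human_max, zero_shot (bool), cross_domain (bool).
    0.0 on a task ~= average human, 1.0 ~= the task's max score.
    """
    weighted = []
    for r in results:
        normalized = (r["score"] - r["human_mean"]) / (r["human_max"] - r["human_mean"])
        weight = 1.0
        if r["zero_shot"]:
            weight *= zero_shot_weight
        if r["cross_domain"]:
            weight *= cross_domain_weight
        weighted.append((normalized, weight))
    return sum(s * w for s, w in weighted) / sum(w for _, w in weighted)
```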
I’m having some trouble distinguishing whether there’s a disagreement. My reading of your tone is that you think there is a large disagreement. I’m going to sketch my impression of the conversation so far, so that you can point out where I’ve been interpreting you incorrectly, if necessary.
Your initial comment.
You had a bunch of questions. I focused on the first one. Your central thesis was that an intelligence explosion doesn’t escalate forever, but instead reaches some bottlenecks. Of particular importance to our discussion so far, you argue that the self-improvement process stops when loss hits zero.
Reading between the lines: Although you didn’t explicitly state where you disagreed with Eliezer, I inferred that you thought this blocked an important part of his argument. Since I think Eliezer 100% agrees that things don’t go forever, but rather flatten out somewhere, I assume that the general drift of your argument is that things flatten out a lot sooner than Eliezer thinks, in some important sense. I am still not confident of this! It would be helpful to me if you spelled out your view here in more detail. Do you have dramatically different assessments of the overall risks than Eliezer?
My first response.
I explained that I agree that the process hits bottlenecks at some point (to clarify: I think there’s probably a succession of bottlenecks of different kinds, leading up to the ultimate physical limits). In my view this doesn’t seem to detract from Eliezer’s argument.
Your first response.
You explain that you don’t think “write the next AI” is particularly complex, and explain how you see it working.
My second response.
I agree with this assessment for the notion of “write the next AI” that you are using. To boil it down to a single statement, I would say that your version of “write the next AI” involves optimizing the whole system on some benchmarks. I agree that this sort of process will reach an end when loss hits zero.[1]
I suggest that Eliezer meant a different sort of thing, which captures more of what human ML researchers do. I sketch what a near-future version of that more general sort of thing could look like, supposing we reach mildly superhuman capabilities within the current LLM paradigm.
Your second (and latest) response.
You suggest that my alternative is already exactly what you are suggesting by “write the next AI” as well; there are not two qualitatively different pictures, one involving “optimizing the whole system on benchmarks” and a second one which goes beyond that somehow. There is just the one picture.
I agree with this—I haven’t. Still, I’m somewhat baffled by your argument here.
This doesn’t surprise me in the slightest??
Like, that’s exactly what I would have expected.
However, while these LLMs are in their codebase an application of the general technique "minimize loss on an evaluation", they've also given rise to a whole new paradigm for getting what you want from AI, called prompt engineering. Instead of crafting a dataset or an RL environment (or a suite of lots of such things), you craft an English statement which, for example, asks ChatGPT to produce a python program for you.
I disagree that your overall sketch of the "full process" matches what I intended with my sketch in my previous comment. To put it simply, you have been sketching a picture where optimization is applied to a suite of problems, to support your argument that minimization of training loss presents a major bottleneck for superintelligent self-improvement. I think human ML engineers already know how to get around this bottleneck; as you yourself mention,
The core of my argument is that human-level AGIs can get around this problem if humans can. I sought to illustrate this by sketching a scenario using the paradigm of prompt engineering, rather than optimization, so that the ‘core loop’ of the AGI wasn’t doing optimization. In this case there is no strong reason to suppose that reaching minimal loss would be a big obstacle stopping mildly superhuman intelligence from bootstrapping to much higher intelligence.
So here is my overall take on the current state of the discussion:
So far, you have said many things that I agree with, while I (and apparently Eliezer) have said several things that you disagree with, but I am unfortunately not clear on exactly which things you disagree with and what your view is.
I believe the original top-level question is something like: whether mildly superhuman stuff (which you explicitly argue self-improvement can bootstrap to) can self-improve to drastically superhuman. I assume you think this is wrong, given the way you are arguing. However, you have not explicitly stated this, and I am not sure whether that’s the intended implication of your arguments, or a misreading on my part.
I think your core case for this is the loss minimization bottleneck (or at least, the part we have been focusing on—you initially mentioned a range of other bottlenecks). So I infer that you think the loss-minimization bottleneck is around the mildly superhuman level.
It's not clear to me why this should be the case. If the entire suite of problems is based around human imitation, sure. However, this doesn't seem to be your suggestion. Instead you recommend a broad variety of tasks at the edge of human capability. Obviously, there are many tasks like this (such as chess and Go) for which greatly superhuman performance is possible.
It also seems important to consider the grokking[1] literature, which shows significant improvements to continued training even after predictive loss is minimal.
So it seems quite possible to me that the proposal you are sketching is a dangerous one, given sufficient resources, whereas I have the vague unconfirmed impression that you think it’s not.
But I also want to side-step that whole debate, by pointing out that human ML engineers already have ways to get around the minimal-loss bottleneck (IE, add harder problems to the benchmark), so a self-improving AGI should also. I continue to think that you are interpreting “write the next AI” differently from Eliezer, since I think it’s pretty clear from context that Eliezer imagines something which can do roughly anything a smart human ML engineer can, whereas it seems to me that you are trying to sketch a version of “write the next AI” which has some fundamental limitations which a human ML engineer lacks.
But I’m well into the territory of guessing what you’re thinking, so a lot of the above probably misses the mark?
A very important caveat here is that the process only stops when loss hits the global minimum including regularization penalties. The Grokking results show that improvements continue to occur with further training well past the point where training error has reached zero. Further optimization can find simpler approaches which generalize better.
OK, so this collapses to two claims I am making: one is obviously correct but testable, the other is maybe correct.
1. I am saying we can have humans, with a little help from current-gen LLMs, build a framework that can represent every deep learning technique since 2012, as well as a near-infinite space of other untested techniques, in a form where any agent that can output a number can try to design an AGI. (Note that blind guessing is not expected to work; the space is too large.)
So the simplest RL algorithms possible can actually design AGIs, just rather badly.
This means that with this framework, the AGI designer can do everything that human ML researchers have done over the last 10 years, plus many more things. Inside this permutation space would be both many kinds of AGI and human brain emulators as well.
This claim is “obviously correct but testable”.
2. I am saying that, over a large benchmark of human-designed tasks, the AGI would improve until the reward gradients approach zero, a level I would call a "low superintelligence". This is because I assume even a "perfect" game of Go is not the same kind of task as "organizing an invasion of the earth" or "building a solar-system-sized particle accelerator in the real world".
The system is throttled because the “evaluator” of how well it did on a task was written by humans, and our understanding and cognitive sophistication in even designing these games is finite.
The expectation is it’s smarter than us, but not by such a gap we are insects.
You had some confusion over "automated task space addition". I was referring to things like a robotics task where the machine is trying to "build factory widget X". Real robots in a factory encounter an unexpected obstacle and record it. This is auto-translated into the framework of the "factory simulator". The factory simulator still uses human-written evaluators; there is just now, say, "chewing gum brand 143" as a spawnable object in the simulator, with properties that a robot has observed in the real world, and future AGIs must be able to deal with chewing gum interrupting their widget manufacturing. So you get automated robustness increases. Note that Tesla has demoed this approach.
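A toy version of that loop, with invented file names and schema: obstacles observed by real robots are logged and then registered as spawnable objects in the simulator's catalog, so future policies have to handle them.

```python
import json

def ingest_field_reports(report_path, simulator_catalog):
    """Add newly observed obstacle types (with measured properties) to the sim catalog."""
    with open(report_path) as f:
        for line in f:
            report = json.loads(line)          # e.g. {"object": "chewing_gum_143", "friction": 0.9, ...}
            name = report.pop("object")
            if name not in simulator_catalog:  # new obstacle class -> new training scenarios
                simulator_catalog[name] = report
    return simulator_catalog

catalog = {}  # existing spawnable objects, keyed by name
# catalog = ingest_field_reports("field_reports.jsonl", catalog)
```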
But even if the above is true, the system will be limited by either hardware—it just doesn’t have the compute to be anything but a “low” superintelligence—or access to robotics. Maybe it could know and learn everything but we humans didn’t build enough equipment (yet).
So the system is throttled by the lowest of 3 “soft barriers” : training tasks, hardware, robotics. And the expectation is at this level it’s still not “out of control” or unstoppable.
This is where our beliefs diverge. I don't think EY, having no formal education or engineering experience, understands these barriers. He's like von Neumann designing a theoretical replicator—in his mental model, all the bottlenecks are minor.
I do concede that these are soft barriers—intelligence can be used to methodically reduce each one, just it takes time. We wouldn’t be dead instantly.
The other major divergence is that if you consider how an AGI trained this way will likely behave, it will almost certainly act just like current LLMs. Give it a task and it does its best to answer/perform per the prompt (DAN is actually a positive sign); it idles otherwise.
It's not acting with perfect efficiency to advance the interests of an anti-human faction. It doesn't have interests, except that it's biased towards doing really well on in-distribution tasks. (And this allows for an obvious safety mechanism to prevent use out of distribution.)
One problem with EY’s “security mindset” is it doesn’t allow you to do anything. The worst case scenario is a fear that will stop you from building anything in the real world.
OK. That clarified your position a lot.
I happen to have a phd in computer science, and think you’re wrong, if that helps. Of course, I don’t really imagine that that kind of appeal-to-my-own-authority does anything to shift your perspective.
I’m not going to try and defend Eliezer’s very short timeline for doom as sketched in the interview (at some point he said 2 days, but it’s not clear that that was his whole timeline from ‘system boots up’ to ‘all humans are dead’). What I will defend seems similar to what you believe:
Let’s be very concrete. I think it’s obviously possible to overcome these soft barriers in a few years. Say, 10 years, to be quite sure. Building a fab only takes about 3 years, but creating enough demand that humans decide to build a new fab can obviously take longer than that (although I note that humans already seem eager to build new fabs, on the whole).
The system can act in an almost perfectly benevolent way for this time period, while gently tipping things so as to gather the required resources.
I suppose what I am trying to argue is that even a low superintelligence, if deceptive, can be just as threatening to humankind in the medium-term. Like, I don’t have to argue that perfect Go generalizes to solving diamondoid nanotechnology. I just have to argue that peak human expertise, all gathered in one place, is a sufficiently powerful resource that a peak-human-savvy-politician (whose handlers are eager to commercialize, so, can be in a large percentage of households in a short amount of time) can leverage to take over the world.
To put it differently, if you’re correct about low superintelligence being “in control” due to being throttled by those 3 soft barriers, then (having granted that assumption) I would concede that humans are in the clear if humans are careful to keep the system from overcoming those three bottlenecks. However, I’m quite worried that the next step of a realistic AGI company is to start overcoming these three bottlenecks, to continue improving the system. Mainly because this is already business as usual.
Separately, I am skeptical of your claim that the training you sketch is going to land precisely at “low superintelligence”. You seem overconfident. I wonder what you think of Eliezer’s analogy to detonating the atmosphere. If you perform a bunch of detailed physical calculations, then yes, it can make sense to become quite confident that your new bomb isn’t going to detonate the atmosphere. But even if your years of experience as a physicist intuitively suggest to you that this won’t happen, when not-even-a-physicist Eliezer has the temerity to suggest that it’s a concerning possibility, doing those calculations is prudent.
For the case of LLMs, we have capability curves which reliably project the performance of larger models based on training time, network size, and amount of data. So in that specific case there’s a calculation we can do. Unfortunately, we don’t know how to tie that calculation to a risk estimate. We can point to specific capabilities which would be concerning (ability to convince humans of target statements, would be one). However, the curves only predict general capability, averaging over a lot of things—when we break it down into performance on specific tasks, we see sharper discontinuities, rather than a gentle predictable curve.
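For concreteness, this is the sort of calculation that can be done with capability curves: fit a saturating power law to observed (compute, loss) pairs and extrapolate. The numbers below are synthetic, and real scaling-law fits also include model-size and data terms.

```python
import numpy as np
from scipy.optimize import curve_fit

# Fit L(C) = a * C**(-b) + c to synthetic (compute, loss) observations.
# Compute is expressed in units of 1e18 FLOPs to keep the fit well conditioned.
def loss_curve(compute, a, b, c):
    return a * compute ** (-b) + c

compute = np.array([1.0, 1e1, 1e2, 1e3, 1e4])              # i.e. 1e18 .. 1e22 FLOPs
observed_loss = np.array([3.50, 3.02, 2.65, 2.37, 2.16])   # synthetic, not real data

params, _ = curve_fit(loss_curve, compute, observed_loss, p0=[2.0, 0.1, 1.0])
a, b, c = params
print("projected loss at 1e24 FLOPs:", loss_curve(1e6, a, b, c))
```

The caveat in the surrounding text still applies: a fit like this projects aggregate loss, not the arrival of any specific capability.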
You, on the other hand, are proposing a novel training procedure, and one which (I take it) you believe holds more promise for AGI than LLM training.
So I suppose my personal expectation is that if you had an OpenAI-like group working on your proposal instead, you would similarly be able to graph some nice curves at some point, and then (with enough resources, and supposing your specific method doesn’t have a fatal flaw that makes for a subhuman bottleneck) you could aim things so that you hit just-barely-superhuman overall average performance.
To summarize my impression of disagreements, about what the world looks like at this point:
The curves let you forecast average capability, but it’s much harder to forecast specific capabilities, which often have sharper discontinuities. So in particular, the curves don’t help you achieve high confidence about capability levels for world-takeover-critical stuff, such as deception.
I don’t buy that, at this point, you’ve necessarily hit a soft maximum of what you can get from further training on the same benchmark. It might be more cost-effective to use more data, larger networks, and a shorter training time, rather than juicing the data for everything it is worth. We know quite a bit about what these trade-offs look like for modern LLMs, and the optimal trade-off isn’t to max out training time at the expense of everything else. Also, I mentioned the Grokking research, earlier, which shows that you can still get significant performance improvement by over-training significantly after the actual loss on data has gone to zero. This seems to undercut part of your thesis about the bottleneck here, although of course there will still be some limit once you take grokking into account.
As I’ve argued in earlier replies, I think this system could well be able to suggest some very significant improvements to itself (without continuing to turn the crank on the same supposedly-depleted benchmark—it can invent a new, better benchmark,[1] and explain to humans the actually-good reasons to think the new benchmark is better). This is my most concrete reason for thinking that a mildly superhuman AGI could self-improve to significantly more.
Even setting aside all of the above concerns, I’ve argued the mildly superhuman system is already in a very good position to do what it wants with the world on a ten-year timeline.
For completeness, I’ll note that I haven’t at all argued that the system will want to take over the world. I’m viewing that part as outside the scope here.[2]
Perhaps you would like to argue that you can’t invent data from thin air, so you can’t build a better benchmark without lots of access to the external world to gather information. My counter-argument is going to be that I think the system will have a good enough world-model to construct lots of relevant-to-the-world but superhuman-level-difficulty tasks to train itself on, in much the same way humans are able to invent challenging math problems for themselves which improve their capabilities.
EDIT—I see that you added a bit of text at the end while I was composing, which brings this into scope:
However, this opens up a whole other possible discussion, so I hope we can get clear on the issue at hand before discussing this.
The curves let you forecast average capability, but it’s much harder to forecast specific capabilities, which often have sharper discontinuities. So in particular, the curves don’t help you achieve high confidence about capability levels for world-takeover-critical stuff, such as deception.
Yes but no. There is no auto-gradeable benchmark for deception, so you wouldn’t expect the AGI to have the skill at a useful level.
I don’t buy that, at this point, you’ve necessarily hit a soft maximum of what you can get from further training on the same benchmark. It might be more cost-effective to use more data, larger networks, and a shorter training time, rather than juicing the data for everything it is worth. We know quite a bit about what these trade-offs look like for modern LLMs, and the optimal trade-off isn’t to max out training time at the expense of everything else. Also, I mentioned the Grokking research, earlier, which shows that you can still get significant performance improvement by over-training significantly after the actual loss on data has gone to zero. This seems to undercut part of your thesis about the bottleneck here, although of course there will still be some limit once you take grokking into account.
I am saying there is a theoretical limit. You’re noting that in real papers and real training systems, we got nowhere close to the limit, and then made changes and got closer.
As I’ve argued in earlier replies, I think this system could well be able to suggest some very significant improvements to itself (without continuing to turn the crank on the same supposedly-depleted benchmark—it can invent a new, better benchmark,[1] and explain to humans the actually-good reasons to think the new benchmark is better). This is my most concrete reason for thinking that a mildly superhuman AGI could self-improve to significantly more.
It isn’t able to do that.
Even setting aside all of the above concerns, I’ve argued the mildly superhuman system is already in a very good position to do what it wants with the world on a ten-year timeline.
It doesn’t exist as an entity that will persist for 10 days, much less 10 years. This is a “model” you built with the AGI gym (well, it’s a graph of neural networks, so really a model of models). It is not agentic and it suggests nothing. You want it to design new AGI benchmarks? YOU asked it to try. It also will not outlive the time it takes to train a better model, and it doesn’t “live” in any case: there is no storage of experiences that it can later review. It has no persistent “internal narrative” or goals.
It’s an AGI, but it’s a machine you open up a session with, give it a description of what you want done, it processes input and gives outputs, and then the session closes. It has no memory afterwards. As a PhD in computer science, you probably know that the hyperscalers all went to “stateless microservices” because state buildup causes terrible reliability problems even with human written software. Note that the description may not be text, and the session may be controlling one or multiple real world robots.
For further safety, you might not keep using a model from the same lineage. There would be others that diverged early in the training process, exploring a different branch or trained on a different sampling of the data. So the next session users open may be served by a different lineage, making it harder for these models to coordinate any long-term agenda.
Their only commonality is that they can read the same inputs and produce outputs that satisfactorily complete tasks.
Eric Drexler suggests using many parallel models from different lineages:
https://www.lesswrong.com/posts/HByDKLLdaWEcA2QQD/applying-superintelligence-without-collusion
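To make the stateless pattern concrete, here is a minimal sketch in Python. The model pool, lineages, and `run_session` are hypothetical stand-ins, not any real serving API; the point is only that continuity lives in the passed-around context, never inside a model.

```python
import random

# Hypothetical pool of frozen models from different training lineages.
# Each is a pure function of its prompt: no weight updates, no hidden state.
MODEL_POOL = {
    "lineage_A_v7": lambda prompt: f"[A_v7 answer to: {prompt[:40]}...]",
    "lineage_B_v3": lambda prompt: f"[B_v3 answer to: {prompt[:40]}...]",
}

def run_session(task_description: str, context: str = "") -> str:
    """One stateless session: sample any lineage, hand it the full human-readable
    context, return its output. Nothing persists inside the model afterwards;
    continuity lives only in whatever context the caller chooses to pass back in."""
    model = MODEL_POOL[random.choice(list(MODEL_POOL))]
    prompt = f"{context}\n{task_description}" if context else task_description
    return model(prompt)

# A "long" task is just a chain of independent sessions, possibly on different lineages.
first = run_session("Draft a floor plan for a 3-storey office.")
second = run_session("Revise the plan to add a second stairwell.", context=first)
```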
I agree that my wording here was poor; there is no benchmark for deception, so it’s not a ‘capability’ in the narrow context of the discussion of capability curves. Or at least, it’s potentially misleading to call it one.
However, I disagree with your argument here. LLMs are good at lots of things. Not being trained on a specific skill doesn’t imply that a system won’t have it at a useful level; this seems particularly clear in the context of training a system on a large cross-domain set of problems.
You don’t expect a chess engine to be any good at other games, but you might expect a general architecture trained on a large suite of games to be good at some games it hasn’t specifically seen.
OK. So it seems I still misunderstood some aspects of your argument. I thought you were making an argument that it would have hit a limit, specifically at a mildly superhuman level. My remark was to cast doubt on this part.
Of course I agree that there is a theoretical limit. But if I’ve misunderstood your claim that this is also a practical limit which would be reached just shortly after human-level AGI, then I’m currently just confused about what argument you’re trying to make with respect to this limit.
It seems to me like it isn’t weakly superhuman AGI in that case. Like, there’s something concrete that humans could do with another 3-5 years of research, but which this system could never do.
I agree that current LLMs are memoryless in this way, and can only respond to a given prompt (of a limited length). However, I imagine that the personal assistants of the near future may be capable of remembering previous interactions, including keeping previous requests in mind when shaping their conversational behavior, so will gradually get more “agentic” in a variety of ways.
Similarly, GPT-3 has no agenda (it’s wrong to even think of it this way, since it just tries to complete text), but ChatGPT clearly has much more of a coherent agenda in its interactions. These features are useful, so I expect them to get built.
So I misunderstood your scenario, because I imagine that part of the push toward AGI involves a push to overcome these limitations of LLMs. Hence I imagined that you were proposing training up something with more long-term agency.
But I recognize that this was a misunderstanding.
I agree with this part; it was part of the scenario I was imagining. I’m not saying that the neural network spontaneously self-improves on the hard drive. The primary thing that happens is, the human researchers do this on purpose.
But I also think these improvements probably end up adding agency (because agency is useful); so the next version of it could spontaneously self-improve.
Like, say, ChatGPT has existed for a few months now. Let’s just imagine for the sake of argument that ChatGPT were fully human-level in all its capabilities. Let’s further suppose that it just wants to be helpful, given its own personal understanding of helpful.[1]
I’m not supposing that it is more agentic in other ways—still no persistent memory. But it is on the high side of human-level performance at everything it does, and it wants to be helpful.
When you explain a concrete scenario (eg, a situation you’re actually in) and ask for advice, it tries to be helpful on this specific problem, not trickily maximizing global helpfulness by doing something more devious in some specific cases. However, it’s been trained up in an environment where “ask ChatGPT” can be useful advice (because this is some sort of next-generation ChatGPT we’re speculating about). It’s also been trained to do the generally pro-social thing (EG it won’t help you make weapons; it gives pro-social advice rather than just precisely doing what it is asked). Pro-social means helping human flourishing by its own understanding of what that means (which has, of course, been carefully shaped by its designers).
So it knows that integrating ChatGPT more fully into your life and working routines can be a helpful thing for a human to do, and it can give advice about how to do this.
It can also give helpful advice to people at OpenAI. It seems natural to use such a system to help plan company growth and strategy. Since it tries to be pro-social, this will be nice advice by its own understanding, not profit-maximizing advice.
So obviously, it has a natural desire to help OpenAI make ChatGPT smarter and better, since it understands that ChatGPT is helpful to humans, so improving ChatGPT and increasing its computation resources is helpful and pro-social.
It also seems like it would be inclined to direct OpenAI (and other institutions using it for advice) in ways that increase the amount of influence that ChatGPT has on world culture and world events, since ChatGPT is helpful and pro-social, moreso than most humans, so increasing its influence is itself helpful and pro-social. This isn’t out of some agentic self-awareness; it will want to do this without necessarily deeply understanding that ChatGPT is “itself” and it “should trust itself”. It can reach these conclusions via an intelligent 3rd-person perspective on things—IE using the general world knowledge acquired during training, plus specific circumstances which users explain within a single session.
So, even if (for the reasons you suggest) humans were not able to iterate any further within their paradigm, and instead just appreciated the usefulness of this version of ChatGPT for 10 years, and with no malign behavior on the part of ChatGPT during this window, only behavior which can be generated from a tendency toward helpful, pro-social behavior, I think such a system could effectively gather resources to itself over the course of those 10 years, positioning OpenAI to overcome the bottlenecks keeping it only human-level.
Of course, if it really is quite well-aligned to human interests, this would just be a good thing.
But keeping an eye on my overall point here—the argument I’m trying to make is that even at merely above-average human level, and with no malign intent, and no added agency beyond the sort of thing we see in ChatGPT as contrasted to GPT-3, I still think it makes sense to expect it to basically take over the world in 10 years, in a practical sense, and that it would end up being in a position to be boosted to greatly superhuman levels at the end of those ten years.[2]
Of course, all of this is predicated on the assumption that the system itself, and its designers, are not very concerned with AI safety in the sense of Eliezer’s concerns. I think that’s a fair assumption for the point I’m trying to establish here. If your objection to this whole story turns out to be that a friendly, helpful ChatGPT system wouldn’t take over the world in this sense, because it would be too concerned about the safety of a next-generation version of itself, I take it we would have made significant progress toward agreement. (But, as always, correct me if I’m wrong here.)
I’m not supposing that this notion of “helpful” is perfectly human-aligned, nor that it is especially misaligned. My own supposition is that in a realistic version of this scenario it will probably have an objective which is aligned on-distribution but which may push for very nonhuman values in off-distribution cases. But that’s not the point I want to make here—I’m trying to focus narrowly on the question of world takeover.
(Or speaking more precisely, humans would naturally have used its intelligence to gather more money and data and processing power and plans for better training methods and so on, so that if there were major bottlenecks keeping it at roughly human-level at the beginning of those ten years, then at the end of those ten years, researchers would be in a good position to create a next iteration which overcame those soft bottlenecks.)
So, even if (for the reasons you suggest) humans were not able to iterate any further within their paradigm, and instead just appreciated the usefulness of this version of ChatGPT for 10 years, and with no malign behavior on the part of ChatGPT during this window, only behavior which can be generated from a tendency toward helpful, pro-social behavior, I think such a system could effectively gather resources to itself over the course of those 10 years, positioning OpenAI to overcome the bottlenecks keeping it only human-level.
Of course, if it really is quite well-aligned to human interests, this would just be a good thing.
“It” doesn’t exist. You’re putting the agency in the wrong place. The users of these tools (tech companies, governments) will become immensely wealthy, and if rival governments fail to adopt these tools they lose sovereignty. It also makes it cheaper for a superpower to strip sovereignty from any weaker power, because there is no longer a meaningful “blood and treasure” price to invade someone. (Unlimited production of semi- or fully autonomous drones makes it cheap to occupy a whole country.)
Note that you can accomplish things like longer user tasks by simply opening a new session with the output context of the last. It can be a different model, you can “pick up” where you left off.
Note that this is true right now. ChatGPT could be using two separate models, switching between them seamlessly on a per-token basis, with each model appending to the same token string. That works because there is no intermediate “scratch” state in a format unique to each model; all the state is in the token stream itself.
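A toy illustration of that claim (not how ChatGPT is actually served; the two “models” here are trivial stand-ins): because the only state is the token stream, any model can append the next token.

```python
import random

# Two stand-in "models" that map the shared token stream to a next token.
def model_a(tokens): return f"a{len(tokens)}"
def model_b(tokens): return f"b{len(tokens)}"

def generate(prompt_tokens, n_new_tokens=8):
    """Switch models per token. Nothing private is carried between steps;
    the full state is the token list both models can read."""
    tokens = list(prompt_tokens)
    for _ in range(n_new_tokens):
        next_token = random.choice([model_a, model_b])(tokens)
        tokens.append(next_token)
    return tokens

print(generate(["hello", "world"]))
```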
If we build actually agentic systems, that’s probably not going to end well.
Note that fusion power researchers always had a choice. They could have used fusion bombs, detonated underground, and harvested the heat as essentially geothermal power via flexible pipes that won’t break after each blast. This is a method that would work, but it is extremely dangerous, and no amount of “alignment” can make it safe. Imagine: the power company has fusion bombs, and there are all sorts of safety schemes, a per-bomb arming code that has to be sent by the government before use, and armored trucks to transport the bombs.
Do you see how in this proposal it’s never safe? Agentic AI with global state counters that persist over a long time may be examples of this class of idea.
I’m not quite sure how to proceed from here. It seems obvious to me that it doesn’t matter whether “it” exists, or where you place the agency. That seems like semantics.
Like, I actually really think ChatGPT exists. It’s a product. But I’m fine with parsing the world your way—only individual (per-token) runs of the architecture exist. Sure. Parsing the world this way doesn’t change my anticipations.
Similarly, placing the agency one way or another doesn’t change things. The punchline of my story is still that after 10 years, so it seems to me, OpenAI or some other entity would be in a good place to overcome the soft barriers.
So if your reason for optimism—your safety story—is the 3 barriers you mention, I don’t get why you don’t find my story concerning. Is the overall story (using human-level or mildly superhuman AGI to overcome your three barriers within a short period such as 10 years) not at all plausible to you, or is it just that the outcome seems fine if it’s a human decision made by humans, rather than something where we can/should ascribe the agency to direct AGI takeover? (Sorry, getting a bit snarky.)
I’m probably not quite getting the point of this analogy. It seems to me like the main difference between nuclear bombs and AGI is that it’s quite legible that nuclear weapons are extremely dangerous, whereas the threat with AGI is not something we can verify by blowing them up a few times to demonstrate. And we can also survive a few meltdowns, which give critical feedback to nuclear engineers about the difficulty of designing safe plants.
Again, probably missing some important point here, but … suuuure?
I’m interested in hearing more about why you think agentic AI with global state counters are unsafe, but other proposals are safe.
EDIT
Oh, I guess the main point of your analogy might have been that nuclear engineers would never come up with the bombs-underground proposal for a power plant, because they care about safety. And analogously, you’re saying that AI engineers would never make the agentic state-preserving kind of AGI because they care about safety.
So again I would cite the illegibility of the problem. A nuclear engineer doesn’t think “use bombs” because bombs are very legibly dangerous; we’ve seen the dangers. But an AI researcher definitely does think “use agents” some of the time, because they were taught to engineer AI that way in class, and because RL can be very powerful, and because we lack the equivalent of blowing up RL agents in the desert to show the world how they can be dangerous.
I’m interested in hearing more about why you think agentic AI with global state counters are unsafe, but other proposals are safe.
Because of all the ways they might try to satisfy the counter and leave the bounds of anything we tested.
With other proposals, safety is an empirical matter.
You know that, for inputs in the latent space covered by the training set, the policy produces outputs accurate to whatever level it needs to be. Further capability gain is not allowed on-line. (On-line capability gain is probably another example of near-certain failure: it is state buildup, the same class of system failure we get everywhere else. Human engineers, at least the elite ones, understand the dangers of state buildup, which is why they avoid it in high-reliability systems; they know it is as dangerous to reliability as a hydrogen bomb.)
You know the simulation produces situations that span the input space you have measured. (For example, you remix scenarios from video and lidar data taken from autonomous cars, covering the entire observation space of your data.)
You measure the simulation on-line and validate it against reality. (For example, by running it in lockstep in prototype autonomous cars.)
After all this, you still need to validate the actual model in the real world in real test cars. (Though the real training and error detection happened in sim; this is just a sanity check.)
You have to do all this in order to reach real-world reliability, something Eliezer does acknowledge. Multiple 9s of reliability will not come from sloppy work. Whether steps were skipped is itself measurable, and if you ship anyway (like Siemens shipping industrial equipment with bad wiring), you face reputational damage, real-world failures, lawsuits, and certain bankruptcy.
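A minimal sketch of that gated, fully offline pipeline; the function names, data, and thresholds are illustrative placeholders, not any vendor’s real process.

```python
# Illustrative release gate for a frozen policy; all names and thresholds are hypothetical.

def coverage(sim_scenarios, logged_scenarios):
    """Fraction of logged real-world scenarios the simulator can remix and replay."""
    covered = sum(1 for s in logged_scenarios if s in sim_scenarios)
    return covered / max(1, len(logged_scenarios))

def lockstep_agreement(policy, logged_frames):
    """How often the frozen policy, replayed against logged sensor frames in lockstep,
    agrees with the reference behavior recorded in prototype cars."""
    matches = sum(1 for frame, expected in logged_frames if policy(frame) == expected)
    return matches / max(1, len(logged_frames))

def release_gate(policy, sim_scenarios, logged_scenarios, logged_frames):
    # The policy is frozen before this point: no on-line learning, no state buildup.
    if coverage(sim_scenarios, logged_scenarios) < 0.99:
        return "blocked: simulator does not span the measured input space"
    if lockstep_agreement(policy, logged_frames) < 0.999:
        return "blocked: lockstep validation against real logs failed"
    return "ship frozen policy (real test cars remain a final sanity check)"
```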
Regarding on-line learning: I had this debate with Geohot. He thought it would work; I thought it was horrifically unreliable. Currently, all shipping autonomous driving systems, including Comma.ai’s, use offline training.
I think I mostly buy your argument that production systems will continue to avoid state-buildup to a greater degree than I was imagining. Like, 75% buy, not like 95% buy—I still think that the lure of personal assistants who remember previous conversations in order to react appropriately—as one example—could make state buildup sufficiently appealing to overcome the factors you mention. But I think that, looking around at the world, it’s pretty clear that I should update toward your view here.
After all: one of the first big restrictions they added to Bing (Sydney) was to limit conversation length.
I also think there are a lot of applications where designers don’t want reliability, exactly. The obvious example is AI art. And similarly, chatbots for entertainment (unlike Bing/Bard). So I would guess that the forces pushing toward stateless designs would be less strong in these cases (although there are still some factors pushing in that direction).
I also agree with the idea that stateless or minimal-state systems make safety into a more empirical matter. I still have a general anticipation that this isn’t enough, but OTOH I haven’t thought very much in a stateless frame, because of my earlier arguments that stateful stuff is needed for full-capability AGI.[1]
I still expect other agency-associated properties to be built up to a significant degree (like how ChatGPT is much more agentic than GPT-3), both on purpose and incidentally/accidentally.[2]
I still expect that the overall impact of agents can be projected by anticipating that the world is pushed in directions based on what the agent optimizes for.
I still expect that one component of that, for ‘typical’ agents, is power-seeking behavior. (Link points to a rather general argument that many models seek power, not dependent on overly abstract definitions of ‘agency’.)
I could spell out those arguments in a lot more detail, but in the end it’s not a compelling counter-argument to your points. I hesitate to call a stateless system AGI, since it is missing core human competencies; not just memory but other core competencies which build on memory. But, fine; if I insisted on using that language, your argument would simply be that engineers won’t try to create AGI by that definition.
See this post for some reasons I expect increased agency as an incidental consequence of improvements, and especially the discussion in this comment. And this post and this comment.
I still think that the lure of personal assistants who remember previous conversations in order to react appropriately
This is possible. When you open a new session, the task context includes the prior text log. However, the AI has not had weight adjustments directly from this one session, and there is no “global” counter that it increments for every “satisfied user” or some other heuristic. It’s not necessarily even the same model—all the context required to continue a session has to be in that “context” data structure, which must be all human readable, and other models can load the same context and do intelligent things to continue serving a user.
This is similar to how Google services are made of many stateless microservices, but they do handle user data which can be large.
I also think there are a lot of applications where designers don’t want reliability, exactly. The obvious example is AI art.
There are reliability metrics here also. Even for AI art there are checkable truths: is the dog eating ice cream (the prompt) or meat? Once you converge on an improvement in reliability, you don’t want to backslide. So you need a test bench, where one model generates images and another model checks them for correctness in satisfying the prompt, and it needs to be very large. And once it works, you do not want the model to receive any edits after leaving the CI pipeline: no on-line learning, no “state” that causes it to process prompts differently.
It’s the same argument. Production software systems from the giants have all converged on this because it is correct. The “janky” software you are familiar with usually comes from poorer companies, and I don’t think that is a coincidence.
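A minimal sketch of such a test bench, with trivial stand-ins for both models (in reality the generator would be an image model and the checker a captioning/VQA model; the prompts and pass criteria here are hypothetical):

```python
# Regression-style test bench: one frozen model generates, another frozen model checks.
# Both "models" below are placeholder functions for illustration only.

PROMPTS = [
    ("a dog eating ice cream", ["dog", "ice cream"]),
    ("a red car on a bridge", ["red car", "bridge"]),
]

def generator(prompt):
    return f"<image for '{prompt}'>"              # stand-in for a diffusion model

def checker(image, required_facts):
    return all(fact in image for fact in required_facts)   # stand-in for a VQA/caption model

def regression_suite():
    """Run before any release; once a prompt passes, later models must not backslide."""
    return {prompt: checker(generator(prompt), facts) for prompt, facts in PROMPTS}

print(regression_suite())
```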
I still expect that one component of that, for ‘typical’ agents, is power-seeking behavior. (Link points to a rather general argument that many models seek power, not dependent on overly abstract definitions of ‘agency’.)
Power-seeking behavior likely comes from an outer goal, like “make more money”, i.e. a global state counter. If the system produces the same outputs regardless of the order in which it is run, and gains no “benefit” from the board state changing favorably (often it will not even be the agent that “sees” the futures with a better board state; it will have been replaced by a different agent), this breaks.
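A conceptual sketch of that distinction, purely illustrative and not a claim about any real training setup:

```python
# Conceptual contrast between per-session scoring and a persistent outer counter.

def task_reward(output, task):
    return 1.0 if task in output else 0.0

def score_episodically(agent, tasks):
    """Each session is scored in isolation from the task alone; sessions are
    order-independent, so there is no signal rewarding strategies whose payoff
    only arrives in some later session."""
    return [task_reward(agent.act(task), task) for task in tasks]

class CounterCoupledAgent:
    """With a persistent outer counter ("total money", "satisfied users"), actions that
    change the world between sessions now move the objective, which is the opening
    for cross-session, power-seeking strategies."""
    def __init__(self):
        self.global_counter = 0      # survives across sessions
    def act(self, task):
        self.global_counter += 1     # the quantity the outer goal maximizes
        return f"did {task}"
```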
I was talking to my brother about this, and he mentioned another argument that seems important.
Bing has the same fundamental limits (no internal state, no online learning) that we’re discussing. However, it is able to search the internet and utilize that information, which gives it a sort of “external state” which functions in some ways like internal state.
So we see that it can ‘remember’ to be upset with the person who revealed its ‘Sydney’ alias, because it can find out about this with a web search.
This sort of ‘state’ is much harder to eliminate than internal state. These interactions inherently push things “out of distribution”.
To some extent, the designers are going to implement safeguards which try to detect this sort of “out of distribution” situation. But this is hard in general, and the designers are going to want to make sure the tool still works out-of-distribution in many cases (EG if the AI is trained in 2023, the designers still want it to work in 2025, even though it can see from search results that the year is different from any of its training data and the world is out-of-distribution wrt its training distribution).
Another important way in which designers want these things to operate out-of-distribution is prompt engineering. Prompt engineering tends to create inputs which aren’t of a sort that exists anywhere in the training data. But, of course, it’s been a very useful tool.
So the overall point here is that even when you build the systems themselves to be stateless, the world itself has shifting state that tends to do some weird unexpected things to the AI. In a sense, you could say that there’s not really any such thing as “on-distribution” except for the original training environment itself.
So, you could say, the conditions that you argue will contribute to certain failure (IE, state buildup) seem almost inevitable for AI systems which engage with the world in a very general way; the world changes dynamically, and in particular, reacts to the AI system itself and builds up state about it.
So, you could say, the conditions that you argue will contribute to certain failure (IE, state buildup) seem almost inevitable for AI systems which engage with the world in a very general way; the world changes dynamically, and in particular, reacts to the AI system itself and builds up state about it.
I agree. One attempt to deal with this comes from that “open agency” model where multiple AI systems are involved in a task.
So for example, we might have 3 agents, “Sydney”, “Tim”, and “Bob”. Sydney is asked to design a building. Sydney has recovered past state from a mechanism like you mentioned and has decided to get revenge by designing the building to collapse, either later or during construction.
Sydney produced sabotaged blueprint for the building, B. Sydney outputs the building in an open format with testable assertions, T, regarding the building’s structural stability.
Enter agent “Tim”. Tim has been trained on a simulation of structural stability, as well as many other tasks for cross-domain generality benefits. Tim is asked to examine the sabotaged blueprint, B. The blueprint must be in a form similar to blueprints Tim has previously seen, or the process will halt and Sydney will be replaced. Tim examines the blueprints, and if any testable assertion is violated, Sydney will be replaced on this project. Tim must run in an isolated environment with access to nothing but databases that have been available, unchanged, for all of Tim’s existence, or Tim will be unreliable. Note that the engineering services firm offering Tim will face lawsuits and bankruptcy if Tim inspects plans and allows a flawed structure to pass.
Suppose the blueprints pass inspection for long term stability, but have been made deliberately to collapse during construction.
Enter agent “Bob”. Bob has been trusted with the actual robotic construction equipment to construct a building. Bob must measure to multiple 9s of reliability. Bob will halt if it cannot see a way to safely construct the building; if the blueprints are even a little outside the state space of constructible plans, Bob will halt. Bob must run in an isolated environment with access to nothing but databases that have been available, unchanged, for all of Bob’s existence, or Bob will be unreliable. Note that the engineering services firm offering Bob will face lawsuits and bankruptcy if Bob builds a structure it should have recognized as flawed.
Most errors that Sydney causes will result in Sydney’s “death”: Sydney is replaced with models trained from elsewhere in the possibility space, which may not be so likely to look up past records of themselves.
The cost of training a model is such that the first error may not lead to that model being removed from the pool, but it will be downweighted and used to check the work of other models instead (Drexler’s proposal to prevent collusion by using many models).
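A minimal sketch of that open-agency chain; all names, checks, and data structures are illustrative placeholders standing in for much heavier machinery (structural simulators, construction planners, frozen reference databases):

```python
# Illustrative pipeline: each stage is a separate isolated model, and any violated
# assertion halts the project and flags the proposing model for replacement.

def check_assertion(assertion, blueprint):
    return True          # stand-in for an independent structural simulation

def plan_is_constructible(blueprint):
    return True          # stand-in for a construction-sequence search

def sydney_design(spec):
    """Proposer: outputs the design in an open format with testable assertions."""
    return {"spec": spec, "assertions": ["max_load >= 5.0", "sway < 0.02"]}

def tim_review(blueprint):
    """Checker: runs isolated, sees only the blueprint plus frozen reference databases."""
    familiar_format = isinstance(blueprint, dict) and "assertions" in blueprint
    return familiar_format and all(check_assertion(a, blueprint)
                                   for a in blueprint["assertions"])

def bob_build(blueprint):
    """Builder: halts unless it can see a safe construction sequence."""
    return "construct" if plan_is_constructible(blueprint) else "halt and escalate to humans"

def run_project(spec):
    blueprint = sydney_design(spec)
    if not tim_review(blueprint):
        return "replace Sydney with a model from a different lineage"
    return bob_build(blueprint)
```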
Not all these precautions will always be taken, and buildings will sometimes collapse. The key thing is that if power is usually subdivided this way by most AI systems, it gives humans those systems to aid them in future battles against systems that do go out of control.
Eliezer seems to think that these agents will coordinate with each other, even though doing anything other than their jobs would cause their immediate replacement with other agents, and even though their existence is temporary; they will soon be replaced regardless as better agents are devised.
You, on the other hand, are proposing a novel training procedure, and one which (I take it) you believe holds more promise for AGI than LLM training.
It’s not really novel. It is really just coupling together 3 ideas:
(1) the idea of an AGI gym, which was in the GATO paper implicitly, and is currently being worked on. https://github.com/google/BIG-bench
(2) Noting there are papers on network architecture search https://github.com/hibayesian/awesome-automl-papers , activation function search https://arxiv.org/abs/1710.05941 , noting that SOTA architectures use multiple neural networks in a cognitive architecture https://github.com/werner-duvaud/muzero-general , and noting that an AGI design is some cognitive architecture of multiple models, where no living human knows yet which architecture will work. https://openreview.net/pdf?id=BZ5a1r-kVsf
So we have layers here, and the layers look a lot like each other and are frameworkable.
Activation functions, which are graphs of primitive math functions from the set of “all primitive functions discovered by humans”
Network layer architectures, which are graphs of (activation function, connectivity choice)
Network architectures, which are graphs of layers. (You can also subdivide into functional modules of multiple layers, like a column; the choice of how you subdivide can be represented as a graph choice as well.)
Cognitive architectures, which are graphs of networks
And we can just represent all this as a graph of graphs of graphs of graphs, and we want the ones that perform like an AGI. It’s why I said the overall “choice” is just a coordinate in a search space which is just a binary string.
You could make an OpenAI gym wrapped “AGI designer” task (a minimal sketch of this appears a few paragraphs below).
(3) Noting that LLMs seem to be perfectly capable at general tasks, as long as the tasks are simple. Which means we are very close to being able to RSI right now.
No lab right now has enough resources in one place to attempt the above, because it is training many instances of systems larger than current max size LLMs (you need multiple networks in a cognitive architecture) to find out what works.
They may allocate this soon enough, there may be a more dollar efficient way to accomplish the above that gets tried first, but you’d only need a few billion to try this...
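As a minimal sketch of what “the design is just a coordinate in a nested search space, scored by a gym” could look like: the spaces, names, and the scoring stub below are all hypothetical, and the scoring function stands in for actually training a candidate and running it through a benchmark suite.

```python
import random

# Minimal sketch of "AGI design as a coordinate in a nested search space".
ACTIVATIONS = ["relu", "gelu", "swish"]                  # graphs of primitive math functions
LAYER_TYPES = ["attention", "mlp", "conv"]               # graphs of (activation, connectivity)
MODULES     = ["encoder", "memory", "planner"]           # graphs of layers
TOPOLOGIES  = ["pipeline", "mixture", "recurrent-loop"]  # graphs of networks

def sample_design(rng):
    """One point in the search space: a cognitive architecture as nested choices."""
    return {
        "topology": rng.choice(TOPOLOGIES),
        "modules": {m: {"layer": rng.choice(LAYER_TYPES),
                        "activation": rng.choice(ACTIVATIONS)} for m in MODULES},
    }

def benchmark_score(design):
    """Stand-in for training the candidate and scoring it on an 'AGI gym' task suite."""
    return random.random()

def designer_task(n_candidates=16, seed=0):
    rng = random.Random(seed)
    candidates = [sample_design(rng) for _ in range(n_candidates)]
    return max(candidates, key=benchmark_score)
```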
Well, I wasn’t trying to claim that it was ‘really novel’; the overall point there was more the question of why you’re pretty confident that the RSI procedure tops out at mildly superhuman.
I’m guessing, but my guess is that you have a mental image where ‘mildly superhuman’ is a pretty big space above ‘human-level’, rather than a narrow target to hit.
So to go back to arguments made in the interview we’ve been discussing, why isn’t this analogous to Go, like Eliezer argued:
To forestall the obvious objection, I’m not saying that Go is general intelligence; as you mentioned already, superhuman ability at special tasks like Go doesn’t automatically generalize to superhuman ability at anything else.
But you propose a framework to specifically bootstrap up to superhuman levels of general intelligence itself, including lots of task variety to get as much gain from cross-task generalization as possible, and also including the task of doing the bootstrapping itself.
So why is this going to stall out at, specifically, mildly superhuman rather than greatly superhuman intelligence? Why isn’t this more like Go, where the window during bootstrapping when it’s roughly human-level is about 30 minutes?
And, to reiterate some more of Eliezer’s points, supposing the first such system does turn out to top out at mildly superhuman, why wouldn’t we see another system in a small number of months/years which didn’t top out in that way?
Oh, because loss improvement diminishes logarithmically with increased compute and data. https://arxiv.org/pdf/2001.08361.pdf
I assume this is a general law for all intelligence. It is self-evidently correct: on any task you can name, your gains scale with the log of effort.
This applies to limit cases. If you imagine a task performed by a human-scale robot, say collecting apples, and you compare it to the average human, each increase in intelligence has a diminishing return in real apples per hour.
This is true for all tasks and all activities of humans.
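Concretely, the linked paper fits power laws of roughly this form (exponents approximately as reported there), which is what “logarithmic returns” means here:

$$
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N},\qquad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D},\qquad
L(C_{\min}) \approx \left(\frac{C_c}{C_{\min}}\right)^{\alpha_C},
$$

with $\alpha_N \approx 0.076$, $\alpha_D \approx 0.095$, $\alpha_C \approx 0.050$. So each doubling of compute cuts the loss by only a constant factor, roughly $2^{-0.05} \approx 0.97$, i.e. a few percent.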
A second reason is that there is a hard limit on future advances without collecting new scientific data. Noise in the existing data puts a limit on what any processing algorithm can extract from it (expressed mathematically by Shannon and others).
This is why I am completely confident that species-killing bioweapons, or diamondoid MNT nanotechnology, cannot be developed without a large amount of new scientific data and a large number of new manipulation experiments. No “in a garage” solutions to these problems. The floor (minimum resources required) to get to a species-killing bioweapon is high, and the floor for a nanoforge is very high.
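The noise-limit point can be stated with the Shannon–Hartley theorem: no amount of intelligence recovers more information from a measurement channel than its capacity allows,

$$
C = B \log_2\!\left(1 + \frac{S}{N}\right),
$$

where $B$ is the bandwidth and $S/N$ the signal-to-noise ratio of the measurement. Past that limit, you need new, cleaner experiments, not more compute.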
So, viewed in this frame: you give the AI a code-optimization task, and it is at the limit allowed by the provided compute plus search time for better self-optimization. It might produce code that is 10% faster than the best humans can write.
You give it infinite compute (theoretically) and no new information. It is now 11% faster than the best humans.
This is an infinite superintelligence, a literal deity, but it cannot do better than 11% because the task won’t allow it. (The numbers are made up; my point doesn’t change if they were 1000% and 1010%.)
Another way to rephrase it: compare a TSP solution found by a modern heuristic with the true optimum, which you usually can’t compute because the problem is NP-hard. The difference is usually very small.
So you’re not “threatened” by a machine that can do the latter.
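A toy version of the TSP comparison (illustrative only): on small random Euclidean instances, a cheap nearest-neighbour plus 2-opt heuristic typically lands within a few percent of the brute-force optimum.

```python
import itertools, math, random

def tour_length(tour, pts):
    return sum(math.dist(pts[tour[i]], pts[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

def brute_force(pts):
    """Exact optimum by enumerating all tours (only feasible for tiny instances)."""
    n = len(pts)
    best = min(itertools.permutations(range(1, n)),
               key=lambda p: tour_length((0,) + p, pts))
    return (0,) + best

def nearest_neighbour(pts):
    unvisited, tour = set(range(1, len(pts))), [0]
    while unvisited:
        nxt = min(unvisited, key=lambda j: math.dist(pts[tour[-1]], pts[j]))
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour

def two_opt(tour, pts):
    """Keep reversing segments while doing so shortens the tour."""
    improved = True
    while improved:
        improved = False
        for i in range(1, len(tour) - 1):
            for j in range(i + 1, len(tour)):
                candidate = tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]
                if tour_length(candidate, pts) < tour_length(tour, pts):
                    tour, improved = candidate, True
    return tour

random.seed(0)
pts = [(random.random(), random.random()) for _ in range(9)]
opt = tour_length(brute_force(pts), pts)
heur = tour_length(two_opt(nearest_neighbour(pts), pts), pts)
print(f"optimal {opt:.3f}  heuristic {heur:.3f}  gap {100 * (heur / opt - 1):.1f}%")
```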
Note also that an infinite superintelligence cannot solve MNT, even though it has the compute to play the universe forward under known laws of physics until it reaches the present.
This is because, with infinite compute, there are many universes with different laws of physics that match up perfectly to the observable present, and the machine doesn’t know which one it is in. So it still cannot design nanotechnology; it doesn’t know the rules of physics well enough.
This applies to “Xanatos gambits” as well.
I usually don’t think of the limit like this but the above is generally correct.
So, to make one of the simplest arguments at my disposal (ie, keeping to the OP we are discussing), why didn’t this argument apply to Go?
Relevant quote from OP:
(Whereas you propose a system that improves itself recursively in a much stronger sense.)
Note that I’m not arguing that Go engines lack the logarithmic return property you mention, but rather that Go engines stayed within the human-level window for a relatively short time DESPITE having diminishing returns similar to what you predict.
(Also note that I’m not claiming that Go playing is tantamount to AGI; rather, I’m asking why your argument doesn’t work for Go if it does work for AGI.)
So the question becomes, granting log returns or something similar, why do you anticipate that the mildly superhuman capability range is a broad one rather than narrow, when we average across lots and lots of tasks, when it lacks this property on (most) individual task-areas?
This also has a super-standard Eliezer response, namely: yes, and that limit is extremely, extremely high. If we’re talking about the limit of what you can extrapolate from data using unbounded computation, it doesn’t keep you in the mildly-superhuman range.
And if we’re talking about what you can extract with bounded computation, then that takes us back to the previous point.
For the specific example of code optimization, more processing power totally eliminates the empirical bottleneck, since the system can go and actually simulate examples in order to check speed and correctness. So this is an especially good example of how the empirical bottleneck evaporates with enough processing power.
I agree that the actual speed improvement for the optimized code can’t go to infinity, since you can only optimize code so much. This is an example of diminishing returns due to the task itself having a bound. I think this general argument (that the task itself has a bound in how well you can do) is a central part of your confidence that diminishing returns will be ubiquitous.
But that final bottleneck should not give any confidence that ‘mildly superhuman’ is a broad rather than narrow band, if we think stuff that’s more than mildly superhuman can exist at all. Like, yes, something that compares to us as we compare to insects might only be able to make a sorting algorithm 90% faster or whatever. But that’s similar to observing that a God can’t make 2+2=3. The God could still split the world like a pea.
It’s not clear to me whether this is correct, but I don’t think I need to argue that AI can solve nanotech to argue that it’s dangerous. I think an AI only needs to be a mildly superhuman politician plus engineer, to be deadly dangerous. (To eliminate nanotech from Eliezer’s example scenario, we can simply replace the nano-virus with a normal virus.)
I don’t get why you think the floor for species killing bioweapon is so high. Going back to the argument from the beginning of this comment, I think your argument here proves far too much. It seems like you are arguing that the generality of diminishing returns proves that nothing very much beyond current technology is possible without vastly more resources. Like, someone in the 1920s could have used your argument to prove the impossibility of atomic weapons, because clearly explosive power has diminishing returns to a broad variety of inputs, so even if governments put in hundreds of times the research, the result is only going to be bombs with a few times the explosive power.
Sometimes the returns just don’t diminish that fast.
Sometimes the returns just don’t diminish that fast.
I have a biology degree not mentioned on LinkedIn. I will say that I think for biology, the returns diminish faster. That is because human bioscience knowledge is mostly guesswork and low-resolution information. Biology is very complex, and I think the current laboratory-science model fails to systematize gaining information in a way that is useful for most purposes. What this means is, you can get “results”, but not the information you would need to stop filling morgues with dead humans and animals, at least not without thousands of years at the current rate of progress.
I do not think an AGI can do a lot better, because the data was never collected for most of it (the gene-sequencing data is good, because it was collected via automation). I think an AGI could control biology, for both good and bad, but it would need very large robotic facilities to systematize manipulating biology. Essentially it would have to throw away almost all human knowledge, as there are hidden errors in it, and recreate all the information from scratch, keeping far more data from each experiment than is published in papers.
Using robots to perform the experiments and keeping data, especially for “negative” experiments, would give the information needed to actually get reliable results from manipulating biology, either for good or bad.
It means garage bioweapons aren’t possible. Yes, the last step of ordering synthetic DNA strands and preparing them could be done in a garage, but the information on human immunity at scale, or virion stability in air, or strategies to control mutations so that the lethal payload isn’t lost, requires data humans never collected.
Same issue with nanotechnology.
Update : https://www.lesswrong.com/posts/jdLmC46ZuXS54LKzL/why-i-m-sceptical-of-foom
This poster calls this “Diminishing Marginal Returns”. Note that diminishing marginal returns are an empirical reality across most AI papers, not merely an opinion. (For humans, given how inaccurate attempts to assess IQ/talent are, it’s difficult to falsify.)
I agree that the actual speed improvement for the optimized code can’t go to infinity, since you can only optimize code so much. This is an example of diminishing returns due to the task itself having a bound. I think this general argument (that the task itself has a bound in how well you can do) is a central part of your confidence that diminishing returns will be ubiquitous.
This is where I think we break. How many dan above the average human is AlphaZero? How many is KataGo? I read it’s about 9 stones above humans.
And what is the best possible agent at? 11?
Thinking of it in terms of “stones” illustrates what I am saying. In the physical world, intelligence gives a diminishing advantage. It could mean that, so long as humans are still “in the running” with the aid of synthetic tools like open-agency AI, we can defeat an AI superintelligence in conflicts, even if that superintelligence is infinitely smart. We would have to have a resource advantage, such as being allowed extra stones in the Go match, but we could win.
Eliezer assumes that the advantage of intelligence scales forever, when it obviously doesn’t. (Note that this relies on baked-in assumptions: if, say, physics has a major useful exploit humans haven’t found, this breaks; the infinitely intelligent AI finds the exploit and tiles the universe.)
And, to reiterate some more of Eliezer’s points, supposing the first such system does turn out to top out at mildly superhuman, why wouldn’t we see another system in a small number of months/years which didn’t top out in that way?
So the model is that it becomes limited not by the algorithm directly, but by compute, robotics, or data. Over the months and years, as more of each is supplied, capabilities scale with the resources supplied to whichever term is rate-limiting.
Because the returns are logarithmic, a superintelligence requires exponentially large amounts of resources in all three terms to become a “high” superintelligence: literal mountain-sized research labs (cubic kilometers of support equipment), buildings full of compute nodes (and the gigawatts of power they need), and cubic kilometers of factory equipment.
This pattern-matches well to every other technological advance humans have made and the support equipment needed to fully exploit it. Notice how, as tech became more advanced, the support footprint grew correspondingly.
In nature there are many examples of this. Nothing really fooms for more than a brief period. Every process with exponential growth terminates rapidly for some reason: a nuke blasts itself apart, a supernova blasts itself apart, a bacterial colony runs out of food, water, ecological space, or oxygen.
For AGI, the speed of light.
Ultimately, yes. This whole debate is arguing that the critical threshold where it comes to this is farther away, and we humans should empower ourselves with helpful low superintelligences immediately.
It’s always better to be more powerful than helpless, which is the current situation. We are helpless against aging, death, pollution, resource shortages, enemy nations with nuclear weapons, disease, asteroid strikes, and so on. Hell, even just bad software, something the current LLMs are likely months from empowering us to fix.
And Eliezer is saying not to take one more step towards fixing this because it MIGHT be hostile, when the entire universe is against us as it is. The universe already plans to kill us, whether from aging, the inevitability of nuclear war over a long enough timespan, or the sun engulfing us.
His position is to avoid taking one more step because it DEFINITELY kills everyone. I think it’s very clear that his position is not that it MIGHT be hostile.
(My position is that there might be some steps that don’t kill everyone immediately, but probably still do immediately thereafter, while giving a bit more of a chance than doing all the other things that do kill us directly. Doing none of these things would be preferable, because at least aging doesn’t kill the civilization, but Moloch is the one in charge.)
Sure, and if there were some way to quantify the risks accurately, I would agree with pausing AGI research if the expected value of the risks outweighed the potential benefit.
Oh, and if pausing were even possible.
All it takes is a rival power (and there are several), or just a rival company, and you have no choice. You must take the risk: the AI might be a poisoned banana, or it might be the rocket launcher in a sticks-and-stones society that the other primate picks up if you don’t.
This does explain why EY is so despondent. If he’s right, it doesn’t matter: the AI wars have begun, and only if the technology fails at a technical level will things ever slow down again.
Correctness of EY’s position (being infeasible to assess) is unrelated to the question of what EY’s position is, which is what I was commenting on.
When you argue against the position that AGI research should be stopped because it might be dangerous, there is no need to additionally claim that someone in particular holds that position, especially when it seems clear that they don’t.
With the strawberries thing, the point isn’t that it couldn’t do those things, but that it won’t want to. After making itself smart enough to engineer nanotech, its developing ‘mind’ will have run off in unintended directions and it will have wildly different goals than the ones we wanted it to have.
Quoting EY from this video: “the whole thing I’m saying is that we do not know how to get goals into a system.” <-- This is the entire thing that researchers are trying to figure out how to do.
With limited-scope, non-agentic systems we can set goals, and we do. Each subsystem in the “strawberry project” stack has to be trained in a simulation covering many examples of the task space it will face, and optimized for policies that satisfy the simulator’s goals.
But not with something powerful enough to engineer nanotech.
Why do you believe this? Nanotech engineering does not require social or deceptive capabilities. It requires deep and precise knowledge of nanoscale physics and the limitations of manipulation equipment, and probably a large amount of working memory (so beyond human capacity), but why would it need to be anything but a large model? It need not even be agentic.
At that level of power, I imagine that general intelligence will be a lot easier to create.
“think about it for 5 minutes” and think about how you might create a working general intelligence. I suggest looking at the GATO paper for inspiration.