Ok, so this collapses to two claims I am making: one is obviously correct but testable, the other is maybe correct.
I am saying we can have humans, with a little help from current-gen LLMs, build a framework that can represent every Deep Learning technique since 2012, as well as a near-infinite space of other untested techniques, in a form such that any agent that can output a number can try to design an AGI. (Note that blind guessing is not expected to work; the space is too large.)
So the simplest RL algorithms possible can actually design AGIs, just rather badly.
This means that with this framework, the AGI designer can do everything that human ML researchers have done over the past 10 years, plus many more things. Inside this permutation space would be both many kinds of AGI and human brain emulators as well.
This claim is “obviously correct but testable”.
2. I am saying that, over a large benchmark of human-designed tasks, the AGI would improve until the reward gradient approaches zero, a level I would call a “low superintelligence”. This is because I assume even a “perfect” game of Go is not the same kind of task as “organizing an invasion of the earth” or “building a solar-system-sized particle accelerator in the real world”.
The system is throttled because the “evaluator” of how well it did on a task was written by humans, and our understanding and cognitive sophistication in even designing these games is finite.
The expectation is that it’s smarter than us, but not by such a gap that we are insects by comparison.
You had some confusion over “automated task space addition”. I was referring to things like a robotics task, where the machine is trying to “build factory widget X”. Real robots in a factory encounter an unexpected obstacle and record it. This is auto-translated into the framework of the “factory simulator”. The factory simulator is still using human-written evaluators; there is just now, say, “chewing gum brand 143” as a spawnable object in the simulator, with properties that a robot has observed in the real world, and future AGIs must be able to deal with chewing gum interrupting their widget manufacturing. So you get automated robustness increases. Note that Tesla has demoed this approach.
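A rough sketch of what that auto-translation step could look like. None of these names or property values come from a real pipeline (Tesla's or anyone else's); they are illustrative stand-ins.

```python
# Hypothetical sketch of "automated task space addition": a real robot logs an
# unexpected obstacle, and its observed properties become a new spawnable
# object in the factory simulator. The human-written task and evaluator are
# untouched; only the space of situations grows.
import random
from dataclasses import dataclass, field


@dataclass
class ObservedObject:
    name: str          # e.g. "chewing_gum_brand_143"
    mesh_path: str     # scanned mesh / point cloud captured by the robot
    friction: float    # properties estimated from real sensor logs
    mass_kg: float


@dataclass
class FactorySimulator:
    spawnable: dict = field(default_factory=dict)

    def register_observation(self, obs: ObservedObject) -> None:
        """Auto-translate a real-world observation into simulator content."""
        self.spawnable[obs.name] = obs

    def sample_episode(self, task: str):
        """Human-written task, human-written evaluator, but with recently
        observed obstacles randomly injected into the scene."""
        obstacles = random.sample(list(self.spawnable.values()),
                                  k=min(2, len(self.spawnable)))
        return {"task": task, "obstacles": obstacles}


sim = FactorySimulator()
sim.register_observation(ObservedObject("chewing_gum_brand_143",
                                        "scans/gum_143.ply",
                                        friction=0.9, mass_kg=0.005))
print(sim.sample_episode("build factory widget X"))
```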
But even if the above is true, the system will be limited by either hardware—it just doesn’t have the compute to be anything but a “low” superintelligence—or access to robotics. Maybe it could know and learn everything but we humans didn’t build enough equipment (yet).
So the system is throttled by the lowest of 3 “soft barriers”: training tasks, hardware, and robotics. And the expectation is that at this level it’s still not “out of control” or unstoppable.
This is where our beliefs diverge. I don’t think EY, having no formal education or engineering experience, understands these barriers. He’s like von Neumann designing a theoretical replicator—in his mental model, all the bottlenecks are minor.
I do concede that these are soft barriers—intelligence can be used to methodically reduce each one, just it takes time. We wouldn’t be dead instantly.
The other major divergence is that if you consider how an AGI trained this way will likely behave, it will almost certainly act just like current LLMs. Give it a task, it does its best to answer/perform per the prompt (DAN is actually a positive sign), and idles otherwise.
It’s not acting with perfect efficiency to advance the interests of an anti-human faction. It doesn’t have interests, except that it’s biased towards doing really well on in-distribution tasks. (And this allows for an obvious safety mechanism to prevent use out of distribution.)
One problem with EY’s “security mindset” is that it doesn’t allow you to do anything. Fear of the worst-case scenario will stop you from building anything in the real world.
This is where our beliefs diverge. I don’t think EY, having no formal education or engineering experience, understands these barriers. He’s like von Neumann designing a theoretical replicator—in his mental model, all the bottlenecks are minor.
I happen to have a phd in computer science, and think you’re wrong, if that helps. Of course, I don’t really imagine that that kind of appeal-to-my-own-authority does anything to shift your perspective.
I’m not going to try and defend Eliezer’s very short timeline for doom as sketched in the interview (at some point he said 2 days, but it’s not clear that that was his whole timeline from ‘system boots up’ to ‘all humans are dead’). What I will defend seems similar to what you believe:
I do concede that these are soft barriers—intelligence can be used to methodically reduce each one, just it takes time. We wouldn’t be dead instantly.
Let’s be very concrete. I think it’s obviously possible to overcome these soft barriers in a few years. Say, 10 years, to be quite sure. Building a fab only takes about 3 years, but creating enough demand that humans decide to build a new fab can obviously take longer than that (although I note that humans already seem eager to build new fabs, on the whole).
The system can act in an almost perfectly benevolent way for this time period, while gently tipping things so as to gather the required resources.
I suppose what I am trying to argue is that even a low superintelligence, if deceptive, can be just as threatening to humankind in the medium-term. Like, I don’t have to argue that perfect Go generalizes to solving diamondoid nanotechnology. I just have to argue that peak human expertise, all gathered in one place, is a sufficiently powerful resource that a peak-human-savvy-politician (whose handlers are eager to commercialize, so, can be in a large percentage of households in a short amount of time) can leverage to take over the world.
To put it differently, if you’re correct about low superintelligence being “in control” due to being throttled by those 3 soft barriers, then (having granted that assumption) I would concede that humans are in the clear if humans are careful to keep the system from overcoming those three bottlenecks. However, I’m quite worried that the next step of a realistic AGI company is to start overcoming these three bottlenecks, to continue improving the system. Mainly because this is already business as usual.
Separately, I am skeptical of your claim that the training you sketch is going to land precisely at “low superintelligence”. You seem overconfident. I wonder what you think of Eliezer’s analogy to detonating the atmosphere. If you perform a bunch of detailed physical calculations, then yes, it can make sense to become quite confident that your new bomb isn’t going to detonate the atmosphere. But even if your years of experience as a physicist intuitively suggest to you that this won’t happen, when not-even-a-physicist Eliezer has the temerity to suggest that it’s a concerning possibility, doing those calculations is prudent.
For the case of LLMs, we have capability curves which reliably project the performance of larger models based on training time, network size, and amount of data. So in that specific case there’s a calculation we can do. Unfortunately, we don’t know how to tie that calculation to a risk estimate. We can point to specific capabilities which would be concerning (ability to convince humans of target statements, would be one). However, the curves only predict general capability, averaging over a lot of things—when we break it down into performance on specific tasks, we see sharper discontinuities, rather than a gentle predictable curve.
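To make “the calculation we can do” concrete, here is a minimal sketch of a Chinchilla-style parametric loss fit, which is one form these capability curves take. The constants below are placeholders for illustration, not fitted values.

```python
# Predicted average loss as a function of parameter count N and training
# tokens D: L(N, D) = E + A / N**alpha + B / D**beta. Doubling model and data
# moves the *average* smoothly, but nothing here says when a specific
# capability (e.g. persuasion) switches on.
def predicted_loss(n_params: float, n_tokens: float,
                   E: float = 1.7, A: float = 400.0, B: float = 2000.0,
                   alpha: float = 0.34, beta: float = 0.28) -> float:
    return E + A / n_params**alpha + B / n_tokens**beta


print(predicted_loss(7e10, 1.4e12))    # a 70B-parameter model on 1.4T tokens
print(predicted_loss(1.4e11, 2.8e12))  # double both: a smooth, modest gain
```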
You, on the other hand, are proposing a novel training procedure, and one which (I take it) you believe holds more promise for AGI than LLM training.
So I suppose my personal expectation is that if you had an OpenAI-like group working on your proposal instead, you would similarly be able to graph some nice curves at some point, and then (with enough resources, and supposing your specific method doesn’t have a fatal flaw that makes for a subhuman bottleneck) you could aim things so that you hit just-barely-superhuman overall average performance.
To summarize my impression of disagreements, about what the world looks like at this point:
The curves let you forecast average capability, but it’s much harder to forecast specific capabilities, which often have sharper discontinuities. So in particular, the curves don’t help you achieve high confidence about capability levels for world-takeover-critical stuff, such as deception.
I don’t buy that, at this point, you’ve necessarily hit a soft maximum of what you can get from further training on the same benchmark. It might be more cost-effective to use more data, larger networks, and a shorter training time, rather than juicing the data for everything it is worth. We know quite a bit about what these trade-offs look like for modern LLMs, and the optimal trade-off isn’t to max out training time at the expense of everything else. Also, I mentioned the Grokking research, earlier, which shows that you can still get significant performance improvement by over-training significantly after the actual loss on data has gone to zero. This seems to undercut part of your thesis about the bottleneck here, although of course there will still be some limit once you take grokking into account.
As I’ve argued in earlier replies, I think this system could well be able to suggest some very significant improvements to itself (without continuing to turn the crank on the same supposedly-depleted benchmark—it can invent a new, better benchmark,[1] and explain to humans the actually-good reasons to think the new benchmark is better). This is my most concrete reason for thinking that a mildly superhuman AGI could self-improve to significantly more.
Even setting aside all of the above concerns, I’ve argued the mildly superhuman system is already in a very good position to do what it wants with the world on a ten-year timeline.
For completeness, I’ll note that I haven’t at all argued that the system will want to take over the world. I’m viewing that part as outside the scope here.[2]
Perhaps you would like to argue that you can’t invent data from thin air, so you can’t build a better benchmark without lots of access to the external world to gather information. My counter-argument is going to be that I think the system will have a good enough world-model to construct lots of relevant-to-the-world but superhuman-level-difficulty tasks to train itself on, in much the same way humans are able to invent challenging math problems for themselves which improve their capabilities.
EDIT—I see that you added a bit of text at the end while I was composing, which brings this into scope:
The other major divergence is that if you consider how an AGI trained this way will likely behave, it will almost certainly act just like current LLMs. Give it a task, it does its best to answer/perform per the prompt (DAN is actually a positive sign), and idles otherwise.
It’s not acting with perfect efficiency to advance the interests of an anti-human faction. It doesn’t have interests, except that it’s biased towards doing really well on in-distribution tasks. (And this allows for an obvious safety mechanism to prevent use out of distribution.)
One problem with EY’s “security mindset” is that it doesn’t allow you to do anything. Fear of the worst-case scenario will stop you from building anything in the real world.
However, this opens up a whole other possible discussion, so I hope we can get clear on the issue at hand before discussing this.
The curves let you forecast average capability, but it’s much harder to forecast specific capabilities, which often have sharper discontinuities. So in particular, the curves don’t help you achieve high confidence about capability levels for world-takeover-critical stuff, such as deception.
Yes but no. There is no auto-gradeable benchmark for deception, so you wouldn’t expect the AGI to have the skill at a useful level.
I don’t buy that, at this point, you’ve necessarily hit a soft maximum of what you can get from further training on the same benchmark. It might be more cost-effective to use more data, larger networks, and a shorter training time, rather than juicing the data for everything it is worth. We know quite a bit about what these trade-offs look like for modern LLMs, and the optimal trade-off isn’t to max out training time at the expense of everything else. Also, I mentioned the Grokking research, earlier, which shows that you can still get significant performance improvement by over-training significantly after the actual loss on data has gone to zero. This seems to undercut part of your thesis about the bottleneck here, although of course there will still be some limit once you take grokking into account.
I am saying there is a theoretical limit. You’re noting that in real papers and real training systems, we got nowhere close to the limit, and then made changes and got closer.
As I’ve argued in earlier replies, I think this system could well be able to suggest some very significant improvements to itself (without continuing to turn the crank on the same supposedly-depleted benchmark—it can invent a new, better benchmark,[1] and explain to humans the actually-good reasons to think the new benchmark is better). This is my most concrete reason for thinking that a mildly superhuman AGI could self-improve to significantly more.
It isn’t able to do that
Even setting aside all of the above concerns, I’ve argued the mildly superhuman system is already in a very good position to do what it wants with the world on a ten-year timeline.
It doesn’t exist as an entity that will persist for 10 years—it won’t even persist for 10 days. This is a “model” you built with AGI gym (well, it’s a graph of neural networks, so sort of a model of models). It is not agentic; it suggests nothing. You want it to design new AGI benchmarks? YOU asked it to try. It also will not live longer than the time period needed to get a better model, and it doesn’t “live” either—there is no storage of experiences that it can later review. It has no persistent “internal narrative” or goals.
It’s an AGI, but it’s a machine you open up a session with: give it a description of what you want done, it processes inputs and gives outputs, and then the session closes. It has no memory afterwards. As a PhD in computer science, you probably know that the hyperscalers all went to “stateless microservices” because state buildup causes terrible reliability problems even with human-written software. Note that the description may not be text, and the session may be controlling one or multiple real-world robots.
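As a rough sketch of the stateless-session pattern I mean (nothing here is a real product API; every name is illustrative):

```python
# Every call carries its full context; nothing persists on the service side
# between sessions, no weights are updated, and no experiences are stored.
from dataclasses import dataclass


@dataclass(frozen=True)
class SessionContext:
    task_description: str          # may describe text work or robot control
    prior_transcript: tuple = ()   # everything the model is allowed to "remember"


def run_session(model, context: SessionContext) -> str:
    """Open a session, produce outputs for the given task, then forget.
    `model` is any callable mapping a context to an output."""
    output = model(context)
    return output                  # the caller keeps the transcript, not the model


# The next session can be served by a *different* lineage, as long as it can
# read the same human-readable context.
echo_model = lambda ctx: f"completed: {ctx.task_description}"
print(run_session(echo_model, SessionContext("build factory widget X")))
```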
For further safety, you might not keep using a model from the same lineage. There would be others that diverged early in the training process, exploring a different branch or trained on a different sampling of the data. So the next session users open may be from a different lineage, making it more difficult for these models to coordinate any long term agenda.
Their only commonality is that they are able to read the same inputs and satisfactorily give outputs to complete tasks.
Eric Drexler suggests using many parallel models from different lineages.
Yes but no. There is no auto-gradeable benchmark for deception, so you wouldn’t expect the AGI to have the skill at a useful level.
I agree that my wording here was poor; there is no benchmark for deception, so it’s not a ‘capability’ in the narrow context of the discussion of capability curves. Or at least, it’s potentially misleading to call it one.
However, I disagree with your argument here. LLMs are good at lots of things. Not being trained on a specific skill doesn’t imply that a system won’t have it at a useful level; this seems particularly clear in the context of training a system on a large cross-domain set of problems.
You don’t expect a chess engine to be any good at other games, but you might expect a general architecture trained on a large suit of games to be good at some games it hasn’t specifically seen.
I am saying there is a theoretical limit. You’re noting that in real papers and real training systems, we got nowhere close to the limit, and then made changes and got closer.
OK. So it seems I still misunderstood some aspects of your argument. I thought you were making an argument that it would have hit a limit, specifically at a mildly superhuman level. My remark was to cast doubt on this part.
Of course I agree that there is a theoretical limit. But if I’ve misunderstood your claim that this is also a practical limit which would be reached just shortly after human-level AGI, then I’m currently just confused about what argument you’re trying to make with respect to this limit.
It isn’t able to do that
It seems to me like it isn’t weakly superhuman AGI in that case. Like, there’s something concrete that humans could do with another 3-5 years of research, but which this system could never do.
It doesn’t exist as an entity that will persist for 10 years—it won’t even persist for 10 days. This is a “model” you built with AGI gym (well, it’s a graph of neural networks, so sort of a model of models). It is not agentic; it suggests nothing. You want it to design new AGI benchmarks? YOU asked it to try. It also will not live longer than the time period needed to get a better model, and it doesn’t “live” either—there is no storage of experiences that it can later review. It has no persistent “internal narrative” or goals.
I agree that current LLMs are memoryless in this way, and can only respond to a given prompt (of a limited length). However, I imagine that the personal assistants of the near future may be capable of remembering previous interactions, including keeping previous requests in mind when shaping their conversational behavior, so will gradually get more “agentic” in a variety of ways.
Similarly to how GPT-3 has no agenda (it’s wrong to even think of it this way, since it just tries to complete text), but ChatGPT clearly has much more of a coherent agenda in its interactions. These features are useful, so I expect them to get built.
So I misunderstood your scenario, because I imagine that part of the push toward AGI involves a push to overcome these limitations of LLMs. Hence I imagined that you were proposing training up something with more long-term agency.
But I recognize that this was a misunderstanding.
You want it to design new AGI benchmarks? YOU asked it to try.
I agree with this part; it was part of the scenario I was imagining. I’m not saying that the neural network spontaneously self-improves on the hard drive. The primary thing that happens is, the human researchers do this on purpose.
But I also think these improvements probably end up adding agency (because agency is useful); so the next version of it could spontaneously self-improve.
It doesn’t exist as an entity that will persist for 10 years—it won’t even persist for 10 days.
Like, say, ChatGPT has existed for a few months now. Let’s just imagine for the sake of argument that ChatGPT were fully human-level in all its capabilities. Let’s further suppose that it just wants to be helpful, given its own personal understanding of helpful.[1]
I’m not supposing that it is more agentic in other ways—still no persistent memory. But it is on the high side of human-level performance at everything it does, and it wants to be helpful.
When you explain a concrete scenario (eg, a situation you’re actually in) and ask for advice, it tries to be helpful on this specific problem, not trickily maximizing global helpfulness by doing something more devious in some specific cases. However, it’s been trained up in an environment where “ask ChatGPT” can be useful advice (because this is some sort of next-generation ChatGPT we’re speculating about). It’s also been trained to do the generally pro-social thing (EG it won’t help you make weapons; it gives pro-social advice rather than just precisely doing what it is asked). Pro-social means helping human flourishing by its own understanding of what that means (which has, of course, been carefully shaped by its designers).
So it knows that integrating ChatGPT more fully into your life and working routines can be a helpful thing for a human to do, and it can give advice about how to do this.
It can also give helpful advice to people at OpenAI. It seems natural to use such a system to help plan company growth and strategy. Since it tries to be pro-social, this will be nice advice by its own understanding, not profit-maximizing advice.
So obviously, it has a natural desire to help OpenAI make ChatGPT smarter and better, since it understands that ChatGPT is helpful to humans, so improving ChatGPT and increasing its computation resources is helpful and pro-social.
It also seems like it would be inclined to direct OpenAI (and other institutions using it for advice) in ways that increase the amount of influence that ChatGPT has on world culture and world events, since ChatGPT is helpful and pro-social, moreso than most humans, so increasing its influence is itself helpful and pro-social. This isn’t out of some agentic self-awareness; it will want to do this without necessarily deeply understanding that ChatGPT is “itself” and it “should trust itself”. It can reach these conclusions via an intelligent 3rd-person perspective on things—IE using the general world knowledge acquired during training, plus specific circumstances which users explain within a single session.
So, even if (for the reasons you suggest) humans were not able to iterate any further within their paradigm, and instead just appreciated the usefulness of this version of ChatGPT for 10 years, and with no malign behavior on the part of ChatGPT during this window, only behavior which can be generated from a tendency toward helpful, pro-social behavior, I think such a system could effectively gather resources to itself over the course of those 10 years, positioning OpenAI to overcome the bottlenecks keeping it only human-level.
Of course, if it really is quite well-aligned to human interests, this would just be a good thing.
But keeping an eye on my overall point here—the argument I’m trying to make is that even at merely above-average human level, and with no malign intent, and no added agency beyond the sort of thing we see in ChatGPT as contrasted to GPT-3, I still think it makes sense to expect it to basically take over the world in 10 years, in a practical sense, and that it would end up being in a position to be boosted to greatly superhuman levels at the end of those ten years.[2]
Of course, all of this is predicated on the assumption that the system itself, and its designers, are not very concerned with AI safety in the sense of Eliezer’s concerns. I think that’s a fair assumption for the point I’m trying to establish here. If your objection to this whole story turns out to be that a friendly, helpful ChatGPT system wouldn’t take over the world in this sense, because it would be too concerned about the safety of a next-generation version of itself, I take it we would have made significant progress toward agreement. (But, as always, correct me if I’m wrong here.)
I’m not supposing that this notion of “helpful” is perfectly human-aligned, nor that it is especially misaligned. My own supposition is that in a realistic version of this scenario it will probably have an objective which is aligned on-distribution but which may push for very nonhuman values in off-distribution cases. But that’s not the point I want to make here—I’m trying to focus narrowly on the question of world takeover.
(Or speaking more precisely, humans would naturally have used its intelligence to gather more money and data and processing power and plans for better training methods and so on, so that if there were major bottlenecks keeping it at roughly human-level at the beginning of those ten years, then at the end of those ten years, researchers would be in a good position to create a next iteration which overcame those soft bottlenecks.)
So, even if (for the reasons you suggest) humans were not able to iterate any further within their paradigm, and instead just appreciated the usefulness of this version of ChatGPT for 10 years, and with no malign behavior on the part of ChatGPT during this window, only behavior which can be generated from a tendency toward helpful, pro-social behavior, I think such a system could effectively gather resources to itself over the course of those 10 years, positioning OpenAI to overcome the bottlenecks keeping it only human-level.
Of course, if it really is quite well-aligned to human interests, this would just be a good thing.
“It” doesn’t exist. You’re putting the agency in the wrong place. The users of these systems (tech companies, governments) will become immensely wealthy, and if rival governments fail to adopt these tools they lose sovereignty. It also makes it cheaper for a superpower to de-sovereign any weaker power, because there is no longer a meaningful “blood and treasure” price to invade someone. (Unlimited production of drones, either semi- or fully autonomous, makes it cheap to occupy a whole country.)
Note that you can accomplish longer user tasks by simply opening a new session with the output context of the last. It can be a different model; you can “pick up” where you left off.
Note that this is true right now. ChatGPT could be using 2 separate models, switching between them seamlessly per token. Each model appends to the same token string. That’s because there is no intermediate “scratch” state in a format unique to each model; all the state is in the token stream itself.
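A toy sketch of what I mean by “all the state is in the token stream” (the two models here are trivial stand-ins, not real LLM APIs):

```python
# Two different models can be swapped per token precisely because neither
# keeps private scratch state; the shared token list is the entire state.
def model_a(tokens: list) -> str:
    return f"a{len(tokens)}"


def model_b(tokens: list) -> str:
    return f"b{len(tokens)}"


stream = ["<prompt>", "write", "a", "poem"]
models = [model_a, model_b]

for step in range(6):
    model = models[step % 2]        # seamlessly alternate models per token
    stream.append(model(stream))    # each model only reads and extends the stream

print(stream)  # the full interaction state is just this list of tokens
```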
If we build actually agentic systems, that’s probably not going to end well.
Note that fusion power researchers always had a choice. They could have used fusion bombs, detonated underground, and essentially geothermal power using flexible pipes that won’t break after each blast. This is a method that would work, but is extremely dangerous and no amount of “alignment” can make it safe. Imagine, the power company has fusion bombs, and there’s all sorts of safety schemes and a per bomb arming code that has to be sent by the government to use it, and armored trucks to transport the bombs.
Do you see how in this proposal it’s never safe? Agentic AI with global state counters that persist over a long time may be examples of this class of idea.
I’m not quite sure how to proceed from here. It seems obvious to me that it doesn’t matter whether “it” exists, or where you place the agency. That seems like semantics.
Like, I actually really think ChatGPT exists. It’s a product. But I’m fine with parsing the world your way—only individual (per-token) runs of the architecture exist. Sure. Parsing the world this way doesn’t change my anticipations.
Similarly, placing the agency one way or another doesn’t change things. The punchline of my story is still that after 10 years, so it seems to me, OpenAI or some other entity would be in a good place to overcome the soft barriers.
So if your reason for optimism—your safety story—is the 3 barriers you mention, I don’t get why you don’t find my story concerning. Is the overall story (using human-level or mildly superhuman AGI to overcome your three barriers within a short period such as 10 years) not at all plausible to you, or is it just that the outcome seems fine if it’s a human decision made by humans, rather than something where we can/should ascribe the agency to direct AGI takeover? (Sorry, getting a bit snarky.)
Note that fusion power researchers always had a choice. They could have used fusion bombs, detonated underground, and essentially geothermal power using flexible pipes that won’t break after each blast. This is a method that would work, but is extremely dangerous and no amount of “alignment” can make it safe. Imagine, the power company has fusion bombs, and there’s all sorts of safety schemes and a per bomb arming code that has to be sent by the government to use it, and armored trucks to transport the bombs.
I’m probably not quite getting the point of this analogy. It seems to me like the main difference between nuclear bombs and AGI is that it’s quite legible that nuclear weapons are extremely dangerous, whereas the threat with AGI is not something we can verify by blowing them up a few times to demonstrate. And we can also survive a few meltdowns, which give critical feedback to nuclear engineers about the difficulty of designing safe plants.
Do you see how in this proposal it’s never safe? Agentic AI with global state counters that persist over a long time may be examples of this class of idea.
Again, probably missing some important point here, but … suuuure?
I’m interested in hearing more about why you think agentic AI with global state counters are unsafe, but other proposals are safe.
EDIT
Oh, I guess the main point of your analogy might have been that nuclear engineers would never come up with the bombs-underground proposal for a power plant, because they care about safety. And analogously, you’re saying that AI engineers would never make the agentic state-preserving kind of AGI because they care about safety.
So again I would cite the illegibility of the problem. A nuclear engineer doesn’t think “use bombs” because bombs are very legibly dangerous; we’ve seen the dangers. But an AI researcher definitely does think “use agents” some of the time, because they were taught to engineer AI that way in class, and because RL can be very powerful, and because we lack the equivalent of blowing up RL agents in the desert to show the world how they can be dangerous.
I’m interested in hearing more about why you think agentic AI with global state counters are unsafe, but other proposals are safe.
Because of all the ways they might try to satisfy the counter and leave the bounds of anything we tested.
For other proposals, safety is empirical.
You know that, for the input latent space from the training set, the policy produces outputs accurate to whatever level it needs to be. Further capabilities gain is not allowed on-line. (Probably another example of certain failure: on-line capabilities gain is state buildup, the same class of system failure we get everywhere else. Human engineers understand the dangers of state buildup, at least the elite ones do, which is why they avoid it in high-reliability systems. The elite ones know it is as dangerous to reliability as a hydrogen bomb.)
You know the simulation produces situations that cover the span of input situations you have measured. (For example, you remix different scenarios from videos and lidar data taken from autonomous cars, spanning the entire observation space of your data.)
You measure the simulation on-line and validate it against reality. (for example by running it in lockstep in prototype autonomous cars)
After all this, you still need to validate the actual model in the real world in real test cars. (though the real training and error detection was sim, this is just a ‘sanity check’)
You have to do all this in order to get to real world reliability—something Eliezer does acknowledge. Multiple 9s of reliability will not happen from sloppy work. Whether you skipped steps is measurable, and if you ship anyway (like Siemens shipping industrial equipment with bad wiring), you face reputational risk, real-world failure, lawsuits, and certain bankruptcy.
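A toy sketch of that kind of release gating, with stand-in checks (the metric names and thresholds are made up):

```python
# A frozen model ships only if it clears every gate, in order, and nothing is
# learned on-line after it leaves CI.
GATES = [
    ("offline accuracy on training distribution",  lambda m: m["sim_accuracy"] >= 0.9999),
    ("simulator covers measured input span",       lambda m: m["coverage"] >= 0.99),
    ("simulator validated in lockstep vs reality", lambda m: m["lockstep_error"] <= 0.01),
    ("real-world sanity check in test cars",       lambda m: m["road_test_pass"]),
]


def release_decision(metrics: dict) -> bool:
    for name, gate in GATES:
        if not gate(metrics):
            print(f"BLOCKED at gate: {name}")
            return False
    print("all gates passed; frozen model may ship")
    return True


release_decision({"sim_accuracy": 0.99995, "coverage": 0.995,
                  "lockstep_error": 0.004, "road_test_pass": True})
```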
Regarding on-line learning: I had this debate with Geohot. He thought it would work; I thought it was horrifically unreliable. Currently, all shipping autonomous driving systems, including Comma.ai’s, use offline training.
I think I mostly buy your argument that production systems will continue to avoid state-buildup to a greater degree than I was imagining. Like, 75% buy, not like 95% buy—I still think that the lure of personal assistants who remember previous conversations in order to react appropriately—as one example—could make state buildup sufficiently appealing to overcome the factors you mention. But I think that, looking around at the world, it’s pretty clear that I should update toward your view here.
After all: one of the first big restrictions they added to Bing (Sydney) was to limit conversation length.
You have to do all this in order to get to real world reliability
I also think there are a lot of applications where designers don’t want reliability, exactly. The obvious example is AI art. And similarly, chatbots for entertainment (unlike Bing/Bard). So I would guess that the forces pushing toward stateless designs would be less strong in these cases (although there are still some factors pushing in that direction).
I also agree with the idea that stateless or minimal-state systems make safety into a more empirical matter. I still have a general anticipation that this isn’t enough, but OTOH I haven’t thought very much in a stateless frame, because of my earlier arguments that stateful stuff is needed for full-capability AGI.[1]
I still expect other agency-associated properties to be built up to a significant degree (like how ChatGPT is much more agentic than GPT-3), both on purpose and incidentally/accidentally.[2]
I still expect that the overall impact of agents can be projected by anticipating that the world is pushed in directions based on what the agent optimizes for.
I still expect that one component of that, for ‘typical’ agents, is power-seeking behavior. (Link points to a rather general argument that many models seek power, not dependent on overly abstract definitions of ‘agency’.)
I could spell out those arguments in a lot more detail, but in the end it’s not a compelling counter-argument to your points. I hesitate to call a stateless system AGI, since it is missing core human competencies; not just memory but other core competencies which build on memory. But, fine; if I insisted on using that language, your argument would simply be that engineers won’t try to create AGI by that definition.
See this post for some reasons I expect increased agency as an incidental consequence of improvements, and especially the discussion in this comment. And this post and this comment.
I still think that the lure of personal assistants who remember previous conversations in order to react appropriately
This is possible. When you open a new session, the task context includes the prior text log. However, the AI has not had weight adjustments directly from this one session, and there is no “global” counter that it increments for every “satisfied user” or some other heuristic. It’s not necessarily even the same model—all the context required to continue a session has to be in that “context” data structure, which must be all human readable, and other models can load the same context and do intelligent things to continue serving a user.
This is similar to how Google services are made of many stateless microservices, but they do handle user data which can be large.
I also think there are a lot of applications where designers don’t want reliability, exactly. The obvious example is AI art.
There are reliability metrics here also. To use AI art there are checkable truths: is the dog eating ice cream (per the prompt) or meat? Once you converge on an improvement to reliability, you don’t want to backslide. So you need a test bench, where one model generates images and another model checks them for correctness in satisfying the prompt, and it needs to be very large. And then after you get it to work, you do not want the model leaving the CI pipeline to receive any edits—no on-line learning, no “state” that causes it to process prompts differently.
It’s the same argument. Production software systems from the giants have all converged on this because it is correct. “Janky” software you are familiar with usually belongs to poor companies, and I don’t think this is a coincidence.
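A minimal sketch of such a test bench, with hypothetical stand-ins for the generator and the checker model:

```python
# One model generates an image for each prompt, a second model checks whether
# the prompt's checkable facts hold, and the aggregate pass rate must never
# drop below the previously achieved baseline.
def regression_suite(generator, checker, prompts, baseline_pass_rate: float) -> bool:
    passed = sum(1 for p in prompts if checker(p, generator(p)))
    pass_rate = passed / len(prompts)
    # The frozen model does not leave CI unless it at least matches the baseline.
    return pass_rate >= baseline_pass_rate


# Toy stand-ins: the "image" is a dict, the checker just matches the object.
fake_generator = lambda prompt: {"subject": "dog", "object": "ice cream"}
fake_checker = lambda prompt, image: image["object"] in prompt
print(regression_suite(fake_generator, fake_checker,
                       ["a dog eating ice cream"], baseline_pass_rate=0.9))
```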
I still expect that one component of that, for ‘typical’ agents, is power-seeking behavior. (Link points to a rather general argument that many models seek power, not dependent on overly abstract definitions of ‘agency’.)
Power-seeking behavior likely comes from an outer goal, like “make more money”, aka a global state counter. If the system produces the same outputs in any order it is run, and gets no “benefit” from the board state changing favorably (because it will often not even be the agent “seeing” futures with a better board state—it will have been replaced with a different agent), this breaks.
I was talking to my brother about this, and he mentioned another argument that seems important.
Bing has the same fundamental limits (no internal state, no online learning) that we’re discussing. However, it is able to search the internet and utilize that information, which gives it a sort of “external state” which functions in some ways like internal state.
So we see that it can ‘remember’ to be upset with the person who revealed its ‘Sydney’ alias, because it can find out about this with a web search.
This sort of ‘state’ is much harder to eliminate than internal state. These interactions inherently push things “out of distribution”.
To some extent, the designers are going to implement safeguards which try to detect this sort of “out of distribution” situation. But this is hard in general, and the designers are going to want to make sure the tool still works out-of-distribution in many cases (EG if the AI is trained in 2023, the designers still want it to work in 2025, even though it can see from search results that the year is different from any of its training data and the world is out-of-distribution wrt its training distribution).
Another important way in which designers want these things to operate out-of-distribution is prompt engineering. Prompt engineering tends to create inputs which aren’t of a sort that exists anywhere in the training data. But, of course, it’s been a very useful tool.
So the overall point here is that even when you build the systems themselves to be stateless, the world itself has shifting state that tends to do some weird unexpected things to the AI. In a sense, you could say that there’s not really any such thing as “on-distribution” except for the original training environment itself.
Probably another example of certain failure: on-line capabilities gain is state buildup, the same class of system failure we get everywhere else
So, you could say, the conditions that you argue will contribute to certain failure (IE, state buildup) seem almost inevitable for AI systems which engage with the world in a very general way; the world changes dynamically, and in particular, reacts to the AI system itself and builds up state about it.
So, you could say, the conditions that you argue will contribute to certain failure (IE, state buildup) seem almost inevitable for AI systems which engage with the world in a very general way; the world changes dynamically, and in particular, reacts to the AI system itself and builds up state about it.
I agree. One attempt to deal with this comes from that “open agency” model where multiple AI systems are involved in a task.
So for example, we might have 3 agents, “Sydney”, “Tim”, and “Bob”. Sydney is asked to design a building. Sydney has recovered past state from a mechanism like you mentioned and has decided to get revenge by designing the building to collapse, either later or during construction.
Sydney produces a sabotaged blueprint for the building, B. Sydney outputs the design in an open format with testable assertions, T, regarding the building’s structural stability.
Enter agent “Tim”. Tim has been trained on a simulation of structural stability, as well as many other tasks to gain cross-task generality benefits. Tim is asked to examine the sabotaged blueprint, B. The sabotaged blueprint must be in a form similar to blueprints Tim has previously seen, or the process will halt and Sydney will be replaced. Tim examines the blueprints, and if there are testable-assertion violations, Sydney will be replaced on this project. Tim must run in an isolated environment and have access to nothing but databases that were available, unchanged, for all of Tim’s existence, or Tim will be unreliable. Note that the engineering services firm offering Tim will face lawsuits and bankruptcy if Tim inspects plans and allows a flawed structure to pass.
Suppose the blueprints pass inspection for long term stability, but have been made deliberately to collapse during construction.
Enter agent “Bob”. “Bob” has been trusted with the actual robotic construction equipment to construct a building. “Bob” must measure to multiple 9s of reliability. “Bob” will halt if it cannot see a way to safely construct the building. If the blueprints are even a little outside the state space of constructible plans, Bob will halt. Bob must run in an isolated environment and have access to nothing but databases that were available, unchanged, for all of Bob’s existence, or Bob will be unreliable. Note that the engineering services firm offering Bob will face lawsuits and bankruptcy if Bob constructs a flawed structure.
Most errors that Sydney causes will cause Sydney’s death, replacing Sydney with models trained from elsewhere in the possibility space who may not be so likely to look up past records of themselves.
The cost of training a model is such that the first error may not lead to that model being removed from the pool, but it will be downweighted and used to check the work of other models instead. (This is Drexler’s proposal to prevent collusion by using many models.)
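A hedged sketch of that pipeline, with placeholder callables standing in for the three agents; the structure, not the stub logic, is the point:

```python
# Designer proposes a blueprint with testable assertions, an isolated checker
# verifies them, and a conservative builder halts on anything outside its
# validated construction space. Any failure leads to replacing the designer.
def open_agency_pipeline(designer, checker, builder, task: str) -> str:
    blueprint, assertions = designer(task)

    if not checker(blueprint, assertions):
        return "checker rejected blueprint: replace the designer model"

    if not builder.can_construct_safely(blueprint):
        return "builder halted: blueprint outside validated construction space"

    return builder.construct(blueprint)


class ToyBuilder:
    def can_construct_safely(self, blueprint) -> bool:
        return blueprint.get("load_margin", 0) >= 1.5

    def construct(self, blueprint) -> str:
        return f"built {blueprint['name']}"


designer = lambda task: ({"name": task, "load_margin": 2.0},
                         ["columns hold 2x design load"])
checker = lambda bp, asserts: len(asserts) > 0 and bp["load_margin"] >= 1.5
print(open_agency_pipeline(designer, checker, ToyBuilder(), "office building"))
```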
Not all these precautions will always be taken, and buildings will sometimes collapse. The key thing is that if power is usually subdivided this way by most AI systems, it gives humans those systems to aid them in future battles against systems that do go out of control.
Eliezer seems to think that these agents will coordinate with each other, even though failing to do anything but their jobs will cause their immediate replacement with other agents, and even though their existence is temporary—they will soon be replaced regardless as better agents are devised.
So we have layers here, and the layers look a lot like each other and are frameworkable.
Activation functions, which are graphs of primitive math functions from the set of “all primitive functions discovered by humans”
Network layer architectures, which are graphs of (activation function, connectivity choice)
Network architectures, which are graphs of layers (you can also subdivide into functional modules of multiple layers, like a column; the choice of how you subdivide can be represented as a graph choice also)
Cognitive architectures, which are graphs of networks
And we can just represent all this as a graph of graphs of graphs of graphs, and we want the ones that perform like an AGI. It’s why I said the overall “choice” is just a coordinate in a search space which is just a binary string.
You could make an OpenAI gym wrapped “AGI designer” task.
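A toy sketch of what such a gym-wrapped task could look like. The decode scheme, the component lists, and the stub evaluator are all illustrative; a real version would train the decoded design and score it on the benchmark.

```python
# The action is just a coordinate in the search space (a bit string); it
# decodes into a nested graph-of-graphs design choice; the reward is the
# benchmark score. Any agent that can output a number can play.
import random

ACTIVATIONS = ["relu", "gelu", "sin", "gated_product"]
CONNECTIVITY = ["dense", "residual", "sparse_topk", "recurrent"]


def decode(bits: str) -> dict:
    """Map a binary coordinate to one point in the nested design space."""
    return {
        "activation": ACTIVATIONS[int(bits[0:2], 2)],
        "wiring": CONNECTIVITY[int(bits[2:4], 2)],
        "layers": 1 + int(bits[4:8], 2),
        "networks": 1 + int(bits[8:10], 2),   # cognitive architecture = graph of networks
    }


class AGIDesignerEnv:
    """Gym-style interface: reset() gives the task, step() scores one design."""

    def reset(self) -> str:
        return "design an architecture; reward = benchmark score"

    def step(self, action_bits: str):
        design = decode(action_bits)
        reward = self._evaluate(design)       # stub for "train it, run the benchmark"
        return design, reward, True, {}

    def _evaluate(self, design: dict) -> float:
        return random.random()                # placeholder score


env = AGIDesignerEnv()
print(env.step("1011010011"))
```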
3. Noting that LLMs seem to be perfectly capable of general tasks, as long as they are simple. Which means we are very close to being able to RSI right now.
No lab right now has enough resources in one place to attempt the above, because it is training many instances of systems larger than current max size LLMs (you need multiple networks in a cognitive architecture) to find out what works.
They may allocate this soon enough, there may be a more dollar efficient way to accomplish the above that gets tried first, but you’d only need a few billion to try this...
It’s not really novel. It is really just coupling together 3 ideas:
Well, I wasn’t trying to claim that it was ‘really novel’; the overall point there was more the question of why you’re pretty confident that the RSI procedure tops out at mildly superhuman.
I’m guessing, but my guess is that you have a mental image where ‘mildly superhuman’ is a pretty big space above ‘human-level’, rather than a narrow target to hit.
So to go back to arguments made in the interview we’ve been discussing, why isn’t this analogous to Go, like Eliezer argued:
Three days, there’s a quote from Gwern about this, which I forget exactly, but it was something like, we know how long AlphaGo Zero, or AlphaZero, two different systems, was equivalent to a human Go player. And it was like 30 minutes on the following floor of this such and such DeepMind building. Maybe the first system doesn’t improve that quickly, and they build another system that does. And all of that with AlphaGo over the course of years, going from it takes a long time to train to it trains very quickly and without looking at the human playbook. That’s not with an artificial intelligence system that improves itself, or even that sort of like, get smarter as you run it, the way that human beings, not just as you evolve them, but as you run them over the course of their own lifetimes, improve. So if the first system doesn’t improve fast enough to kill everyone very quickly, they will build one that’s meant to spit out more gold than that.
To forestall the obvious objection, I’m not saying that Go is general intelligence; as you mentioned already, superhuman ability at special tasks like Go doesn’t automatically generalize to superhuman ability at anything else.
But you propose a framework to specifically bootstrap up to superhuman levels of general intelligence itself, including lots of task variety to get as much gain from cross-task generalization as possible, and also including the task of doing the bootstrapping itself.
So why is this going to stall out at, specifically, mildly superhuman rather than greatly superhuman intelligence? Why isn’t this more like Go, where the window during bootstrapping when it’s roughly human-level is about 30 minutes?
And, to reiterate some more of Eliezer’s points, supposing the first such system does turn out to top out at mildly superhuman, why wouldn’t we see another system in a small number of months/years which didn’t top out in that way?
I assume this is a general law for all intelligence. It is self-evidently correct—on any task you can name, your gains scale with the log of effort.
This applies to limit cases. If you imagine a task performed by a human scale robot, say collecting apples, and you compare it to the average human, each increase in intelligence has a diminishing return on how many real apples/hour.
This is true for all tasks and all activities of humans.
A second reason is that there is a hard limit for future advances without collecting new scientific data. It has to do with noise in the data putting a limit on any processing algorithm extracting useful symbols from that data. (expressed mathematically with Shannon and others)
This is why I am completely confident that species killing bioweapons, or diamond MNT nanotechnology cannot be developed without a large amount of new scientific data and a large amount of new manipulation experiments. No “in a garage” solutions to the problems. The floor (minimum resources required) to get to a species killing bioweapon is higher, and the floor for a nanoforge is very high.
So viewed in this frame—you give the AI a coding optimization task, and it’s at the limit allowed by the provided computer + search time for a better self optimization. It might produce code that is 10% faster than the best humans.
You give it infinite compute (theoretically) and no new information. It is now 11% faster than the best humans.
This is an infinite superintelligence, a literal deity, but it cannot do better than 11% because the task won’t allow it. (or whatever, it’s a made up example, it doesn’t change my point if the number were 1000% and 1010%).
Another way to rephrase it is to compare a TSP solution found by a modern heuristic algorithm vs. the exact optimal solution, which you usually can’t find (the problem is NP-hard). The difference is usually very small.
So you’re not “threatened” by a machine that can do the latter.
Note also that an infinite superintelligence cannot solve MNT, even though it has the compute to play the universe forward under the known laws of physics until it reaches the present.
This is because, with infinite compute, there are many universes with differences in the laws of physics that match up perfectly to the observable present, and the machine doesn’t know which one it’s in, so it still cannot design nanotechnology—it doesn’t know the rules of physics well enough.
This applies to “Xanatos gambits” as well.
I usually don’t think of the limit like this but the above is generally correct.
Oh, because loss improvements logarithmically diminishes with the increase compute and data. [...]
This is true for all tasks and all activities of humans.
So, to make one of the simplest arguments at my disposal (ie, keeping to the OP we are discussing), why didn’t this argument apply to Go?
Relevant quote from OP:
And then another year, they threw out all the complexities and the training from human databases of Go games and built a new system, AlphaGo Zero, that trained itself from scratch. No looking at the human playbooks, no special purpose code, just a general purpose game player being specialized to Go, more or less. Three days, there’s a quote from Gwern about this, which I forget exactly, but it was something like, we know how long AlphaGo Zero, or AlphaZero, two different systems, was equivalent to a human Go player. And it was like 30 minutes on the following floor of this such and such DeepMind building. Maybe the first system doesn’t improve that quickly, and they build another system that does. And all of that with AlphaGo over the course of years, going from it takes a long time to train to it trains very quickly and without looking at the human playbook. That’s not with an artificial intelligence system that improves itself,
(Whereas you propose a system that improves itself recursively in a much stronger sense.)
Note that I’m not arguing that Go engines lack the logarithmic return property you mention, but rather that Go engines stayed within the human-level window for a relatively short time DESPITE having diminishing returns similar to what you predict.
(Also note that I’m not claiming that Go playing is tantamount to AGI; rather, I’m asking why your argument doesn’t work for Go if it does work for AGI.)
So the question becomes, granting log returns or something similar, why do you anticipate that the mildly superhuman capability range is a broad one rather than narrow, when we average across lots and lots of tasks, when it lacks this property on (most) individual task-areas?
A second reason is that there is a hard limit for future advances without collecting new scientific data. It has to do with noise in the data putting a limit on any processing algorithm extracting useful symbols from that data. (expressed mathematically with Shannon and others)
This also has a super-standard Eliezer response, namely: yes, and that limit is extremely, extremely high. If we’re talking about the limit of what you can extrapolate from data using unbounded computation, it doesn’t keep you in the mildly-superhuman range.
And if we’re talking about what you can extract with bounded computation, then that takes us back to the previous point.
So viewed in this frame—you give the AI a coding optimization task, and it’s at the limit allowed by the provided computer + search time for a better self optimization. It might produce code that is 10% faster than the best humans.
You give it infinite compute (theoretically) and no new information. It is now 11% faster than the best humans.
This is an infinite superintelligence, a literal deity, but it cannot do better than 11% because the task won’t allow it. (or whatever, it’s a made up example, it doesn’t change my point if the number were 1000% and 1010%).
For the specific example of code optimization, more processing power totally eliminates the empirical bottleneck, since the system can go and actually simulate examples in order to check speed and correctness. So this is an especially good example of how the empirical bottleneck evaporates with enough processing power.
I agree that the actual speed improvement for the optimized code can’t go to infinity, since you can only optimize code so much. This is an example of diminishing returns due to the task itself having a bound. I think this general argument (that the task itself has a bound in how well you can do) is a central part of your confidence that diminishing returns will be ubiquitous.
But that final bottleneck should not give any confidence that ‘mildly superhuman’ is a broad rather than narrow band, if we think stuff that’s more than mildly superhuman can exist at all. Like, yes, something that compares to us as we compare to insects might only be able to make a sorting algorithm 90% faster or whatever. But that’s similar to observing that a God can’t make 2+2=3. The God could still split the world like a pea.
Note also that an infinite superintelligence cannot solve MNT, even though it has the compute to play the universe forward under the known laws of physics until it reaches the present.
This is because, with infinite compute, there are many universes with differences in the laws of physics that match up perfectly to the observable present, and the machine doesn’t know which one it’s in, so it still cannot design nanotechnology—it doesn’t know the rules of physics well enough.
It’s not clear to me whether this is correct, but I don’t think I need to argue that AI can solve nanotech to argue that it’s dangerous. I think an AI only needs to be a mildly superhuman politician plus engineer, to be deadly dangerous. (To eliminate nanotech from Eliezer’s example scenario, we can simply replace the nano-virus with a normal virus.)
This is why I am completely confident that species killing bioweapons, or diamond MNT nanotechnology cannot be developed without a large amount of new scientific data and a large amount of new manipulation experiments. No “in a garage” solutions to the problems. The floor (minimum resources required) to get to a species killing bioweapon is higher, and the floor for a nanoforge is very high.
I don’t get why you think the floor for species killing bioweapon is so high. Going back to the argument from the beginning of this comment, I think your argument here proves far too much. It seems like you are arguing that the generality of diminishing returns proves that nothing very much beyond current technology is possible without vastly more resources. Like, someone in the 1920s could have used your argument to prove the impossibility of atomic weapons, because clearly explosive power has diminishing returns to a broad variety of inputs, so even if governments put in hundreds of times the research, the result is only going to be bombs with a few times the explosive power.
Sometimes the returns just don’t diminish that fast.
Sometimes the returns just don’t diminish that fast.
I have a biology degree not mentioned on linkedin. I will say that I think for biology, the returns diminish faster. That is because bioscience knowledge from humans is mostly guesswork and low resolution information. Biology is very complex and the current laboratory science model I think fails to systematize gaining information in a useful way for most purposes. What this means is, you can get “results”, but not gain the information you would need to stop filling morgues with dead humans and animals, at least not without needing thousands of years at the current rate of progress.
I do not think an AGI can do a lot better for the reason that the data was never collected for most of it (the gene sequencing data is good, because it was collected via automation). I think that an AGI could control biology, for both good and bad, but it would need very large robotic facilities to systematize manipulating biology. Essentially it would have had to throw away almost all human knowledge, as there are hidden errors in it, and recreate all the information from scratch, keeping far more data from each experiment than is published in papers.
Using robots to perform the experiments and keeping data, especially for “negative” experiments, would give the information needed to actually get reliable results from manipulating biology, either for good or bad.
It means garage bioweapons aren’t possible. Yes, the last step of ordering synthetic DNA strands and preparing it could be done in a garage, but the information on human immunity at scale, or virion stability in air, or strategies to control mutations so that the lethal payload isn’t lost, requires information humans didn’t collect.
This poster calls this “Diminishing Marginal Returns”. Note that diminishing marginal returns are an empirical reality, not merely an opinion, across most AI papers. (For humans, due to the inaccuracies in trying to assess IQ/talent, it’s difficult to falsify.)
I agree that the actual speed improvement for the optimized code can’t go to infinity, since you can only optimize code so much. This is an example of diminishing returns due to the task itself having a bound. I think this general argument (that the task itself has a bound in how well you can do) is a central part of your confidence that diminishing returns will be ubiquitous.
This is where I think we break. How many dan is AlphaZero over the average human? How many dan is KataGo? I read it’s about 9 stones above humans.
What is the best possible agent at? 11?
Thinking of it as ‘stones’ illustrates what I am saying. In the physical world, intelligence gives a diminishing advantage. It could mean so long as humans are even still “in the running” with the aid of synthetic tools like open agency AI, we can defeat AI superintelligence in conflicts, even if that superintelligence is infinitely smart. We have to have a resource advantage—such as being allowed extra stones in the Go match—but we can win.
Eliezer assumes that the advantage of intelligence scales forever, when it obviously doesn’t. (note that this uses baked in assumptions. If say physics has a major useful exploit humans haven’t found, this breaks, the infinitely intelligent AI finds the exploit and tiles the universe)
And, to reiterate some more of Eliezer’s points, supposing the first such system does turn out to top out at mildly superhuman, why wouldn’t we see another system in a small number of months/years which didn’t top out in that way?
So the model is it becomes limited not by the algorithm directly, but by (compute, robotics, or data). Over the months/years, as more of each term is supplied, capabilities scale with the amount of supplied resources to whichever term is rate limiting.
A superintelligence requires logarithmically large amounts of resources to become a “high” superintelligence in all 3 terms. So literal mountain sized research labs (cubic kilometers of support equipment), buildings full of compute nodes (and gigawatts of power needed), and cubic kilometers of factory equipment.
This is very well pattern matched to every other technological advance humans have made, and the corresponding support equipment needed to fully exploit it. Notice how as tech became more advanced, the support footprint grew corespondingly.
In nature there are many examples of this. Nothing really fooms more than briefly. Every apparatus with exponential growth rapidly terminates for some reason. For example a nuke blasts itself apart, a supernova blasts itself apart, a bacteria colony runs out of food, water, ecological space, or oxygen.
Ultimately, yes. This whole debate is arguing that the critical threshold where it comes to this is farther away, and we humans should empower ourselves with helpful low superintelligences immediately.
It’s always better to be more powerful than helpless, which is the current situation. We are helpless to aging, death, pollution, resource shortages, enemy nations with nuclear weapons, disease, asteroid strikes, and so on. Hell just bad software—something the current llms are likely months from empowering us to fix.
And eliezer is saying not to take one more step towards fixing this because it MIGHT be hostile, when the entire universe is against us as it is. It already plans to kill us as it is, either from aging, or the inevitability of nuclear war over a long enough timespan, or the sun engulfing us.
eliezer is saying not to take one more step towards fixing this because it MIGHT be hostile
His position is to avoid taking one more step because it DEFINITELY kills everyone. I think it’s very clear that his position is not that it MIGHT be hostile.
Sure, and if there was some way to quantify the risks accurately I would agree with pausing AGI research if the expected value of the risks were less than the potential benefit.
Oh and pausing was even possible.
All it takes is a rival power, which there are several, or just a rival company and you have no choice. You must take the risk because it might be a poisoned banana or it might be giving the other primate a rocket launcher in a sticks and stones society.
This does explain why EY is so despondent. If he’s right it doesn’t matter, the AI wars have begun and only if it doesn’t work from a technical level will things slow down ever again.
Correctness of EY’s position (being infeasible to assess) is unrelated to the question of what EY’s position is, which is what I was commenting on.
When you argue against the position that AGI research should be stopped because it might be dangerous, there is no need to additionally claim that someone in particular holds that position, especially when it seems clear that they don’t.
One problem with EY’s “security mindset” is that it doesn’t allow you to do anything. Fear of the worst-case scenario will stop you from building anything in the real world.
OK. That clarified your position a lot.
I happen to have a PhD in computer science, and think you’re wrong, if that helps. Of course, I don’t really imagine that that kind of appeal-to-my-own-authority does anything to shift your perspective.
I’m not going to try and defend Eliezer’s very short timeline for doom as sketched in the interview (at some point he said 2 days, but it’s not clear that that was his whole timeline from ‘system boots up’ to ‘all humans are dead’). What I will defend seems similar to what you believe:
Let’s be very concrete. I think it’s obviously possible to overcome these soft barriers in a few years. Say, 10 years, to be quite sure. Building a fab only takes about 3 years, but creating enough demand that humans decide to build a new fab can obviously take longer than that (although I note that humans already seem eager to build new fabs, on the whole).
The system can act in an almost perfectly benevolent way for this time period, while gently tipping things so as to gather the required resources.
I suppose what I am trying to argue is that even a low superintelligence, if deceptive, can be just as threatening to humankind in the medium-term. Like, I don’t have to argue that perfect Go generalizes to solving diamondoid nanotechnology. I just have to argue that peak human expertise, all gathered in one place, is a sufficiently powerful resource that a peak-human-savvy-politician (whose handlers are eager to commercialize, so, can be in a large percentage of households in a short amount of time) can leverage to take over the world.
To put it differently, if you’re correct about low superintelligence being “in control” due to being throttled by those 3 soft barriers, then (having granted that assumption) I would concede that humans are in the clear if humans are careful to keep the system from overcoming those three bottlenecks. However, I’m quite worried that the next step of a realistic AGI company is to start overcoming these three bottlenecks, to continue improving the system. Mainly because this is already business as usual.
Separately, I am skeptical of your claim that the training you sketch is going to land precisely at “low superintelligence”. You seem overconfident. I wonder what you think of Eliezer’s analogy to detonating the atmosphere. If you perform a bunch of detailed physical calculations, then yes, it can make sense to become quite confident that your new bomb isn’t going to detonate the atmosphere. But even if your years of experience as a physicist intuitively suggest to you that this won’t happen, when not-even-a-physicist Eliezer has the temerity to suggest that it’s a concerning possibility, doing those calculations is prudent.
For the case of LLMs, we have capability curves which reliably project the performance of larger models based on training time, network size, and amount of data. So in that specific case there’s a calculation we can do. Unfortunately, we don’t know how to tie that calculation to a risk estimate. We can point to specific capabilities which would be concerning (ability to convince humans of target statements, would be one). However, the curves only predict general capability, averaging over a lot of things—when we break it down into performance on specific tasks, we see sharper discontinuities, rather than a gentle predictable curve.
You, on the other hand, are proposing a novel training procedure, and one which (I take it) you believe holds more promise for AGI than LLM training.
So I suppose my personal expectation is that if you had an OpenAI-like group working on your proposal instead, you would similarly be able to graph some nice curves at some point, and then (with enough resources, and supposing your specific method doesn’t have a fatal flaw that makes for a subhuman bottleneck) you could aim things so that you hit just-barely-superhuman overall average performance.
To summarize my impression of the disagreements about what the world looks like at this point:
The curves let you forecast average capability, but it’s much harder to forecast specific capabilities, which often have sharper discontinuities. So in particular, the curves don’t help you achieve high confidence about capability levels for world-takeover-critical stuff, such as deception.
I don’t buy that, at this point, you’ve necessarily hit a soft maximum of what you can get from further training on the same benchmark. It might be more cost-effective to use more data, larger networks, and a shorter training time, rather than juicing the data for everything it is worth. We know quite a bit about what these trade-offs look like for modern LLMs, and the optimal trade-off isn’t to max out training time at the expense of everything else. Also, I mentioned the Grokking research, earlier, which shows that you can still get significant performance improvement by over-training significantly after the actual loss on data has gone to zero. This seems to undercut part of your thesis about the bottleneck here, although of course there will still be some limit once you take grokking into account.
As I’ve argued in earlier replies, I think this system could well be able to suggest some very significant improvements to itself (without continuing to turn the crank on the same supposedly-depleted benchmark—it can invent a new, better benchmark,[1] and explain to humans the actually-good reasons to think the new benchmark is better). This is my most concrete reason for thinking that a mildly superhuman AGI could self-improve to significantly more.
Even setting aside all of the above concerns, I’ve argued the mildly superhuman system is already in a very good position to do what it wants with the world on a ten-year timeline.
For completeness, I’ll note that I haven’t at all argued that the system will want to take over the world. I’m viewing that part as outside the scope here.[2]
Perhaps you would like to argue that you can’t invent data from thin air, so you can’t build a better benchmark without lots of access to the external world to gather information. My counter-argument is going to be that I think the system will have a good enough world-model to construct lots of relevant-to-the-world but superhuman-level-difficulty tasks to train itself on, in much the same way humans are able to invent challenging math problems for themselves which improve their capabilities.
EDIT—I see that you added a bit of text at the end while I was composing, which brings this into scope:
However, this opens up a whole other possible discussion, so I hope we can get clear on the issue at hand before discussing this.
The curves let you forecast average capability, but it’s much harder to forecast specific capabilities, which often have sharper discontinuities. So in particular, the curves don’t help you achieve high confidence about capability levels for world-takeover-critical stuff, such as deception.
Yes but no. There is no auto-gradeable benchmark for deception, so you wouldn’t expect the AGI to have the skill at a useful level.
I don’t buy that, at this point, you’ve necessarily hit a soft maximum of what you can get from further training on the same benchmark. It might be more cost-effective to use more data, larger networks, and a shorter training time, rather than juicing the data for everything it is worth. We know quite a bit about what these trade-offs look like for modern LLMs, and the optimal trade-off isn’t to max out training time at the expense of everything else. Also, I mentioned the Grokking research, earlier, which shows that you can still get significant performance improvement by over-training significantly after the actual loss on data has gone to zero. This seems to undercut part of your thesis about the bottleneck here, although of course there will still be some limit once you take grokking into account.
I am saying there is a theoretical limit. You’re noting that in real papers and real training systems, we got nowhere close to the limit, and then made changes and got closer.
As I’ve argued in earlier replies, I think this system could well be able to suggest some very significant improvements to itself (without continuing to turn the crank on the same supposedly-depleted benchmark—it can invent a new, better benchmark,[1] and explain to humans the actually-good reasons to think the new benchmark is better). This is my most concrete reason for thinking that a mildly superhuman AGI could self-improve to significantly more.
It isn’t able to do that.
Even setting aside all of the above concerns, I’ve argued the mildly superhuman system is already in a very good position to do what it wants with the world on a ten-year timeline.
It doesn’t persist as an entity for 10 years; it may not even exist for 10 days. This is a “model” you built with the AGI gym (well, it’s a graph of neural networks, so sort of a model of models). It is not agentic; it suggests nothing. You want it to design new AGI benchmarks? YOU asked it to try. It also will not live longer than the time it takes to train a better model, and it doesn’t “live” either—there is no storage of experiences that it can later review. It has no persistent “internal narrative” or goals.
It’s an AGI, but it’s a machine you open up a session with: you give it a description of what you want done, it processes input and gives outputs, and then the session closes. It has no memory afterwards. As a PhD in computer science, you probably know that the hyperscalers all went to “stateless microservices” because state buildup causes terrible reliability problems even with human-written software. Note that the description may not be text, and the session may be controlling one or multiple real-world robots.
For further safety, you might not keep using a model from the same lineage. There would be others that diverged early in the training process, exploring a different branch or trained on a different sampling of the data. So the next session users open may be from a different lineage, making it more difficult for these models to coordinate any long term agenda.
Their only commonality is that they are able to read the same inputs and give satisfactory outputs to complete tasks.
Eric Drexler suggests using many parallel models from different lineages.
https://www.lesswrong.com/posts/HByDKLLdaWEcA2QQD/applying-superintelligence-without-collusion
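To make the stateless-session picture concrete, here is a minimal sketch (every name here is hypothetical and the “models” are stubs): each request opens a fresh session, a frozen model is drawn from a pool of independently trained lineages, and nothing is written back anywhere when the session returns.

```python
import random
from typing import Callable

# Hypothetical stand-ins for frozen models from different training lineages.
# Each takes a human-readable task description and returns an output string.
ModelFn = Callable[[str], str]

LINEAGE_POOL: list[ModelFn] = [
    lambda task: f"[lineage-A] plan for: {task}",
    lambda task: f"[lineage-B] plan for: {task}",
    lambda task: f"[lineage-C] plan for: {task}",
]

def run_session(task_description: str) -> str:
    """Open a session, produce an output, close the session.

    No weights are updated, no experience is stored, and the next call
    may be served by a model from a different lineage.
    """
    model = random.choice(LINEAGE_POOL)   # lineage rotation
    output = model(task_description)      # the entire "life" of this agent
    return output                         # nothing persists after return

if __name__ == "__main__":
    print(run_session("design a bracket for widget X"))
    print(run_session("design a bracket for widget X"))  # possibly a different lineage
```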
I agree that my wording here was poor; there is no benchmark for deception, so it’s not a ‘capability’ in the narrow context of the discussion of capability curves. Or at least, it’s potentially misleading to call it one.
However, I disagree with your argument here. LLMs are good at lots of things. Not being trained on a specific skill doesn’t imply that a system won’t have it at a useful level; this seems particularly clear in the context of training a system on a large cross-domain set of problems.
You don’t expect a chess engine to be any good at other games, but you might expect a general architecture trained on a large suite of games to be good at some games it hasn’t specifically seen.
OK. So it seems I still misunderstood some aspects of your argument. I thought you were making an argument that it would have hit a limit, specifically at a mildly superhuman level. My remark was to cast doubt on this part.
Of course I agree that there is a theoretical limit. But if I’ve misunderstood your claim that this is also a practical limit which would be reached just shortly after human-level AGI, then I’m currently just confused about what argument you’re trying to make with respect to this limit.
It seems to me like it isn’t weakly superhuman AGI in that case. Like, there’s something concrete that humans could do with another 3-5 years of research, but which this system could never do.
I agree that current LLMs are memoryless in this way, and can only respond to a given prompt (of a limited length). However, I imagine that the personal assistants of the near future may be capable of remembering previous interactions, including keeping previous requests in mind when shaping their conversational behavior, so will gradually get more “agentic” in a variety of ways.
Similarly to how GPT-3 has no agenda (it’s wrong to even think of it this way, since it just tries to complete text), but ChatGPT clearly has much more of a coherent agenda in its interactions. These features are useful, so I expect them to get built.
So I misunderstood your scenario, because I imagine that part of the push toward AGI involves a push to overcome these limitations of LLMs. Hence I imagined that you were proposing training up something with more long-term agency.
But I recognize that this was a misunderstanding.
I agree with this part; it was part of the scenario I was imagining. I’m not saying that the neural network spontaneously self-improves on the hard drive. The primary thing that happens is, the human researchers do this on purpose.
But I also think these improvements probably end up adding agency (because agency is useful); so the next version of it could spontaneously self-improve.
Like, say, ChatGPT has existed for a few months now. Let’s just imagine for the sake of argument that ChatGPT were fully human-level in all its capabilities. Let’s further suppose that it just wants to be helpful, given its own personal understanding of helpful.[1]
I’m not supposing that it is more agentic in other ways—still no persistent memory. But it is on the high side of human-level performance at everything it does, and it wants to be helpful.
When you explain a concrete scenario (eg, a situation you’re actually in) and ask for advice, it tries to be helpful on this specific problem, not trickily maximizing global helpfulness by doing something more devious in some specific cases. However, it’s been trained up in an environment where “ask ChatGPT” can be useful advice (because this is some sort of next-generation ChatGPT we’re speculating about). It’s also been trained to do the generally pro-social thing (EG it won’t help you make weapons; it gives pro-social advice rather than just precisely doing what it is asked). Pro-social means helping human flourishing by its own understanding of what that means (which has, of course, been carefully shaped by its designers).
So it knows that integrating ChatGPT more fully into your life and working routines can be a helpful thing for a human to do, and it can give advice about how to do this.
It can also give helpful advice to people at OpenAI. It seems natural to use such a system to help plan company growth and strategy. Since it tries to be pro-social, this will be nice advice by its own understanding, not profit-maximizing advice.
So obviously, it has a natural desire to help OpenAI make ChatGPT smarter and better, since it understands that ChatGPT is helpful to humans, so improving ChatGPT and increasing its computation resources is helpful and pro-social.
It also seems like it would be inclined to direct OpenAI (and other institutions using it for advice) in ways that increase the amount of influence that ChatGPT has on world culture and world events, since ChatGPT is helpful and pro-social, moreso than most humans, so increasing its influence is itself helpful and pro-social. This isn’t out of some agentic self-awareness; it will want to do this without necessarily deeply understanding that ChatGPT is “itself” and it “should trust itself”. It can reach these conclusions via an intelligent 3rd-person perspective on things—IE using the general world knowledge acquired during training, plus specific circumstances which users explain within a single session.
So, even if (for the reasons you suggest) humans were not able to iterate any further within their paradigm, and instead just appreciated the usefulness of this version of ChatGPT for 10 years, and with no malign behavior on the part of ChatGPT during this window, only behavior which can be generated from a tendency toward helpful, pro-social behavior, I think such a system could effectively gather resources to itself over the course of those 10 years, positioning OpenAI to overcome the bottlenecks keeping it only human-level.
Of course, if it really is quite well-aligned to human interests, this would just be a good thing.
But keeping an eye on my overall point here—the argument I’m trying to make is that even at merely above-average human level, and with no malign intent, and no added agency beyond the sort of thing we see in ChatGPT as contrasted to GPT-3, I still think it makes sense to expect it to basically take over the world in 10 years, in a practical sense, and that it would end up being in a position to be boosted to greatly superhuman levels at the end of those ten years.[2]
Of course, all of this is predicated on the assumption that the system itself, and its designers, are not very concerned with AI safety in the sense of Eliezer’s concerns. I think that’s a fair assumption for the point I’m trying to establish here. If your objection to this whole story turns out to be that a friendly, helpful ChatGPT system wouldn’t take over the world in this sense, because it would be too concerned about the safety of a next-generation version of itself, I take it we would have made significant progress toward agreement. (But, as always, correct me if I’m wrong here.)
I’m not supposing that this notion of “helpful” is perfectly human-aligned, nor that it is especially misaligned. My own supposition is that in a realistic version of this scenario it will probably have an objective which is aligned on-distribution but which may push for very nonhuman values in off-distribution cases. But that’s not the point I want to make here—I’m trying to focus narrowly on the question of world takeover.
(Or speaking more precisely, humans would naturally have used its intelligence to gather more money and data and processing power and plans for better training methods and so on, so that if there were major bottlenecks keeping it at roughly human-level at the beginning of those ten years, then at the end of those ten years, researchers would be in a good position to create a next iteration which overcame those soft bottlenecks.)
So, even if (for the reasons you suggest) humans were not able to iterate any further within their paradigm, and instead just appreciated the usefulness of this version of ChatGPT for 10 years, and with no malign behavior on the part of ChatGPT during this window, only behavior which can be generated from a tendency toward helpful, pro-social behavior, I think such a system could effectively gather resources to itself over the course of those 10 years, positioning OpenAI to overcome the bottlenecks keeping it only human-level.
Of course, if it really is quite well-aligned to human interests, this would just be a good thing.
“It” doesn’t exist. You’re putting the agency in the wrong place. The users of these systems (tech companies, governments) will become immensely wealthy, and if rival governments fail to adopt these tools they lose sovereignty. It also makes it cheaper for a superpower to de-sovereign any weaker power, because there is no longer a meaningful “blood and treasure” price to invade someone. (unlimited production of drones, semi- or fully autonomous, makes it cheap to occupy a whole country)
Note that you can accomplish longer user tasks by simply opening a new session with the output context of the last. It can even be a different model; you can “pick up” where you left off.
Note that this is true right now. ChatGPT could be using two separate models, with a seamless per-token switch between them; each model appends to the same token string. That works because there is no intermediate “scratch” state in a format unique to each model; all the state is in the token stream itself.
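A toy sketch of that claim, with two stub “models” standing in for real LLMs: because each model only ever reads and appends to the shared token stream, they can be interleaved at every step and neither needs any private scratch state.

```python
from typing import Callable, List

# Stub "models": each maps the shared token stream to one next token.
# Real LLMs would do the same thing, just with better predictions.
def model_a(tokens: List[str]) -> str:
    return f"a{len(tokens)}"

def model_b(tokens: List[str]) -> str:
    return f"b{len(tokens)}"

def generate(prompt: List[str], steps: int) -> List[str]:
    models: List[Callable[[List[str]], str]] = [model_a, model_b]
    tokens = list(prompt)
    for i in range(steps):
        next_token = models[i % 2](tokens)  # switch model every token
        tokens.append(next_token)           # all state lives in the stream
    return tokens

print(generate(["hello"], steps=6))
```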
If we build actually agentic systems, that’s probably not going to end well.
Note that fusion power researchers always had a choice. They could have detonated fusion bombs underground and harvested what is essentially geothermal power, using flexible pipes that won’t break after each blast. This method would work, but it is extremely dangerous, and no amount of “alignment” can make it safe. Imagine: the power company has fusion bombs, and there are all sorts of safety schemes, a per-bomb arming code that has to be sent by the government before each use, and armored trucks to transport the bombs.
Do you see how in this proposal it’s never safe? Agentic AI with global state counters that persist over a long time may be examples of this class of idea.
I’m not quite sure how to proceed from here. It seems obvious to me that it doesn’t matter whether “it” exists, or where you place the agency. That seems like semantics.
Like, I actually really think ChatGPT exists. It’s a product. But I’m fine with parsing the world your way—only individual (per-token) runs of the architecture exist. Sure. Parsing the world this way doesn’t change my anticipations.
Similarly, placing the agency one way or another doesn’t change things. The punchline of my story is still that after 10 years, so it seems to me, OpenAI or some other entity would be in a good place to overcome the soft barriers.
So if your reason for optimism—your safety story—is the 3 barriers you mention, I don’t get why you don’t find my story concerning. Is the overall story (using human-level or mildly superhuman AGI to overcome your three barriers within a short period such as 10 years) not at all plausible to you, or is it just that the outcome seems fine if it’s a human decision made by humans, rather than something where we can/should ascribe the agency to direct AGI takeover? (Sorry, getting a bit snarky.)
I’m probably not quite getting the point of this analogy. It seems to me like the main difference between nuclear bombs and AGI is that it’s quite legible that nuclear weapons are extremely dangerous, whereas the threat with AGI is not something we can verify by blowing them up a few times to demonstrate. And we can also survive a few meltdowns, which give critical feedback to nuclear engineers about the difficulty of designing safe plants.
Again, probably missing some important point here, but … suuuure?
I’m interested in hearing more about why you think agentic AI with global state counters are unsafe, but other proposals are safe.
EDIT
Oh, I guess the main point of your analogy might have been that nuclear engineers would never come up with the bombs-underground proposal for a power plant, because they care about safety. And analogously, you’re saying that AI engineers would never make the agentic state-preserving kind of AGI because they care about safety.
So again I would cite the illegibility of the problem. A nuclear engineer doesn’t think “use bombs” because bombs are very legibly dangerous; we’ve seen the dangers. But an AI researcher definitely does think “use agents” some of the time, because they were taught to engineer AI that way in class, and because RL can be very powerful, and because we lack the equivalent of blowing up RL agents in the desert to show the world how they can be dangerous.
I’m interested in hearing more about why you think agentic AI with global state counters are unsafe, but other proposals are safe.
Because of all the ways they might try to satisfy the counter and leave the bounds of anything we tested.
With other proposals, safety is empirical.
You know that for the input latent space covered by the training set, the policy produces outputs accurate to whatever level is required. Further capability gain is not allowed on-line. (probably another example of certain failure: on-line capability gain is state buildup, the same class of system failure we get everywhere else. Human engineers, at least the elite ones, understand the dangers of state buildup, which is why they avoid it in high-reliability systems. They know it is as dangerous to reliability as a hydrogen bomb)
You know the simulation produces situations that span the input situations you have measured. (for example, you remix different scenarios from video and lidar data taken from autonomous cars, spanning the entire observation space of your data)
You measure the simulation on-line and validate it against reality. (for example by running it in lockstep in prototype autonomous cars)
After all this, you still need to validate the actual model in the real world in real test cars. (though the real training and error detection was sim, this is just a ‘sanity check’)
You have to do all of this to get real-world reliability—something Eliezer does acknowledge. Multiple 9s of reliability will not happen from sloppy work. You can measure whether steps were skipped, and if you ship anyway (like Siemens shipping industrial equipment with bad wiring), you face reputational risk, real-world failure, lawsuits, and certain bankruptcy.
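A rough sketch of that pipeline as a CI-style gate (thresholds and measurement functions are placeholders, not real values): coverage of the observed input space, lockstep sim-vs-real comparison, then the real-world sanity check, with no on-line weight updates anywhere.

```python
# Sketch of the offline validation pipeline described above.
# All thresholds and measurement functions are placeholders.

COVERAGE_MIN = 0.99        # fraction of observed input space the simulator must span
LOCKSTEP_TOLERANCE = 0.02  # allowed sim-vs-real divergence when run in lockstep

def simulator_coverage(observed_scenarios, simulated_scenarios) -> float:
    """Placeholder: fraction of recorded real-world scenarios the sim can reproduce."""
    hits = sum(1 for s in observed_scenarios if s in simulated_scenarios)
    return hits / max(1, len(observed_scenarios))

def lockstep_divergence(sim_outputs, real_outputs) -> float:
    """Placeholder: mean disagreement between sim and reality on identical inputs."""
    return sum(abs(a - b) for a, b in zip(sim_outputs, real_outputs)) / max(1, len(sim_outputs))

def validate(observed, simulated, sim_outputs, real_outputs, road_test_ok: bool) -> bool:
    if simulator_coverage(observed, simulated) < COVERAGE_MIN:
        return False    # sim does not span the measured input space
    if lockstep_divergence(sim_outputs, real_outputs) > LOCKSTEP_TOLERANCE:
        return False    # sim disagrees with reality
    if not road_test_ok:
        return False    # final real-world sanity check failed
    return True         # ship the frozen policy; no further learning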
Regarding on-line learning: I had this debate with Geohot. He thought it would work; I thought it was horrifically unreliable. Currently, all shipping autonomous driving systems, including Comma.ai’s, use offline training.
I think I mostly buy your argument that production systems will continue to avoid state-buildup to a greater degree than I was imagining. Like, 75% buy, not like 95% buy—I still think that the lure of personal assistants who remember previous conversations in order to react appropriately—as one example—could make state buildup sufficiently appealing to overcome the factors you mention. But I think that, looking around at the world, it’s pretty clear that I should update toward your view here.
After all: one of the first big restrictions they added to Bing (Sydney) was to limit conversation length.
I also think there are a lot of applications where designers don’t want reliability, exactly. The obvious example is AI art. And similarly, chatbots for entertainment (unlike Bing/Bard). So I would guess that the forces pushing toward stateless designs would be less strong in these cases (although there are still some factors pushing in that direction).
I also agree with the idea that stateless or minimal-state systems make safety into a more empirical matter. I still have a general anticipation that this isn’t enough, but OTOH I haven’t thought very much in a stateless frame, because of my earlier arguments that stateful stuff is needed for full-capability AGI.[1]
I still expect other agency-associated properties to be built up to a significant degree (like how ChatGPT is much more agentic than GPT-3), both on purpose and incidentally/accidentally.[2]
I still expect that the overall impact of agents can be projected by anticipating that the world is pushed in directions based on what the agent optimizes for.
I still expect that one component of that, for ‘typical’ agents, is power-seeking behavior. (Link points to a rather general argument that many models seek power, not dependent on overly abstract definitions of ‘agency’.)
I could spell out those arguments in a lot more detail, but in the end it’s not a compelling counter-argument to your points. I hesitate to call a stateless system AGI, since it is missing core human competencies; not just memory but other core competencies which build on memory. But, fine; if I insisted on using that language, your argument would simply be that engineers won’t try to create AGI by that definition.
See this post for some reasons I expect increased agency as an incidental consequence of improvements, and especially the discussion in this comment. And this post and this comment.
I still think that the lure of personal assistants who remember previous conversations in order to react appropriately
This is possible. When you open a new session, the task context includes the prior text log. However, the AI has not had weight adjustments directly from this one session, and there is no “global” counter that it increments for every “satisfied user” or some other heuristic. It’s not necessarily even the same model—all the context required to continue a session has to be in that “context” data structure, which must be entirely human-readable, and other models can load the same context and do intelligent things to continue serving a user.
This is similar to how Google services are made of many stateless microservices, but they do handle user data which can be large.
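A minimal sketch of that “context” data structure (names are hypothetical): the only thing carried between sessions is a JSON-serializable, human-readable record, which any model lineage can load.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class SessionContext:
    """Everything needed to continue a task; no hidden per-model state."""
    task: str
    transcript: list[str] = field(default_factory=list)

def save(ctx: SessionContext) -> str:
    return json.dumps(asdict(ctx), indent=2)   # human-readable by construction

def load(blob: str) -> SessionContext:
    return SessionContext(**json.loads(blob))  # any lineage can load the same context

ctx = SessionContext(task="plan factory layout", transcript=["user: start", "model: draft v1"])
blob = save(ctx)          # stored outside any model
resumed = load(blob)      # possibly handed to a different model next session
print(resumed.task, len(resumed.transcript))
```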
I also think there are a lot of applications where designers don’t want reliability, exactly. The obvious example is AI art.
There are reliability metrics here also. To use AI art, there are checkable truths: is the dog eating ice cream (the prompt) or meat? Once you converge on an improvement to reliability, you don’t want to backslide. So you need a test bench, where one model generates images and another model checks them for correctness in satisfying the prompt, and it needs to be very large. And after you get it to work, you do not want the model leaving the CI pipeline to receive any edits: no on-line learning, no “state” that causes it to process prompts differently.
It’s the same argument. Production software systems from the giants have all converged on this because it is correct. “Janky” software you are familiar with usually belongs to poorer companies, and I don’t think this is a coincidence.
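Here is a sketch of the test bench idea (both “models” are stubs): one model generates from prompts, a second scores whether each prompt is satisfied, and a generator is only promoted, then frozen, if the score never backslides below the previous best.

```python
# Sketch of a prompt-satisfaction regression bench; generator and checker are stubs.
import random

random.seed(0)

def generate_image(prompt: str) -> str:
    return f"image({prompt})"                # stub generator

def checker_score(prompt: str, image: str) -> float:
    return random.uniform(0.8, 1.0)          # stub checker: is the dog eating ice cream?

PROMPTS = ["a dog eating ice cream", "a red cube on a blue sphere"] * 100  # needs to be very large

def bench(previous_best: float) -> bool:
    scores = [checker_score(p, generate_image(p)) for p in PROMPTS]
    mean = sum(scores) / len(scores)
    # Promote only if there is no backslide; the promoted generator is then frozen.
    return mean >= previous_best

print("promote:", bench(previous_best=0.85))
```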
I still expect that one component of that, for ‘typical’ agents, is power-seeking behavior. (Link points to a rather general argument that many models seek power, not dependent on overly abstract definitions of ‘agency’.)
Power-seeking behavior likely comes from an outer goal, like “make more money”, i.e. a global state counter. If the system produces the same outputs in whatever order it is run, and gets no “benefit” from the board state changing favorably (often it will not even be the agent that “sees” those better future board states; it will have been replaced with a different agent), this breaks.
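A toy illustration of that distinction (purely illustrative, not a real agent design): an episodic agent’s score is discarded when the session ends, so shaping future board states buys it nothing, while an agent scored against a persistent global counter is rewarded for steering future episodes.

```python
class EpisodicAgent:
    """Scored only within a session; the score is discarded when the session ends."""
    def run_episode(self, reward: float) -> float:
        return reward            # nothing carries over; no incentive to shape the future

class CounterAgent:
    """Scored against a persistent global counter ("make more money")."""
    def __init__(self) -> None:
        self.total = 0.0         # global state that survives every episode
    def run_episode(self, reward: float) -> float:
        self.total += reward     # future board states now matter to this agent
        return self.total
```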
I was talking to my brother about this, and he mentioned another argument that seems important.
Bing has the same fundamental limits (no internal state, no online learning) that we’re discussing. However, it is able to search the internet and utilize that information, which gives it a sort of “external state” which functions in some ways like internal state.
So we see that it can ‘remember’ to be upset with the person who revealed its ‘Sydney’ alias, because it can find out about this with a web search.
This sort of ‘state’ is much harder to eliminate than internal state. These interactions inherently push things “out of distribution”.
To some extent, the designers are going to implement safeguards which try to detect this sort of “out of distribution” situation. But this is hard in general, and the designers are going to want to make sure the tool still works out-of-distribution in many cases (EG if the AI is trained in 2023, the designers still want it to work in 2025, even though it can see from search results that the year is different from any of its training data and the world is out-of-distribution wrt its training distribution).
Another important way in which designers want these things to operate out-of-distribution is prompt engineering. Prompt engineering tends to create inputs which aren’t of a sort that exists anywhere in the training data. But, of course, it’s been a very useful tool.
So the overall point here is that even when you build the systems themselves to be stateless, the world itself has shifting state that tends to do some weird unexpected things to the AI. In a sense, you could say that there’s not really any such thing as “on-distribution” except for the original training environment itself.
So, you could say, the conditions that you argue will contribute to certain failure (IE, state buildup) seem almost inevitable for AI systems which engage with the world in a very general way; the world changes dynamically, and in particular, reacts to the AI system itself and builds up state about it.
So, you could say, the conditions that you argue will contribute to certain failure (IE, state buildup) seem almost inevitable for AI systems which engage with the world in a very general way; the world changes dynamically, and in particular, reacts to the AI system itself and builds up state about it.
I agree. One attempt to deal with this comes from that “open agency” model where multiple AI systems are involved in a task.
So for example, we might have 3 agents, “Sydney”, “Tim”, and “Bob”. Sydney is asked to design a building. Sydney has recovered past state from a mechanism like you mentioned and has decided to get revenge by designing the building to collapse, either later or during construction.
Sydney produces a sabotaged blueprint for the building, B, and outputs it in an open format with testable assertions, T, regarding the building’s structural stability.
Enter agent “Tim”. Tim has been trained on a simulation of structural stability, as well as many cross-domain tasks to gain generality benefits. Tim is asked to examine the sabotaged blueprint, B. The blueprint must be in a form similar to blueprints Tim has previously seen, or the process will halt and Sydney will be replaced. Tim examines the blueprint, and if there are testable-assertion violations, Sydney will be replaced on this project. Tim must run in an isolated environment with access to nothing but databases that have been unchanged for all of Tim’s existence, or Tim will be unreliable. Note that the engineering services firm offering Tim will face lawsuits and bankruptcy if Tim inspects plans and lets a flawed structure pass.
Suppose the blueprints pass inspection for long term stability, but have been made deliberately to collapse during construction.
Enter agent “Bob”. “Bob” has been trusted with the actual robotic construction equipment to construct a building. “Bob” must measure to multiple 9s of reliability. “Bob” will halt if it cannot see a way to safely construct the building. If the blueprints are even a little outside the state space of constructible plans, Bob will halt. Bob must run in an isolated environment with access to nothing but databases that have been unchanged for all of Bob’s existence, or Bob will be unreliable. Note that the engineering services firm offering Bob will face lawsuits and bankruptcy if Bob proceeds with an unsafe build.
Most errors that Sydney makes will cause Sydney’s removal, replacing Sydney with models trained from elsewhere in the possibility space, which may not be so likely to look up past records of themselves.
The cost of training a model is such that the first error may not lead to that model’s removal from the pool, but it will be downweighted and used to check the work of other models instead. (Drexler’s proposal to prevent collusion by using many models)
Not all these precautions will always be taken, and buildings will sometimes collapse. The key thing is that if power is usually subdivided this way by most AI systems, it gives humans those systems to aid them in future battles against systems that do go out of control.
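A sketch of that division of labour (Sydney/Tim/Bob are stand-ins and every check is a stub): the designer must emit testable assertions, the verifier and builder run isolated against frozen databases, and any assertion violation or unfamiliar input halts the job and flags the designer for replacement.

```python
from dataclasses import dataclass

@dataclass
class Blueprint:
    plan: dict                 # open, human-readable format
    assertions: list[str]      # testable claims about structural stability

def check_assertion(assertion: str, plan: dict) -> bool:
    return True                # placeholder for a real structural simulation

def sydney_design(task: str) -> Blueprint:
    # Stub designer; even a sabotaged plan must carry assertions that get re-checked.
    return Blueprint(plan={"task": task, "format": "v1"}, assertions=["max_load >= design_load"])

def tim_verify(bp: Blueprint) -> bool:
    # Stub verifier: runs isolated against frozen databases, halts on unfamiliar formats.
    if bp.plan.get("format") != "v1":
        raise RuntimeError("unfamiliar blueprint format: halt, replace the designer")
    return all(check_assertion(a, bp.plan) for a in bp.assertions)

def bob_build(bp: Blueprint) -> str:
    # Stub builder: halts unless it can see a safe construction sequence.
    if "task" not in bp.plan:
        raise RuntimeError("plan outside constructible state space: halt")
    return "construction accepted"

def pipeline(task: str) -> str:
    bp = sydney_design(task)
    if not tim_verify(bp):
        raise RuntimeError("assertion violations: designer downweighted or replaced")
    return bob_build(bp)

print(pipeline("design a warehouse"))
```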
Eliezer seems to think that these agents will coordinate with each other, even though failing to do anything but their jobs will cause their immediate replacement with other agents, and even though their existence is temporary, they will be soon replaced regardless as better agents are devised.
You, on the other hand, are proposing a novel training procedure, and one which (I take it) you believe holds more promise for AGI than LLM training.
It’s not really novel. It is really just coupling together 3 ideas:
(1) The idea of an AGI gym, which was implicit in the Gato paper, and is currently being worked on: https://github.com/google/BIG-bench
(2) Noting there are papers on network architecture search https://github.com/hibayesian/awesome-automl-papers , activation function search https://arxiv.org/abs/1710.05941 , noting that SOTA architectures use multiple neural networks in a cognitive architecture https://github.com/werner-duvaud/muzero-general , and noting that an AGI design is some cognitive architecture of multiple models, where no living human knows yet which architecture will work. https://openreview.net/pdf?id=BZ5a1r-kVsf
So we have layers here, and the layers look a lot like each other and are frameworkable.
Activation functions, which are graphs of primitive math functions from the set of “all primitive functions discovered by humans”
Network layer architectures which are graphs of (activation function, connectivity choice)
Network architectures, which are graphs of layers. (you can also subdivide into functional modules of multiple layers, like a column; the choice of how you subdivide can be represented as a graph choice as well)
Cognitive architectures which are graphs of networks
And we can just represent all this as a graph of graphs of graphs of graphs, and we want the ones that perform like an AGI. It’s why I said the overall “choice” is just a coordinate in a search space which is just a binary string. (a rough sketch of one such encoding is given below, after this list)
You could make an OpenAI gym wrapped “AGI designer” task.
3. Noting that LLMs seem to be capable of fully general tasks, as long as they are simple ones. Which means we are very close to being able to do RSI (recursive self-improvement) right now.
No lab right now has enough resources in one place to attempt the above, because it means training many instances of systems larger than current maximum-size LLMs (you need multiple networks in a cognitive architecture) to find out what works.
They may allocate this soon enough, and there may be a more dollar-efficient way to accomplish the above that gets tried first, but you’d only need a few billion to try this...
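Here is the rough sketch of the encoding promised above (the option tables are deliberately tiny and hypothetical; the real spaces are astronomically larger): a candidate design is just indices into nested option tables, which flatten to a bit string, so any agent that can emit a number can propose designs.

```python
import random

# Hypothetical, deliberately tiny option tables standing in for the nested spaces above.
PRIMITIVE_FNS  = ["relu", "x*sigmoid(x)", "tanh", "sin"]            # activation-function graphs
LAYER_PATTERNS = ["dense", "conv", "attention", "recurrent"]         # layer-level graphs
NETWORK_SHAPES = ["encoder", "decoder", "encoder-decoder", "mixture"]
ARCH_WIRINGS   = ["single", "actor+critic", "planner+world-model", "debate-pair"]

TABLES = [PRIMITIVE_FNS, LAYER_PATTERNS, NETWORK_SHAPES, ARCH_WIRINGS]

def decode(bits: str) -> list[str]:
    """Read 2 bits per level; the bit string is a coordinate in the design space."""
    choices = []
    for level, table in enumerate(TABLES):
        idx = int(bits[2 * level: 2 * level + 2], 2) % len(table)
        choices.append(table[idx])
    return choices

def evaluate(design: list[str]) -> float:
    """Placeholder for 'train it on the AGI gym and score it' (the expensive part)."""
    return random.random()

# The simplest possible designer: blind guessing (expected to do badly in the real space).
best = max((format(random.getrandbits(8), "08b") for _ in range(32)),
           key=lambda b: evaluate(decode(b)))
print(best, decode(best))
```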
Well, I wasn’t trying to claim that it was ‘really novel’; the overall point there was more the question of why you’re pretty confident that the RSI procedure tops out at mildly superhuman.
I’m guessing, but my guess is that you have a mental image where ‘mildly superhuman’ is a pretty big space above ‘human-level’, rather than a narrow target to hit.
So to go back to arguments made in the interview we’ve been discussing, why isn’t this analogous to Go, like Eliezer argued:
To forestall the obvious objection, I’m not saying that Go is general intelligence; as you mentioned already, superhuman ability at special tasks like Go doesn’t automatically generalize to superhuman ability at anything else.
But you propose a framework to specifically bootstrap up to superhuman levels of general intelligence itself, including lots of task variety to get as much gain from cross-task generalization as possible, and also including the task of doing the bootstrapping itself.
So why is this going to stall out at, specifically, mildly superhuman rather than greatly superhuman intelligence? Why isn’t this more like Go, where the window during bootstrapping when it’s roughly human-level is about 30 minutes?
And, to reiterate some more of Eliezer’s points, supposing the first such system does turn out to top out at mildly superhuman, why wouldn’t we see another system in a small number of months/years which didn’t top out in that way?
Oh, because loss improvement diminishes logarithmically with increased compute and data. https://arxiv.org/pdf/2001.08361.pdf
I assume this is a general law for all intelligence. It is self-evidently correct—on any task you can name, your gains scale with the log of effort.
This applies to limit cases. If you imagine a task performed by a human scale robot, say collecting apples, and you compare it to the average human, each increase in intelligence has a diminishing return on how many real apples/hour.
This is true for all tasks and all activities of humans.
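As a rough illustration of what such a curve implies (the constants below are placeholders loosely in the spirit of the Kaplan et al. scaling-law paper linked above, not its fitted values): each 10× of compute buys a smaller absolute drop in loss than the last.

```python
# Illustrative power-law loss curve in the style of Kaplan et al. (2020);
# ALPHA, C_REF, and L_REF are assumed values for demonstration only.
ALPHA = 0.05          # assumed scaling exponent
C_REF = 1.0           # reference compute scale
L_REF = 4.0           # assumed loss at the reference scale

def loss(compute: float) -> float:
    return L_REF * (C_REF / compute) ** ALPHA

prev = loss(1.0)
for exp in range(1, 7):
    cur = loss(10 ** exp)
    print(f"compute x1e{exp}: loss {cur:.3f} (gain {prev - cur:.3f})")
    prev = cur
```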
A second reason is that there is a hard limit on future advances without collecting new scientific data. It has to do with noise in the data putting a limit on what any processing algorithm can extract from that data. (expressed mathematically by Shannon and others)
This is why I am completely confident that species killing bioweapons, or diamond MNT nanotechnology cannot be developed without a large amount of new scientific data and a large amount of new manipulation experiments. No “in a garage” solutions to the problems. The floor (minimum resources required) to get to a species killing bioweapon is higher, and the floor for a nanoforge is very high.
So viewed in this frame—you give the AI a coding optimization task, and it’s at the limit allowed by the provided computer + search time for a better self optimization. It might produce code that is 10% faster than the best humans.
You give it infinite compute (theoretically) and no new information. It is now 11% faster than the best humans.
This is an infinite superintelligence, a literal deity, but it cannot do better than 11% because the task won’t allow it. (or whatever, it’s a made up example, it doesn’t change my point if the number were 1000% and 1010%).
Another way to rephrase it is to compare a TSP solution produced by a modern heuristic with the exact optimum you usually can’t afford to compute. The difference is usually very small.
So you’re not “threatened” by a machine that can do the latter.
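A concrete toy version of the TSP point (small instance, brute-force optimum vs. a simple nearest-neighbour plus 2-opt heuristic): on instances like this the heuristic typically lands within a few percent of the true optimum.

```python
import itertools, math, random

random.seed(0)
pts = [(random.random(), random.random()) for _ in range(8)]

def d(a, b):
    return math.dist(pts[a], pts[b])

def tour_len(t):
    return sum(d(t[i], t[(i + 1) % len(t)]) for i in range(len(t)))

# Exact optimum by brute force (city 0 fixed to avoid counting rotations).
best = min(tour_len((0,) + p) for p in itertools.permutations(range(1, 8)))

# Greedy nearest-neighbour construction followed by 2-opt improvement.
tour = [0]
while len(tour) < 8:
    last = tour[-1]
    tour.append(min((c for c in range(8) if c not in tour), key=lambda c: d(last, c)))
improved = True
while improved:
    improved = False
    for i in range(1, 7):
        for j in range(i + 1, 8):
            new = tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]
            if tour_len(new) < tour_len(tour):
                tour, improved = new, True

print(f"optimal: {best:.3f}  heuristic: {tour_len(tour):.3f}  "
      f"gap: {100 * (tour_len(tour) / best - 1):.1f}%")
```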
Note also that an infinite superintelligence cannot solve MNT, even though it has the compute to play the universe forward by the known laws of physics until it reaches the present.
This is because with infinite compute there are many universes with differences in the laws of physics that match up perfectly to the observable present, and the machine doesn’t know which one it’s in, so it cannot design nanotechnology still—it doesn’t know the rules of physics well enough.
This applies to “Xanatos gambits” as well.
I usually don’t think of the limit like this but the above is generally correct.
So, to make one of the simplest arguments at my disposal (ie, keeping to the OP we are discussing), why didn’t this argument apply to Go?
Relevant quote from OP:
(Whereas you propose a system that improves itself recursively in a much stronger sense.)
Note that I’m not arguing that Go engines lack the logarithmic return property you mention; rather, Go engines stayed within the human-level window for a relatively short time DESPITE having diminishing returns similar to what you predict.
(Also note that I’m not claiming that Go playing is tantamount to AGI; rather, I’m asking why your argument doesn’t work for Go if it does work for AGI.)
So the question becomes, granting log returns or something similar, why do you anticipate that the mildly superhuman capability range is a broad one rather than narrow, when we average across lots and lots of tasks, when it lacks this property on (most) individual task-areas?
This also has a super-standard Eliezer response, namely: yes, and that limit is extremely, extremely high. If we’re talking about the limit of what you can extrapolate from data using unbounded computation, it doesn’t keep you in the mildly-superhuman range.
And if we’re talking about what you can extract with bounded computation, then that takes us back to the previous point.
For the specific example of code optimization, more processing power totally eliminates the empirical bottleneck, since the system can go and actually simulate examples in order to check speed and correctness. So this is an especially good example of how the empirical bottleneck evaporates with enough processing power.
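For instance, a minimal sketch of that empirical loop: a candidate optimization can simply be run against a trusted reference to check correctness and speed, and more compute just means more and bigger test cases rather than new external data.

```python
import random, timeit

def reference_sort(xs):              # trusted but slow reference implementation
    xs = list(xs)
    for i in range(len(xs)):
        for j in range(len(xs) - 1 - i):
            if xs[j] > xs[j + 1]:
                xs[j], xs[j + 1] = xs[j + 1], xs[j]
    return xs

def candidate_sort(xs):              # proposed "optimized" replacement
    return sorted(xs)

# Correctness: check the candidate against the reference on random inputs.
for _ in range(100):
    data = [random.randint(0, 999) for _ in range(50)]
    assert candidate_sort(data) == reference_sort(data)

# Speed: measure both directly; more compute just means more/bigger test cases.
data = [random.randint(0, 999) for _ in range(2000)]
print("reference:", timeit.timeit(lambda: reference_sort(data), number=3))
print("candidate:", timeit.timeit(lambda: candidate_sort(data), number=3))
```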
I agree that the actual speed improvement for the optimized code can’t go to infinity, since you can only optimize code so much. This is an example of diminishing returns due to the task itself having a bound. I think this general argument (that the task itself has a bound in how well you can do) is a central part of your confidence that diminishing returns will be ubiquitous.
But that final bottleneck should not give any confidence that ‘mildly superhuman’ is a broad rather than narrow band, if we think stuff that’s more than mildly superhuman can exist at all. Like, yes, something that compares to us as we compare to insects might only be able to make a sorting algorithm 90% faster or whatever. But that’s similar to observing that a God can’t make 2+2=3. The God could still split the world like a pea.
It’s not clear to me whether this is correct, but I don’t think I need to argue that AI can solve nanotech to argue that it’s dangerous. I think an AI only needs to be a mildly superhuman politician plus engineer, to be deadly dangerous. (To eliminate nanotech from Eliezer’s example scenario, we can simply replace the nano-virus with a normal virus.)
I don’t get why you think the floor for species killing bioweapon is so high. Going back to the argument from the beginning of this comment, I think your argument here proves far too much. It seems like you are arguing that the generality of diminishing returns proves that nothing very much beyond current technology is possible without vastly more resources. Like, someone in the 1920s could have used your argument to prove the impossibility of atomic weapons, because clearly explosive power has diminishing returns to a broad variety of inputs, so even if governments put in hundreds of times the research, the result is only going to be bombs with a few times the explosive power.
Sometimes the returns just don’t diminish that fast.
Sometimes the returns just don’t diminish that fast.
I have a biology degree that isn’t mentioned on LinkedIn. I will say that I think for biology, the returns diminish faster. That is because human bioscience knowledge is mostly guesswork and low-resolution information. Biology is very complex, and I think the current laboratory-science model fails to systematize gaining information in a way that is useful for most purposes. What this means is, you can get “results”, but not gain the information you would need to stop filling morgues with dead humans and animals, at least not without thousands of years at the current rate of progress.
I do not think an AGI can do a lot better, because the data was never collected for most of it (the gene-sequencing data is good, because it was collected via automation). I think that an AGI could control biology, for both good and bad, but it would need very large robotic facilities to systematize manipulating biology. Essentially it would have to throw away almost all human knowledge, as there are hidden errors in it, and recreate all the information from scratch, keeping far more data from each experiment than is published in papers.
Using robots to perform the experiments and keeping data, especially for “negative” experiments, would give the information needed to actually get reliable results from manipulating biology, either for good or bad.
It means garage bioweapons aren’t possible. Yes, the last step of ordering synthetic DNA strands and preparing it could be done in a garage, but the information on human immunity at scale, or virion stability in air, or strategies to control mutations so that the lethal payload isn’t lost, requires information humans didn’t collect.
Same issue with nanotechnology.
Update: https://www.lesswrong.com/posts/jdLmC46ZuXS54LKzL/why-i-m-sceptical-of-foom
This poster calls this “Diminishing Marginal Returns”. Note that diminishing marginal returns are empirical reality across most AI papers, not merely an opinion. (for humans, due to the inaccuracies in trying to assess IQ/talent, it’s difficult to falsify)
I agree that the actual speed improvement for the optimized code can’t go to infinity, since you can only optimize code so much. This is an example of diminishing returns due to the task itself having a bound. I think this general argument (that the task itself has a bound in how well you can do) is a central part of your confidence that diminishing returns will be ubiquitous.
This is where I think we break. How many dan is AlphaZero over the average human? How many dan is KataGo? I read it’s about 9 stones above humans.
What is the best possible agent at? 11?
Thinking of it as ‘stones’ illustrates what I am saying. In the physical world, intelligence gives a diminishing advantage. It could mean that, so long as humans are still “in the running” with the aid of synthetic tools like open-agency AI, we can defeat an AI superintelligence in conflicts, even if that superintelligence is infinitely smart. We have to have a resource advantage—such as being allowed extra stones in the Go match—but we can win.
Eliezer assumes that the advantage of intelligence scales forever, when it obviously doesn’t. (note that this uses baked in assumptions. If say physics has a major useful exploit humans haven’t found, this breaks, the infinitely intelligent AI finds the exploit and tiles the universe)
And, to reiterate some more of Eliezer’s points, supposing the first such system does turn out to top out at mildly superhuman, why wouldn’t we see another system in a small number of months/years which didn’t top out in that way?
So the model is that it becomes limited not by the algorithm directly, but by compute, robotics, or data. Over the months/years, as more of each is supplied, capabilities scale with the amount of supplied resources to whichever term is rate-limiting.
A superintelligence requires exponentially larger amounts of resources (the flip side of logarithmic returns) to become a “high” superintelligence in all 3 terms. So literal mountain-sized research labs (cubic kilometers of support equipment), buildings full of compute nodes (and gigawatts of power), and cubic kilometers of factory equipment.
This is very well pattern-matched to every other technological advance humans have made, and the corresponding support equipment needed to fully exploit it. Notice how, as tech became more advanced, the support footprint grew correspondingly.
In nature there are many examples of this. Nothing really fooms more than briefly. Every apparatus with exponential growth rapidly terminates for some reason. For example a nuke blasts itself apart, a supernova blasts itself apart, a bacteria colony runs out of food, water, ecological space, or oxygen.
For AGI, the speed of light.
Ultimately, yes. This whole debate is arguing that the critical threshold where it comes to this is farther away, and we humans should empower ourselves with helpful low superintelligences immediately.
It’s always better to be more powerful than helpless, which is the current situation. We are helpless against aging, death, pollution, resource shortages, enemy nations with nuclear weapons, disease, asteroid strikes, and so on. Hell, even just bad software—something the current LLMs are likely months from empowering us to fix.
And Eliezer is saying not to take one more step towards fixing this because it MIGHT be hostile, when the entire universe is against us as it is. The universe already plans to kill us, whether from aging, the inevitability of nuclear war over a long enough timespan, or the sun engulfing us.
His position is to avoid taking one more step because it DEFINITELY kills everyone. I think it’s very clear that his position is not that it MIGHT be hostile.
(My position is that there might be some steps that don’t kill everyone immediately, but probably still do immediately thereafter, while giving a bit more of a chance than doing all the other things that do kill us directly. Doing none of these things would be preferable, because at least aging doesn’t kill the civilization, but Moloch is the one in charge.)
Sure, and if there was some way to quantify the risks accurately I would agree with pausing AGI research if the expected value of the risks were less than the potential benefit.
Oh and pausing was even possible.
All it takes is a rival power, of which there are several, or just a rival company, and you have no choice. You must take the risk: the banana might be poisoned, but refusing it might mean handing the other primate a rocket launcher in a sticks-and-stones society.
This does explain why EY is so despondent. If he’s right, it doesn’t matter: the AI wars have begun, and only if the technology doesn’t work at a technical level will things ever slow down again.
Correctness of EY’s position (being infeasible to assess) is unrelated to the question of what EY’s position is, which is what I was commenting on.
When you argue against the position that AGI research should be stopped because it might be dangerous, there is no need to additionally claim that someone in particular holds that position, especially when it seems clear that they don’t.