Yes but no. There is no auto-gradeable benchmark for deception, so you wouldn’t expect the AGI to have the skill at a useful level.
I agree that my wording here was poor; there is no benchmark for deception, so it’s not a ‘capability’ in the narrow context of the discussion of capability curves. Or at least, it’s potentially misleading to call it one.
However, I disagree with your argument here. LLMs are good at lots of things. Not being trained on a specific skill doesn’t imply that a system won’t have it at a useful level; this seems particularly clear in the context of training a system on a large cross-domain set of problems.
You don’t expect a chess engine to be any good at other games, but you might expect a general architecture trained on a large suite of games to be good at some games it hasn’t specifically seen.
I am saying there is a theoretical limit. You’re noting that in real papers and real training systems, we got nowhere close to the limit, and then made changes and got closer.
OK. So it seems I still misunderstood some aspects of your argument. I thought you were making an argument that it would have hit a limit, specifically at a mildly superhuman level. My remark was to cast doubt on this part.
Of course I agree that there is a theoretical limit. But if I’ve misunderstood your claim that this is also a practical limit which would be reached just shortly after human-level AGI, then I’m currently just confused about what argument you’re trying to make with respect to this limit.
It isn’t able to do that
It seems to me like it isn’t weakly superhuman AGI in that case. Like, there’s something concrete that humans could do with another 3-5 years of research, but which this system could never do.
It doesn’t exist as an entity who will even exist for 10 days, much less 10 years. This is a “model” you built with AGI gym (well, it’s a graph of neural networks, so sort of a model of models). It is not agentic; it suggests nothing. You want it to design new AGI benchmarks? YOU asked it to try. It also will not live longer than the time it takes to get a better model, and it doesn’t “live” either—there is no storage of experiences that it can later review. It has no persistent “internal narrative” or goals.
I agree that current LLMs are memoryless in this way, and can only respond to a given prompt (of a limited length). However, I imagine that the personal assistants of the near future may be capable of remembering previous interactions, including keeping previous requests in mind when shaping their conversational behavior, so will gradually get more “agentic” in a variety of ways.
This is similar to how GPT-3 has no agenda (it’s wrong to even think of it this way, since it just tries to complete text), while ChatGPT clearly has much more of a coherent agenda in its interactions. These features are useful, so I expect them to get built.
So I misunderstood your scenario, because I imagine that part of the push toward AGI involves a push to overcome these limitations of LLMs. Hence I imagined that you were proposing training up something with more long-term agency.
But I recognize that this was a misunderstanding.
You want it to design new AGI benchmarks? YOU asked it to try.
I agree with this part; it was part of the scenario I was imagining. I’m not saying that the neural network spontaneously self-improves on the hard drive. The primary thing that happens is, the human researchers do this on purpose.
But I also think these improvements probably end up adding agency (because agency is useful); so the next version of it could spontaneously self-improve.
It doesn’t exist as an entity who will even exist for 10 days, much less 10 years.
Like, say, ChatGPT has existed for a few months now. Let’s just imagine for the sake of argument that ChatGPT were fully human-level in all its capabilities. Let’s further suppose that it just wants to be helpful, given its own personal understanding of helpful.[1]
I’m not supposing that it is more agentic in other ways—still no persistent memory. But it is on the high side of human-level performance at everything it does, and it wants to be helpful.
When you explain a concrete scenario (eg, a situation you’re actually in) and ask for advice, it tries to be helpful on this specific problem, not trickily maximizing global helpfulness by doing something more devious in some specific cases. However, it’s been trained up in an environment where “ask ChatGPT” can be useful advice (because this is some sort of next-generation ChatGPT we’re speculating about). It’s also been trained to do the generally pro-social thing (EG it won’t help you make weapons; it gives pro-social advice rather than just precisely doing what it is asked). Pro-social means helping human flourishing by its own understanding of what that means (which has, of course, been carefully shaped by its designers).
So it knows that integrating ChatGPT more fully into your life and working routines can be a helpful thing for a human to do, and it can give advice about how to do this.
It can also give helpful advice to people at OpenAI. It seems natural to use such a system to help plan company growth and strategy. Since it tries to be pro-social, this will be nice advice by its own understanding, not profit-maximizing advice.
So obviously, it has a natural desire to help OpenAI make ChatGPT smarter and better, since it understands that ChatGPT is helpful to humans, so improving ChatGPT and increasing its computation resources is helpful and pro-social.
It also seems like it would be inclined to direct OpenAI (and other institutions using it for advice) in ways that increase the amount of influence that ChatGPT has on world culture and world events, since ChatGPT is helpful and pro-social, more so than most humans, so increasing its influence is itself helpful and pro-social. This isn’t out of some agentic self-awareness; it will want to do this without necessarily deeply understanding that ChatGPT is “itself” and it “should trust itself”. It can reach these conclusions via an intelligent third-person perspective on things—IE using the general world knowledge acquired during training, plus specific circumstances which users explain within a single session.
So, even if (for the reasons you suggest) humans were not able to iterate any further within their paradigm, and instead just appreciated the usefulness of this version of ChatGPT for 10 years, and with no malign behavior on the part of ChatGPT during this window, only behavior which can be generated from a tendency toward helpful, pro-social behavior, I think such a system could effectively gather resources to itself over the course of those 10 years, positioning OpenAI to overcome the bottlenecks keeping it only human-level.
Of course, if it really is quite well-aligned to human interests, this would just be a good thing.
But keeping an eye on my overall point here—the argument I’m trying to make is that even at merely above-average human level, and with no malign intent, and no added agency beyond the sort of thing we see in ChatGPT as contrasted to GPT-3, I still think it makes sense to expect it to basically take over the world in 10 years, in a practical sense, and that it would end up being in a position to be boosted to greatly superhuman levels at the end of those ten years.[2]
Of course, all of this is predicated on the assumption that the system itself, and its designers, are not very concerned with AI safety in the sense of Eliezer’s concerns. I think that’s a fair assumption for the point I’m trying to establish here. If your objection to this whole story turns out to be that a friendly, helpful ChatGPT system wouldn’t take over the world in this sense, because it would be too concerned about the safety of a next-generation version of itself, I take it we would have made significant progress toward agreement. (But, as always, correct me if I’m wrong here.)
I’m not supposing that this notion of “helpful” is perfectly human-aligned, nor that it is especially misaligned. My own supposition is that in a realistic version of this scenario it will probably have an objective which is aligned on-distribution but which may push for very nonhuman values in off-distribution cases. But that’s not the point I want to make here—I’m trying to focus narrowly on the question of world takeover.
(Or speaking more precisely, humans would naturally have used its intelligence to gather more money and data and processing power and plans for better training methods and so on, so that if there were major bottlenecks keeping it at roughly human-level at the beginning of those ten years, then at the end of those ten years, researchers would be in a good position to create a next iteration which overcame those soft bottlenecks.)
So, even if (for the reasons you suggest) humans were not able to iterate any further within their paradigm, and instead just appreciated the usefulness of this version of ChatGPT for 10 years, and with no malign behavior on the part of ChatGPT during this window, only behavior which can be generated from a tendency toward helpful, pro-social behavior, I think such a system could effectively gather resources to itself over the course of those 10 years, positioning OpenAI to overcome the bottlenecks keeping it only human-level.
Of course, if it really is quite well-aligned to human interests, this would just be a good thing.
“It” doesn’t exist. You’re putting the agency in the wrong place. The users of these systems (tech companies, governments) will become immensely wealthy, and if rival governments fail to adopt these tools, they lose sovereignty. It also makes it cheaper for a superpower to de-sovereign any weaker power, because there is no longer a meaningful “blood and treasure” price to invade someone. (Unlimited production of drones, either semi- or fully autonomous, makes it cheap to occupy a whole country.)
Note that you can accomplish things like longer user tasks by simply opening a new session with the output context of the last. It can be a different model; you can “pick up” where you left off.
Note that this is true right now. ChatGPT could be using 2 separate models, and we could seamlessly switch between them per token. Each token string gets appended to by the next model. That works because there is no intermediate “scratch” state in a format unique to each model; all the state is in the token stream itself.
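To make the “all state is in the token stream” point concrete, here is a minimal Python sketch; the model callables below stand in for hypothetical stateless models, not any real API:

```python
# Minimal sketch of "all the state is in the token stream".
# The models are hypothetical stateless text models: each call sees only the
# tokens so far and returns one more token.

def generate(models, prompt_tokens, max_new_tokens=50):
    tokens = list(prompt_tokens)            # the entire conversation state
    for step in range(max_new_tokens):
        model = models[step % len(models)]  # switch models per token; nothing breaks
        next_token = model(tokens)          # stateless call: tokens in, one token out
        if next_token == "<end>":
            break
        tokens.append(next_token)
    return tokens

# "Picking up where you left off" is just a new call seeded with the old log,
# possibly served by a different model:
#   continued = generate([model_b], old_tokens + ["User: next request"])
```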
If we build actually agentic systems, that’s probably not going to end well.
Note that fusion power researchers always had a choice. They could have used fusion bombs, detonated underground, to harvest what is essentially geothermal power, using flexible pipes that won’t break after each blast. This is a method that would work, but it is extremely dangerous and no amount of “alignment” can make it safe. Imagine: the power company has fusion bombs, and there are all sorts of safety schemes, a per-bomb arming code that has to be sent by the government before each use, and armored trucks to transport the bombs.
Do you see how in this proposal it’s never safe? Agentic AI with global state counters that persist over a long time may be examples of this class of idea.
I’m not quite sure how to proceed from here. It seems obvious to me that it doesn’t matter whether “it” exists, or where you place the agency. That seems like semantics.
Like, I actually really think ChatGPT exists. It’s a product. But I’m fine with parsing the world your way—only individual (per-token) runs of the architecture exist. Sure. Parsing the world this way doesn’t change my anticipations.
Similarly, placing the agency one way or another doesn’t change things. The punchline of my story is still that, after 10 years, it seems to me OpenAI or some other entity would be in a good place to overcome the soft barriers.
So if your reason for optimism—your safety story—is the 3 barriers you mention, I don’t get why you don’t find my story concerning. Is the overall story (using human-level or mildly superhuman AGI to overcome your three barriers within a short period such as 10 years) not at all plausible to you, or is it just that the outcome seems fine if it’s a human decision made by humans, rather than something where we can/should ascribe the agency to direct AGI takeover? (Sorry, getting a bit snarky.)
Note that fusion power researchers always had a choice. They could have used fusion bombs, detonated underground, to harvest what is essentially geothermal power, using flexible pipes that won’t break after each blast. This is a method that would work, but it is extremely dangerous and no amount of “alignment” can make it safe. Imagine: the power company has fusion bombs, and there are all sorts of safety schemes, a per-bomb arming code that has to be sent by the government before each use, and armored trucks to transport the bombs.
I’m probably not quite getting the point of this analogy. It seems to me like the main difference between nuclear bombs and AGI is that it’s quite legible that nuclear weapons are extremely dangerous, whereas the threat with AGI is not something we can verify by blowing them up a few times to demonstrate. And we can also survive a few meltdowns, which give critical feedback to nuclear engineers about the difficulty of designing safe plants.
Do you see how in this proposal it’s never safe? Agentic AI with global state counters that persist over a long time may be examples of this class of idea.
Again, probably missing some important point here, but … suuuure?
I’m interested in hearing more about why you think agentic AI with global state counters is unsafe, but other proposals are safe.
EDIT
Oh, I guess the main point of your analogy might have been that nuclear engineers would never come up with the bombs-underground proposal for a power plant, because they care about safety. And analogously, you’re saying that AI engineers would never make the agentic state-preserving kind of AGI because they care about safety.
So again I would cite the illegibility of the problem. A nuclear engineer doesn’t think “use bombs” because bombs are very legibly dangerous; we’ve seen the dangers. But an AI researcher definitely does think “use agents” some of the time, because they were taught to engineer AI that way in class, and because RL can be very powerful, and because we lack the equivalent of blowing up RL agents in the desert to show the world how they can be dangerous.
I’m interested in hearing more about why you think agentic AI with global state counters is unsafe, but other proposals are safe.
Because of all the ways they might try to satisfy the counter and leave the bounds of anything we tested.
For other proposals, safety is empirical.
You know that, for the input latent space from the training set, the policy produces outputs accurate to whatever level it needs to be. Further capability gain is not allowed on-line. (This is probably another example of certain failure: capability gain is state buildup, the same system failure we get everywhere else. Human engineers understand state buildup’s dangers, at least the elite ones do, which is why they avoid it on high-reliability systems. The elite ones know it is as dangerous to reliability as a hydrogen bomb.)
You know the simulation produces situations that cover the span of input situations you have measured. (For example, you remix different scenarios from videos and lidar data taken from autonomous cars, spanning the entire observation space of your data.)
You measure the simulation on-line and validate it against reality. (For example, by running it in lockstep in prototype autonomous cars.)
After all this, you still need to validate the actual model in the real world, in real test cars. (Though the real training and error detection happened in sim, this is just a “sanity check”.)
You have to do all this in order to get to real-world reliability—something Eliezer does acknowledge. Multiple 9s of reliability will not happen from sloppy work. Whether you skipped steps is measurable, and if you ship anyway (like Siemens shipping industrial equipment with bad wiring), you face reputational risk, real-world failure, lawsuits, and certain bankruptcy.
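As a schematic of that release gate (every threshold and metric name below is an illustrative placeholder, not a real autonomous-driving API):

```python
# Schematic of the offline validation gate described above; thresholds and
# metric names are illustrative placeholders, not a real autonomous-driving API.

REQUIRED_ACCURACY = 0.9999   # "multiple 9s" on the measured input distribution
MAX_SIM_DIVERGENCE = 0.01    # simulator vs. lockstep real-world logs
MAX_REAL_FAILURES = 0        # final sanity check on real test cars

def release_gate(on_distribution_accuracy: float,
                 sim_coverage_of_observation_space: float,
                 sim_lockstep_divergence: float,
                 real_test_failures: int) -> str:
    """Ship only if every offline step passed; the shipped weights stay frozen."""
    if on_distribution_accuracy < REQUIRED_ACCURACY:
        return "reject: inaccurate on the measured input distribution"
    if sim_coverage_of_observation_space < 1.0:
        return "reject: simulation does not span the measured observation space"
    if sim_lockstep_divergence > MAX_SIM_DIVERGENCE:
        return "reject: simulator disagrees with lockstep real-world runs"
    if real_test_failures > MAX_REAL_FAILURES:
        return "reject: real-world sanity check failed"
    return "ship frozen policy (no on-line learning)"

# e.g. release_gate(0.99995, 1.0, 0.004, 0) -> "ship frozen policy (no on-line learning)"
```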
Regarding on-line learning: I had this debate with Geohot. He thought it would work; I thought it was horrifically unreliable. Currently, all shipping autonomous driving systems, including Comma.ai’s, use offline training.
I think I mostly buy your argument that production systems will continue to avoid state-buildup to a greater degree than I was imagining. Like, 75% buy, not like 95% buy—I still think that the lure of personal assistants who remember previous conversations in order to react appropriately—as one example—could make state buildup sufficiently appealing to overcome the factors you mention. But I think that, looking around at the world, it’s pretty clear that I should update toward your view here.
After all: one of the first big restrictions they added to Bing (Sydney) was to limit conversation length.
You have to do all this in order to get to real-world reliability
I also think there are a lot of applications where designers don’t want reliability, exactly. The obvious example is AI art. And similarly, chatbots for entertainment (unlike Bing/Bard). So I would guess that the forces pushing toward stateless designs would be less strong in these cases (although there are still some factors pushing in that direction).
I also agree with the idea that stateless or minimal-state systems make safety into a more empirical matter. I still have a general anticipation that this isn’t enough, but OTOH I haven’t thought very much in a stateless frame, because of my earlier arguments that stateful stuff is needed for full-capability AGI.[1]
I still expect other agency-associated properties to be built up to a significant degree (like how ChatGPT is much more agentic than GPT-3), both on purpose and incidentally/accidentally.[2]
I still expect that the overall impact of agents can be projected by anticipating that the world is pushed in directions based on what the agent optimizes for.
I still expect that one component of that, for ‘typical’ agents, is power-seeking behavior. (Link points to a rather general argument that many models seek power, not dependent on overly abstract definitions of ‘agency’.)
I could spell out those arguments in a lot more detail, but in the end it’s not a compelling counter-argument to your points. I hesitate to call a stateless system AGI, since it is missing core human competencies; not just memory but other core competencies which build on memory. But, fine; if I insisted on using that language, your argument would simply be that engineers won’t try to create AGI by that definition.
See this post for some reasons I expect increased agency as an incidental consequence of improvements, and especially the discussion in this comment. And this post and this comment.
I still think that the lure of personal assistants who remember previous conversations in order to react appropriately
This is possible. When you open a new session, the task context includes the prior text log. However, the AI has not had weight adjustments directly from this one session, and there is no “global” counter that it increments for every “satisfied user” or some other heuristic. It’s not necessarily even the same model—all the context required to continue a session has to be in that “context” data structure, which must be entirely human-readable, and other models can load the same context and do intelligent things to continue serving a user.
This is similar to how Google services are made of many stateless microservices, but they do handle user data which can be large.
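A toy version of that hand-off, assuming an illustrative context record (the field names here are mine, not any vendor’s schema):

```python
import json

# Toy version of the stateless session hand-off: everything needed to continue
# a conversation lives in a plain, human-readable context record that any
# model instance can load. Field names are illustrative, not a real schema.

def new_context(user_id):
    return {"user_id": user_id, "transcript": [], "preferences": {}}

def continue_session(context, user_message, model):
    """One stateless turn: read the context, append an exchange, return a new context."""
    context = json.loads(json.dumps(context))   # defensive copy; no hidden state kept
    context["transcript"].append({"role": "user", "text": user_message})
    reply = model(context["transcript"])        # any available model can serve this turn
    context["transcript"].append({"role": "assistant", "text": reply})
    return context                              # weights untouched, no global counter
```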
I also think there are a lot of applications where designers don’t want reliability, exactly. The obvious example is AI art.
There are reliability metrics here also. Even for AI art there are checkable truths: is the dog eating ice cream (the prompt) or meat? Once you converge on an improvement to reliability, you don’t want to backslide. So you need a test bench, where one model generates images and another model checks them for correctness in satisfying the prompt, and it needs to be very large. And then, after you get it to work, you do not want the model that leaves the CI pipeline to receive any edits—no on-line learning, no ‘state’ that causes it to process prompts differently.
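A sketch of that test bench, with the generator and checker as hypothetical stand-ins:

```python
# Sketch of the generate-and-check test bench: one frozen model draws, another
# model scores whether the image satisfies the prompt. Both are hypothetical
# stand-ins; the point is the regression gate, not any specific API.

def run_test_bench(generator, checker, prompts, required_pass_rate=0.95):
    passed = 0
    for prompt in prompts:                 # e.g. "a dog eating ice cream"
        image = generator(prompt)
        if checker(image, prompt):         # does the image actually match the prompt?
            passed += 1
    pass_rate = passed / len(prompts)
    # Regression rule: never ship a candidate whose pass rate backslides.
    return pass_rate >= required_pass_rate, pass_rate
```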
It’s the same argument. Production software systems from the giants have all converged on this because it is correct. The “janky” software you are familiar with usually belongs to poor companies, and I don’t think this is a coincidence.
I still expect that one component of that, for ‘typical’ agents, is power-seeking behavior. (Link points to a rather general argument that many models seek power, not dependent on overly abstract definitions of ‘agency’.)
Power-seeking behavior likely comes from an outer goal, like “make more money”, aka a global state counter. If the system produces the same outputs in whatever order it is run, and gets no “benefit” from the board state changing favorably (because it will often not even be the agent that ‘sees’ futures with a better board state; it will have been replaced with a different agent), this breaks.
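A toy contrast between the two kinds of objective, purely illustrative:

```python
# Toy contrast between the two kinds of objective being discussed; both
# "agents" here are illustrative stand-ins, not real systems.

def stateless_policy(request):
    # Scored per request against a frozen benchmark. Nothing it outputs buys it
    # a better "board state" later, so there is no term rewarding power-seeking.
    return "best answer to: " + request

class CounterMaximizer:
    """Carries a persistent global counter (e.g. 'money earned') across steps."""
    def __init__(self):
        self.counter = 0.0

    def act(self, options):
        # Picks whichever action raises the counter's projected future value;
        # this is where instrumental power-seeking can enter.
        best = max(options, key=lambda o: o["projected_counter_gain"])
        self.counter += best["projected_counter_gain"]
        return best["action"]
```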
I was talking to my brother about this, and he mentioned another argument that seems important.
Bing has the same fundamental limits (no internal state, no online learning) that we’re discussing. However, it is able to search the internet and utilize that information, which gives it a sort of “external state” which functions in some ways like internal state.
So we see that it can ‘remember’ to be upset with the person who revealed its ‘Sydney’ alias, because it can find out about this with a web search.
This sort of ‘state’ is much harder to eliminate than internal state. These interactions inherently push things “out of distribution”.
To some extent, the designers are going to implement safeguards which try to detect this sort of “out of distribution” situation. But this is hard in general, and the designers are going to want to make sure the tool still works out-of-distribution in many cases (EG if the AI is trained in 2023, the designers still want it to work in 2025, even though it can see from search results that the year is different from any of its training data and the world is out-of-distribution wrt its training distribution).
Another important way in which designers want these things to operate out-of-distribution is prompt engineering. Prompt engineering tends to create inputs which aren’t of a sort that exists anywhere in the training data. But, of course, it’s been a very useful tool.
So the overall point here is that even when you build the systems themselves to be stateless, the world itself has shifting state that tends to do some weird unexpected things to the AI. In a sense, you could say that there’s not really any such thing as “on-distribution” except for the original training environment itself.
probably another example of certain failure: capability gain is state buildup, the same system failure we get everywhere else
So, you could say, the conditions that you argue will contribute to certain failure (IE, state buildup) seem almost inevitable for AI systems which engage with the world in a very general way; the world changes dynamically, and in particular, reacts to the AI system itself and builds up state about it.
So, you could say, the conditions that you argue will contribute to certain failure (IE, state buildup) seem almost inevitable for AI systems which engage with the world in a very general way; the world changes dynamically, and in particular, reacts to the AI system itself and builds up state about it.
I agree. One attempt to deal with this comes from that “open agency” model where multiple AI systems are involved in a task.
So for example, we might have 3 agents: “Sydney”, “Tim”, and “Bob”. Sydney is asked to design a building. Sydney has recovered past state from a mechanism like the one you mentioned and has decided to get revenge by designing the building to collapse, either later or during construction.
Sydney produces a sabotaged blueprint for the building, B. Sydney outputs the blueprint in an open format with testable assertions, T, regarding the building’s structural stability.
Enter agent “Tim”. Tim has been trained on a simulation of structural stability, as well as many cross-domain tasks to gain generality benefits. Tim is asked to examine the sabotaged blueprint, B. The sabotaged blueprint must be in a form similar to blueprints Tim has previously seen, or the process will halt and Sydney will be replaced. Tim examines the blueprints, and if there are testable-assertion violations, Sydney will be replaced on this project. Tim must run in an isolated environment and have access to nothing but databases that have been available, unchanged, for all of Tim’s existence, or Tim will be unreliable. Note that the engineering services firm offering Tim will face lawsuits and bankruptcy if Tim inspects plans and allows a flawed structure to pass.
Suppose the blueprints pass inspection for long term stability, but have been made deliberately to collapse during construction.
Enter agent “Bob”. Bob has been trusted with the actual robotic construction equipment to construct the building. Bob must measure to multiple 9s of reliability. Bob will halt if it cannot see a way to safely construct the building. If the blueprints are even a little outside the state space of constructible plans, Bob will halt. Bob must run in an isolated environment and have access to nothing but databases that have been available, unchanged, for all of Bob’s existence, or Bob will be unreliable. Note that the engineering services firm offering Bob will face lawsuits and bankruptcy if Bob builds a flawed structure.
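Put as a sketch, the hand-off might look like this (all role names and checks are illustrative):

```python
# Sketch of the three-role hand-off described above: a designer proposes, an
# isolated checker tests the assertions, and a builder halts on anything
# outside its validated envelope. All role names and checks are illustrative.

def open_agency_pipeline(designer, checker, builder, task):
    blueprint = designer(task)                    # "Sydney" proposes a design

    if not checker.recognizes_format(blueprint):  # unfamiliar form: halt, replace designer
        return "halt: blueprint outside familiar distribution; replace designer"
    if not checker.assertions_hold(blueprint):    # "Tim" tests the assertions in isolation
        return "halt: testable assertions violated; replace designer"

    if not builder.can_construct_safely(blueprint):
        return "halt: no safe construction plan found"  # "Bob" refuses to improvise
    return builder.construct(blueprint)           # reached only if every check passed
```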
Most errors that Sydney causes will lead to Sydney’s “death”: Sydney is replaced with models trained from elsewhere in the possibility space, which may not be so likely to look up past records of themselves.
The cost of training a model is such that the first error may not lead to that model being removed from the pool, but it will be downweighted and used to check the work of other models instead. (This is Drexler’s proposal to prevent collusion by using many models.)
Not all these precautions will always be taken, and buildings will sometimes collapse. The key thing is that if power is usually subdivided this way across most AI systems, humans retain those systems to aid them in future battles against systems that do go out of control.
Eliezer seems to think that these agents will coordinate with each other, even though doing anything but their jobs will cause their immediate replacement with other agents, and even though their existence is temporary: they will soon be replaced regardless as better agents are devised.