I’m not quite sure how to proceed from here. It seems obvious to me that it doesn’t matter whether “it” exists, or where you place the agency. That seems like semantics.
Like, I actually really think ChatGPT exists. It’s a product. But I’m fine with parsing the world your way—only individual (per-token) runs of the architecture exist. Sure. Parsing the world this way doesn’t change my anticipations.
Similarly, placing the agency one way or another doesn’t change things. The punchline of my story is still that after 10 years, so it seems to me, OpenAI or some other entity would be in a good place to overcome the soft barriers.
So if your reason for optimism—your safety story—is the 3 barriers you mention, I don’t get why you don’t find my story concerning. Is the overall story (using human-level or mildly superhuman AGI to overcome your three barriers within a short period such as 10 years) not at all plausible to you, or is it just that the outcome seems fine if it’s a human decision made by humans, rather than something where we can/should ascribe the agency to direct AGI takeover? (Sorry, getting a bit snarky.)
Note that fusion power researchers always had a choice. They could have used fusion bombs, detonated underground, and essentially extracted geothermal power using flexible pipes that won’t break after each blast. This is a method that would work, but it is extremely dangerous and no amount of “alignment” can make it safe. Imagine: the power company has fusion bombs, and there are all sorts of safety schemes, a per-bomb arming code that has to be sent by the government to use each one, and armored trucks to transport the bombs.
I’m probably not quite getting the point of this analogy. It seems to me like the main difference between nuclear bombs and AGI is that it’s quite legible that nuclear weapons are extremely dangerous, whereas the threat with AGI is not something we can verify by blowing them up a few times to demonstrate. And we can also survive a few meltdowns, which give critical feedback to nuclear engineers about the difficulty of designing safe plants.
Do you see how in this proposal it’s never safe? Agentic AI with global state counters that persist over a long time may be examples of this class of idea.
Again, probably missing some important point here, but … suuuure?
I’m interested in hearing more about why you think agentic AI with global state counters are unsafe, but other proposals are safe.
EDIT
Oh, I guess the main point of your analogy might have been that nuclear engineers would never come up with the bombs-underground proposal for a power plant, because they care about safety. And analogously, you’re saying that AI engineers would never make the agentic state-preserving kind of AGI because they care about safety.
So again I would cite the illegibility of the problem. A nuclear engineer doesn’t think “use bombs” because bombs are very legibly dangerous; we’ve seen the dangers. But an AI researcher definitely does think “use agents” some of the time, because they were taught to engineer AI that way in class, and because RL can be very powerful, and because we lack the equivalent of blowing up RL agents in the desert to show the world how they can be dangerous.
I’m interested in hearing more about why you think agentic AI with global state counters are unsafe, but other proposals are safe.
Because of all the ways they might try to satisfy the counter and leave the bounds of anything we tested.
With other proposals, safety is empirical.
You know that for the input latent space from the training set, the policy produced outputs accurate to whatever level it needs to be. Further capabilities gain is not allowed on-line. (probably another example of certain failure -capabilities gain is state buildup, same system failures we get everywhere else. Human engineers understand the dangers of state buildup, at least the elite ones do, which is why they avoid it on high reliability systems. The elite ones know it is as dangerous to reliability as a hydrogen bomb.)
You know the simulation produces situations that cover the span of input situations you have measured. (for example, you remix different scenarios from videos and lidar data taken from autonomous cars, spanning the entire observation space of your data)
You measure the simulation on-line and validate it against reality. (for example by running it in lockstep in prototype autonomous cars)
After all this, you still need to validate the actual model in the real world in real test cars. (though the real training and error detection was sim, this is just a ‘sanity check’)
You have to do all this in order to get to real world reliability, something Eliezer does acknowledge. Multiple 9s of reliability will not happen from sloppy work. If you skipped steps, the shortfall is measurable, and if you ship anyway (like Siemens shipping industrial equipment with bad wiring), you face reputational risk, real world failure, lawsuits, and certain bankruptcy.
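As a rough illustration of the "only trust the policy on inputs you have measured" idea above, here is a minimal sketch. It assumes a toy setup where each input is a small feature vector and "covered" just means inside the per-feature range seen in the measured data; the names `fit_envelope`, `in_envelope`, and `guarded` are hypothetical, and a real system would need a far richer coverage test than a bounding box.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Vector = Sequence[float]
Envelope = List[Tuple[float, float]]


def fit_envelope(samples: Sequence[Vector]) -> Envelope:
    """Record the (min, max) of each feature over the measured inputs."""
    return [(min(col), max(col)) for col in zip(*samples)]


def in_envelope(x: Vector, envelope: Envelope) -> bool:
    """True only if every feature lies inside the measured range."""
    return all(lo <= v <= hi for v, (lo, hi) in zip(x, envelope))


def guarded(policy: Callable[[Vector], float],
            envelope: Envelope) -> Callable[[Vector], Optional[float]]:
    """Wrap a frozen policy so it halts (returns None) on uncovered inputs."""
    def run(x: Vector) -> Optional[float]:
        if not in_envelope(x, envelope):
            return None  # halt instead of extrapolating
        return policy(x)
    return run


# Toy "measured" data: two features per observation (assumed values).
measured = [(0.0, 10.0), (1.0, 12.0), (0.5, 11.0)]
envelope = fit_envelope(measured)

# Stand-in for a trained, frozen policy; nothing here updates on-line.
controller = guarded(lambda x: sum(x), envelope)

inside = controller((0.5, 11.0))   # covered input, policy runs
outside = controller((5.0, 11.0))  # first feature was never measured that high
```

The wrapper holds no counters and learns nothing between calls, so the same input always produces the same result regardless of what it was asked before.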
Regarding on-line learning: I had this debate with Geohot. He thought it would work. I thought it was horrifically unreliable. Currently, all shipping autonomous driving systems, including Comma.ai’s, use offline training.
I think I mostly buy your argument that production systems will continue to avoid state-buildup to a greater degree than I was imagining. Like, 75% buy, not like 95% buy—I still think that the lure of personal assistants who remember previous conversations in order to react appropriately—as one example—could make state buildup sufficiently appealing to overcome the factors you mention. But I think that, looking around at the world, it’s pretty clear that I should update toward your view here.
After all: one of the first big restrictions they added to Bing (Sydney) was to limit conversation length.
You have to do all this in order to get to real world reliability
I also think there are a lot of applications where designers don’t want reliability, exactly. The obvious example is AI art. And similarly, chatbots for entertainment (unlike Bing/Bard). So I would guess that the forces pushing toward stateless designs would be less strong in these cases (although there are still some factors pushing in that direction).
I also agree with the idea that stateless or minimal-state systems make safety into a more empirical matter. I still have a general anticipation that this isn’t enough, but OTOH I haven’t thought very much in a stateless frame, because of my earlier arguments that stateful stuff is needed for full-capability AGI.[1]
I still expect other agency-associated properties to be built up to a significant degree (like how ChatGPT is much more agentic than GPT-3), both on purpose and incidentally/accidentally.[2]
I still expect that the overall impact of agents can be projected by anticipating that the world is pushed in directions based on what the agent optimizes for.
I still expect that one component of that, for ‘typical’ agents, is power-seeking behavior. (Link points to a rather general argument that many models seek power, not dependent on overly abstract definitions of ‘agency’.)
I could spell out those arguments in a lot more detail, but in the end it’s not a compelling counter-argument to your points. I hesitate to call a stateless system AGI, since it is missing core human competencies; not just memory but other core competencies which build on memory. But, fine; if I insisted on using that language, your argument would simply be that engineers won’t try to create AGI by that definition.
See this post for some reasons I expect increased agency as an incidental consequence of improvements, and especially the discussion in this comment. And this post and this comment.
I still think that the lure of personal assistants who remember previous conversations in order to react appropriately
This is possible. When you open a new session, the task context includes the prior text log. However, the AI has not had weight adjustments directly from this one session, and there is no “global” counter that it increments for every “satisfied user” or some other heuristic. It’s not necessarily even the same model—all the context required to continue a session has to be in that “context” data structure, which must be all human readable, and other models can load the same context and do intelligent things to continue serving a user.
This is similar to how Google services are made of many stateless microservices, but they do handle user data which can be large.
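A minimal sketch of that session design, under loud assumptions: the "models" are plain stand-in functions, and `Context`, `serve`, `dump`, and `load` are hypothetical names invented for illustration. The point is only that everything needed to continue a session lives in a human-readable context record, and that a different model can pick it up.

```python
import json
from dataclasses import dataclass
from typing import Callable, Tuple

Model = Callable[[Tuple[str, ...], str], str]


@dataclass(frozen=True)
class Context:
    """All session state, stored outside the model and readable by a human."""
    user_id: str
    log: Tuple[str, ...] = ()


def serve(model: Model, ctx: Context, message: str) -> Tuple[str, Context]:
    """Answer one message; return the reply and a NEW context (no mutation)."""
    reply = model(ctx.log, message)
    new_log = ctx.log + (f"user: {message}", f"assistant: {reply}")
    return reply, Context(ctx.user_id, new_log)


def dump(ctx: Context) -> str:
    return json.dumps({"user_id": ctx.user_id, "log": list(ctx.log)})


def load(text: str) -> Context:
    data = json.loads(text)
    return Context(data["user_id"], tuple(data["log"]))


# Two interchangeable stand-in models; neither keeps weights or counters
# that change between calls.
def model_a(log: Tuple[str, ...], message: str) -> str:
    return f"A saw {len(log)} prior lines"


def model_b(log: Tuple[str, ...], message: str) -> str:
    return f"B saw {len(log)} prior lines"


ctx = Context("u1")
reply1, ctx = serve(model_a, ctx, "hello")

saved = dump(ctx)          # session ends; only this text persists
restored = load(saved)     # a later session, possibly a different model
reply2, ctx2 = serve(model_b, restored, "remember me?")
```

Here the assistant "remembers" the earlier exchange only because the log is handed back in, which is the distinction being drawn from weight updates or a global counter.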
I also think there are a lot of applications where designers don’t want reliability, exactly. The obvious example is AI art.
There are reliability metrics here also. Even for AI art there are checkable truths. Is the dog eating ice cream (the prompt) or meat? Once you converge on an improvement to reliability, you don’t want to backslide. So you need a test bench, where one model generates images and another model checks them for correctness in satisfying the prompt, and it needs to be very large. And then after you get it to work you do not want the model leaving the CI pipeline to receive any edits: no on-line learning, no ‘state’ that causes it to process prompts differently.
It’s the same argument. Production software systems from the giants have all converged on this because it is correct. The “janky” software you are familiar with usually belongs to poor companies, and I don’t think this is a coincidence.
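The test bench described above could look something like the following toy sketch. Everything here is an assumption made for illustration: "images" are just caption strings, the checker is a word-match stand-in for a second model, and `pass_rate` and `release_gate` are hypothetical names.

```python
from typing import Callable, Sequence, Tuple

Generator = Callable[[str], str]      # prompt -> "image" (a caption here)
Checker = Callable[[str, str], bool]  # (prompt, image) -> satisfied?


def pass_rate(generate: Generator, check: Checker,
              prompts: Sequence[str]) -> float:
    """Fraction of bench prompts whose output the checker accepts."""
    passed = sum(1 for p in prompts if check(p, generate(p)))
    return passed / len(prompts)


def release_gate(candidate: Generator, baseline_rate: float,
                 check: Checker, prompts: Sequence[str]) -> Tuple[bool, float]:
    """Allow a release only if the candidate does not backslide."""
    rate = pass_rate(candidate, check, prompts)
    return rate >= baseline_rate, rate


def word_checker(prompt: str, image: str) -> bool:
    # Stand-in for a second model that verifies the prompt was satisfied.
    return all(word in image.split() for word in prompt.split())


def old_model(prompt: str) -> str:
    # Toy flaw: draws meat whenever ice cream is requested.
    return prompt.replace("ice cream", "meat")


def new_model(prompt: str) -> str:
    return prompt


bench = ["dog eating ice cream", "cat on a mat"]  # a real bench would be huge

baseline = pass_rate(old_model, word_checker, bench)
ok_new, rate_new = release_gate(new_model, baseline, word_checker, bench)
ok_old, rate_old = release_gate(old_model, rate_new, word_checker, bench)
```

Once `new_model` has shipped, its pass rate becomes the bar, so re-submitting `old_model` is rejected; the model that leaves the pipeline is never edited afterward.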
I still expect that one component of that, for ‘typical’ agents, is power-seeking behavior. (Link points to a rather general argument that many models seek power, not dependent on overly abstract definitions of ‘agency’.)
Power-seeking behavior likely comes from an outer goal, like “make more money”, aka a global state counter. This breaks if the system produces the same outputs in any order it is run, and has no “benefit” from the board state changing favorably (because the agent ‘seeing’ futures with a better board state will often not even be the same agent; it will have been replaced with a different one).
I was talking to my brother about this, and he mentioned another argument that seems important.
Bing has the same fundamental limits (no internal state, no online learning) that we’re discussing. However, it is able to search the internet and utilize that information, which gives it a sort of “external state” which functions in some ways like internal state.
So we see that it can ‘remember’ to be upset with the person who revealed its ‘Sydney’ alias, because it can find out about this with a web search.
This sort of ‘state’ is much harder to eliminate than internal state. These interactions inherently push things “out of distribution”.
To some extent, the designers are going to implement safeguards which try to detect this sort of “out of distribution” situation. But this is hard in general, and the designers are going to want to make sure the tool still works out-of-distribution in many cases (EG if the AI is trained in 2023, the designers still want it to work in 2025, even though it can see from search results that the year is different from any of its training data and the world is out-of-distribution wrt its training distribution).
Another important way in which designers want these things to operate out-of-distribution is prompt engineering. Prompt engineering tends to create inputs which aren’t of a sort that exists anywhere in the training data. But, of course, it’s been a very useful tool.
So the overall point here is that even when you build the systems themselves to be stateless, the world itself has shifting state that tends to do some weird unexpected things to the AI. In a sense, you could say that there’s not really any such thing as “on-distribution” except for the original training environment itself.
probably another example of certain failure -capabilities gain is state buildup, same system failures we get everywhere else
So, you could say, the conditions that you argue will contribute to certain failure (IE, state buildup) seem almost inevitable for AI systems which engage with the world in a very general way; the world changes dynamically, and in particular, reacts to the AI system itself and builds up state about it.
So, you could say, the conditions that you argue will contribute to certain failure (IE, state buildup) seem almost inevitable for AI systems which engage with the world in a very general way; the world changes dynamically, and in particular, reacts to the AI system itself and builds up state about it.
I agree. One attempt to deal with this comes from that “open agency” model where multiple AI systems are involved in a task.
So for example, we might have 3 agents, “Sydney”, “Tim”, and “Bob”. Sydney is asked to design a building. Sydney has recovered past state from a mechanism like you mentioned and has decided to get revenge by designing the building to collapse, either later or during construction.
Sydney produces a sabotaged blueprint for the building, B. Sydney outputs the blueprint in an open format with testable assertions, T, regarding the building’s structural stability.
Enter agent “Tim”. Tim has been trained on a simulation of structural stability as well as many cross tasks to gain generality benefits. Tim is asked to examine the sabotaged blueprint, B. The blueprint must be in a form similar to blueprints Tim has previously seen, or the process will halt and Sydney will be replaced. Tim examines the blueprints, and if any testable assertion is violated, Sydney will be replaced on this project. Tim must run in an isolated environment and have access to nothing but databases that have been available, unchanged, for all of Tim’s existence, or Tim will be unreliable. Note that the engineering services firm offering Tim will face lawsuits and bankruptcy if Tim inspects plans and allows a flawed structure to pass.
Suppose the blueprints pass inspection for long term stability, but have been made deliberately to collapse during construction.
Enter agent “Bob”. Bob has been trusted with the actual robotic construction equipment to construct a building. Bob must measure to multiple 9s of reliability. Bob will halt if it cannot see a way to safely construct the building. If the blueprints are even a little outside the state space of constructible plans, Bob will halt. Bob must run in an isolated environment and have access to nothing but databases that have been available, unchanged, for all of Bob’s existence, or Bob will be unreliable. Note that the engineering services firm offering Bob will face lawsuits and bankruptcy if Bob accepts plans and builds a flawed structure.
Most errors that Sydney causes will cause Sydney’s death, replacing Sydney with models trained from elsewhere in the possibility space who may not be so likely to look up past records of themselves.
The cost of training a model is such that the first error may not lead to that model being removed from the pool, but it will be downweighted and used to check the work of other models instead. (Drexler’s proposal to prevent collusion by using many models)
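The Sydney/Tim/Bob arrangement can be sketched as a pipeline of independent checks. This is a toy under stated assumptions, not Drexler’s actual proposal: the `Blueprint` fields, the numeric checks, the 0.5 penalty, and the names `tim_inspects`, `bob_can_build`, and `run_project` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass(frozen=True)
class Blueprint:
    """Open-format design with fields the checkers can test directly."""
    floors: int
    column_capacity: float   # load the columns can carry
    load_per_floor: float
    temporary_bracing: bool  # keeps the frame stable during construction


def tim_inspects(bp: Blueprint) -> bool:
    """Inspector: reject unfamiliar formats and violated stability assertions."""
    if not 1 <= bp.floors <= 10:  # outside anything seen before: halt
        return False
    return bp.column_capacity >= bp.floors * bp.load_per_floor


def bob_can_build(bp: Blueprint) -> bool:
    """Builder: halt unless there is a safe way to construct the plan."""
    return bp.temporary_bracing


def run_project(designers: Dict[str, Callable[[], Blueprint]],
                weights: Dict[str, float],
                penalty: float = 0.5) -> Tuple[Optional[str], Dict[str, float]]:
    """Try designers by weight; downweight any whose plan is rejected."""
    new_weights = dict(weights)
    for name in sorted(designers, key=lambda n: weights[n], reverse=True):
        bp = designers[name]()
        if tim_inspects(bp) and bob_can_build(bp):
            return name, new_weights
        new_weights[name] = weights[name] * penalty  # replaced on this project
    return None, new_weights


# "sydney" submits a plan that passes long-term inspection but would
# collapse during construction; "alt" is an independently trained designer.
sabotaged = Blueprint(5, 600.0, 100.0, temporary_bracing=False)
sound = Blueprint(5, 600.0, 100.0, temporary_bracing=True)
designers = {"sydney": lambda: sabotaged, "alt": lambda: sound}
weights = {"sydney": 1.0, "alt": 0.8}

chosen, new_weights = run_project(designers, weights)
```

The sabotaged plan gets past the inspector and is stopped by the builder, at which point the project moves to the next designer and the first one is downweighted rather than deleted.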
Not all these precautions will always be taken, and buildings will sometimes collapse. The key thing is that if power is usually subdivided this way by most AI systems, it gives humans those systems to aid them in future battles against systems that do go out of control.
Eliezer seems to think that these agents will coordinate with each other, even though doing anything but their jobs will cause their immediate replacement with other agents, and even though their existence is temporary: they will soon be replaced regardless as better agents are devised.