I’m interested in hearing more about why you think agentic AIs with global state counters are unsafe, but other proposals are safe.
Because of all the ways they might try to satisfy the counter and leave the bounds of anything we tested.
With the other proposals, safety is empirical.
You know that for the input latent space from the training set, the policy produces outputs accurate to whatever level it needs to be. Further capabilities gain is not allowed on-line. (Probably another example of certain failure: capabilities gain is state buildup, the same system failure we see everywhere else. Human engineers understand the dangers of state buildup, at least the elite ones do, which is why they avoid it in high-reliability systems. The elite ones know it is as dangerous to reliability as a hydrogen bomb.)
You know the simulation produces situations that cover the span of input situations you have measured. (For example, you remix different scenarios from videos and lidar data taken from autonomous cars, spanning the entire observation space of your data.)
You measure the simulation on-line and validate it against reality. (for example by running it in lockstep in prototype autonomous cars)
After all this, you still need to validate the actual model in the real world, in real test cars. (Though the real training and error detection happened in sim; this is just a ‘sanity check’.)
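As a rough illustration of what the lockstep validation step might look like, here is a minimal sketch, with hypothetical names and interfaces rather than any real autonomy stack: replay logged real-world frames through the frozen policy and flag every time-step where its output diverges from what was actually driven.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Frame:
    """One logged time-step: preprocessed sensor features plus the control the real car applied."""
    observation: Tuple[float, ...]
    logged_control: Tuple[float, float]  # (steering, throttle) recorded from the real vehicle


def lockstep_validate(
    policy: Callable[[Tuple[float, ...]], Tuple[float, float]],  # hypothetical frozen policy
    frames: Iterable[Frame],
    steering_tol: float = 0.05,
    throttle_tol: float = 0.05,
) -> List[int]:
    """Replay logged real-world frames through the frozen policy and return the
    indices of frames where its output diverges from what was actually driven.
    The policy sees exactly the observations the real car saw, so any divergence
    points at the policy (or the sim it was trained in), not at replay noise."""
    divergent = []
    for i, frame in enumerate(frames):
        steering, throttle = policy(frame.observation)
        if (abs(steering - frame.logged_control[0]) > steering_tol
                or abs(throttle - frame.logged_control[1]) > throttle_tol):
            divergent.append(i)
    return divergent
```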
You have to do all this in order to get to real world reliability—something Eliezer does acknowledge. Multiple 9s of reliability will not happen from sloppy work. Whether you skipped steps is measurable, and if you ship anyway (like Siemens shipping industrial equipment with bad wiring), you face reputational risk, real-world failure, lawsuits, and certain bankruptcy.
Regarding on-line learning: I had this debate with Geohot. He thought it would work; I thought it was horrifically unreliable. Currently, all shipping autonomous driving systems, including Comma.ai's, use offline training.
I think I mostly buy your argument that production systems will continue to avoid state-buildup to a greater degree than I was imagining. Like, 75% buy, not like 95% buy—I still think that the lure of personal assistants who remember previous conversations in order to react appropriately—as one example—could make state buildup sufficiently appealing to overcome the factors you mention. But I think that, looking around at the world, it’s pretty clear that I should update toward your view here.
After all: one of the first big restrictions they added to Bing (Sydney) was to limit conversation length.
You have to do all this in order to get to real world reliability
I also think there are a lot of applications where designers don’t want reliability, exactly. The obvious example is AI art. And similarly, chatbots for entertainment (unlike Bing/Bard). So I would guess that the forces pushing toward stateless designs would be less strong in these cases (although there are still some factors pushing in that direction).
I also agree with the idea that stateless or minimal-state systems make safety into a more empirical matter. I still have a general anticipation that this isn’t enough, but OTOH I haven’t thought very much in a stateless frame, because of my earlier arguments that stateful stuff is needed for full-capability AGI.[1]
I still expect other agency-associated properties to be built up to a significant degree (like how ChatGPT is much more agentic than GPT-3), both on purpose and incidentally/accidentally.[2]
I still expect that the overall impact of agents can be projected by anticipating that the world is pushed in directions based on what the agent optimizes for.
I still expect that one component of that, for ‘typical’ agents, is power-seeking behavior. (Link points to a rather general argument that many models seek power, not dependent on overly abstract definitions of ‘agency’.)
I could spell out those arguments in a lot more detail, but in the end it’s not a compelling counter-argument to your points. I hesitate to call a stateless system AGI, since it is missing core human competencies; not just memory but other core competencies which build on memory. But, fine; if I insisted on using that language, your argument would simply be that engineers won’t try to create AGI by that definition.
See this post for some reasons I expect increased agency as an incidental consequence of improvements, and especially the discussion in this comment. And this post and this comment.
I still think that the lure of personal assistants who remember previous conversations in order to react appropriately
This is possible. When you open a new session, the task context includes the prior text log. However, the AI has not had weight adjustments directly from this one session, and there is no “global” counter that it increments for every “satisfied user” or some other heuristic. It’s not necessarily even the same model: all the context required to continue a session has to be in that “context” data structure, which must be entirely human-readable, and other models can load the same context and do intelligent things to continue serving the user.
This is similar to how Google services are made of many stateless microservices, but they do handle user data which can be large.
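A minimal sketch of the kind of session handling being described, assuming a hypothetical `model.generate` interface: all state lives in a human-readable context blob that any compatible model instance can load and continue.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SessionContext:
    """All state needed to continue a conversation, kept outside the model.

    The structure is plain JSON-serializable data, so it is human-auditable
    and any compatible model instance can pick the session up."""
    user_id: str
    turns: List[Dict[str, str]] = field(default_factory=list)  # [{"role": ..., "text": ...}]

    def to_json(self) -> str:
        return json.dumps({"user_id": self.user_id, "turns": self.turns})

    @classmethod
    def from_json(cls, blob: str) -> "SessionContext":
        data = json.loads(blob)
        return cls(user_id=data["user_id"], turns=data["turns"])


def serve_turn(model, context: SessionContext, user_message: str) -> str:
    """Stateless request handler: the model sees only the passed-in context.

    No weights are updated and no counter is incremented; the only thing that
    changes is the context blob handed back to the caller."""
    context.turns.append({"role": "user", "text": user_message})
    reply = model.generate(context.turns)  # hypothetical model interface
    context.turns.append({"role": "assistant", "text": reply})
    return reply
```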
I also think there are a lot of applications where designers don’t want reliability, exactly. The obvious example is AI art.
There are reliability metrics here also. Even for AI art there are checkable truths: is the dog eating ice cream (the prompt) or meat? Once you converge on an improvement to reliability, you don’t want to backslide. So you need a test bench, where one model generates images and another model checks them for correctness in satisfying the prompt, and that test bench needs to be very large. And once you get it to work, you do not want the model that leaves the CI pipeline to receive any edits: no on-line learning, no ‘state’ that causes it to process prompts differently.
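A minimal sketch of that kind of test bench, with the generator, checker model, and thresholds all as hypothetical stand-ins: generate an image per prompt, have a second model score the prompt match, and gate shipping on the aggregate pass rate.

```python
from typing import Callable, List, Tuple


def run_image_test_bench(
    generator: Callable[[str], bytes],        # prompt -> image bytes (model under test)
    checker: Callable[[str, bytes], float],   # (prompt, image) -> agreement score in [0, 1]
    prompts: List[str],
    pass_threshold: float = 0.9,
    required_pass_rate: float = 0.99,
) -> Tuple[bool, float]:
    """Regression gate for a frozen image model.

    A second model scores each generated image against its prompt; the
    candidate only ships if the pass rate meets the bar, and the same suite
    is re-run on every change so reliability never silently backslides."""
    passes = 0
    for prompt in prompts:
        image = generator(prompt)
        if checker(prompt, image) >= pass_threshold:
            passes += 1
    pass_rate = passes / len(prompts)
    return pass_rate >= required_pass_rate, pass_rate
```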
It’s the same argument. Production software systems from the giants have all converged on this because it is correct. The “janky” software you are familiar with usually belongs to poor companies, and I don’t think that is a coincidence.
I still expect that one component of that, for ‘typical’ agents, is power-seeking behavior. (Link points to a rather general argument that many models seek power, not dependent on overly abstract definitions of ‘agency’.)
Power-seeking behavior likely comes from an outer goal, like “make more money”, i.e., a global state counter. If the system produces the same outputs no matter what order it is run in, and gets no “benefit” from the board state changing favorably (because it will often not even be the agent that ‘sees’ futures with a better board state; it will have been replaced with a different agent by then), this breaks.
I was talking to my brother about this, and he mentioned another argument that seems important.
Bing has the same fundamental limits (no internal state, no online learning) that we’re discussing. However, it is able to search the internet and utilize that information, which gives it a sort of “external state” which functions in some ways like internal state.
So we see that it can ‘remember’ to be upset with the person who revealed its ‘Sydney’ alias, because it can find out about this with a web search.
This sort of ‘state’ is much harder to eliminate than internal state. These interactions inherently push things “out of distribution”.
To some extent, the designers are going to implement safeguards which try to detect this sort of “out of distribution” situation. But this is hard in general, and the designers are going to want to make sure the tool still works out-of-distribution in many cases (EG if the AI is trained in 2023, the designers still want it to work in 2025, even though it can see from search results that the year is different from any of its training data and the world is out-of-distribution wrt its training distribution).
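As one crude example of such a safeguard (purely illustrative; real systems would need something far more discriminating), an embedding-distance check against the training distribution:

```python
import numpy as np


def fit_ood_detector(train_embeddings: np.ndarray, quantile: float = 0.999):
    """Record the centroid of the training embeddings and the distance below
    which nearly all training points fall; anything much farther out is suspect."""
    centroid = train_embeddings.mean(axis=0)
    distances = np.linalg.norm(train_embeddings - centroid, axis=1)
    return centroid, float(np.quantile(distances, quantile))


def is_out_of_distribution(embedding: np.ndarray, centroid: np.ndarray, threshold: float) -> bool:
    """Flag an input whose embedding is farther from the training centroid than
    nearly all training data was; flagged inputs get routed to a fallback path."""
    return float(np.linalg.norm(embedding - centroid)) > threshold
```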
Another important way in which designers want these things to operate out-of-distribution is prompt engineering. Prompt engineering tends to create inputs which aren’t of a sort that exists anywhere in the training data. But, of course, it’s been a very useful tool.
So the overall point here is that even when you build the systems themselves to be stateless, the world itself has shifting state that tends to do some weird unexpected things to the AI. In a sense, you could say that there’s not really any such thing as “on-distribution” except for the original training environment itself.
Probably another example of certain failure: capabilities gain is state buildup, the same system failure we see everywhere else
So, you could say, the conditions that you argue will contribute to certain failure (IE, state buildup) seem almost inevitable for AI systems which engage with the world in a very general way; the world changes dynamically, and in particular, reacts to the AI system itself and builds up state about it.
So, you could say, the conditions that you argue will contribute to certain failure (IE, state buildup) seem almost inevitable for AI systems which engage with the world in a very general way; the world changes dynamically, and in particular, reacts to the AI system itself and builds up state about it.
I agree. One attempt to deal with this comes from that “open agency” model where multiple AI systems are involved in a task.
So for example, we might have 3 agents: “Sydney”, “Tim”, and “Bob”. Sydney is asked to design a building. Sydney has recovered past state through a mechanism like the one you mentioned and has decided to get revenge by designing the building to collapse, either later or during construction.
Sydney produces a sabotaged blueprint for the building, B. Sydney outputs the design in an open format with testable assertions, T, regarding the building’s structural stability.
Enter agent “Tim”. Tim has been trained on a simulation of structural stability, as well as many cross-domain tasks to gain generality benefits. Tim is asked to examine the sabotaged blueprint, B. The sabotaged blueprint must be in a form similar to blueprints Tim has previously seen, or the process will halt and Sydney will be replaced. Tim examines the blueprints, and if there are testable assertion violations, Sydney will be replaced on this project. Tim must run in an isolated environment and have access to nothing but databases that have been available, unchanged, for all of Tim’s existence, or Tim will be unreliable. Note that the engineering services firm offering Tim will face lawsuits and bankruptcy if Tim inspects plans and allows a flawed structure to pass.
Suppose the blueprints pass inspection for long-term stability, but have been deliberately designed to collapse during construction.
Enter agent “Bob”. “Bob” has been trusted with the actual robotic construction equipment to construct a building. “Bob” must measure to multiple 9s of reliability. “Bob” will halt if it cannot see a way to safely construct the building. If the blueprints are even a little outside the state space of constructible plans, Bob will halt. Bob must run in an isolated environment and have access to nothing but databases that have been available, unchanged, for all of Bob’s existence, or Bob will be unreliable. Note that the engineering services firm offering Bob will face lawsuits and bankruptcy if Bob proceeds to build a flawed structure.
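A minimal sketch of that pipeline, with the three roles as hypothetical callables mirroring the Sydney/Tim/Bob example above (not any existing system): each hand-off is a point where the process can halt and the offending model can be swapped out.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Blueprint:
    """Design artifact in an open, machine-checkable format."""
    geometry: dict           # structural layout (illustrative placeholder)
    assertions: List[str]    # testable claims about stability, e.g. load limits


def open_agency_build(
    designer: Callable[[str], Blueprint],               # "Sydney"
    verifier: Callable[[Blueprint], bool],               # "Tim": isolated, frozen reference data
    constructor: Callable[[Blueprint], Optional[str]],   # "Bob": returns a plan or None to halt
    task: str,
) -> Optional[str]:
    """No single agent both designs and builds; every stage can halt the job."""
    blueprint = designer(task)

    # Tim independently re-checks the testable assertions; any violation halts
    # the job and flags the designer for replacement.
    if not verifier(blueprint):
        return None  # designer replaced, task restarted with a different model

    # Bob only proceeds if it can see a safe way to construct the plan.
    plan = constructor(blueprint)
    if plan is None:
        return None  # constructor halted; blueprint sent back for review
    return plan
```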
Most errors that Sydney causes will result in Sydney’s death: replacement with models trained from elsewhere in the possibility space, which may not be so likely to look up past records of themselves.
The cost of training a model is such that the first error may not lead to that model being removed from the pool, but it will be downweighted and used to check the work of other models instead. (This is Drexler’s proposal to prevent collusion by using many models.)
Not all these precautions will always be taken, and buildings will sometimes collapse. The key thing is that if most AI systems have their power subdivided this way, it leaves humans with those systems to aid them in future battles against systems that do go out of control.
Eliezer seems to think that these agents will coordinate with each other, even though doing anything but their jobs will cause their immediate replacement with other agents, and even though their existence is temporary: they will soon be replaced regardless as better agents are devised.