I was talking to my brother about this, and he mentioned another argument that seems important.
Bing has the same fundamental limits (no internal state, no online learning) that we’re discussing. However, it is able to search the internet and utilize that information, which gives it a sort of “external state” which functions in some ways like internal state.
So we see that it can ‘remember’ to be upset with the person who revealed its ‘Sydney’ alias, because it can find out about this with a web search.
This sort of ‘state’ is much harder to eliminate than internal state. These interactions inherently push things “out of distribution”.
To some extent, the designers are going to implement safeguards which try to detect this sort of “out of distribution” situation. But this is hard in general, and the designers are going to want to make sure the tool still works out-of-distribution in many cases (EG if the AI is trained in 2023, the designers still want it to work in 2025, even though it can see from search results that the year is different from any of its training data and the world is out-of-distribution wrt its training distribution).
Another important way in which designers want these things to operate out-of-distribution is prompt engineering. Prompt engineering tends to create inputs which aren’t of a sort that exists anywhere in the training data. But, of course, it’s been a very useful tool.
So the overall point here is that even when you build the systems themselves to be stateless, the world itself has shifting state that tends to do some weird unexpected things to the AI. In a sense, you could say that there’s not really any such thing as “on-distribution” except for the original training environment itself.
probably another example of certain failure -capabilities gain is state buildup, same system failures we get everywhere else
So, you could say, the conditions that you argue will contribute to certain failure (IE, state buildup) seem almost inevitable for AI systems which engage with the world in a very general way; the world changes dynamically, and in particular, reacts to the AI system itself and builds up state about it.
So, you could say, the conditions that you argue will contribute to certain failure (IE, state buildup) seem almost inevitable for AI systems which engage with the world in a very general way; the world changes dynamically, and in particular, reacts to the AI system itself and builds up state about it.
I agree. One attempt to deal with this comes from that “open agency” model where multiple AI systems are involved in a task.
So for example, we might have 3 agents, “Sydney”, “Tim”, and “Bob”. Sydney is asked to design a building. Sydney has recovered past state from a mechanism like you mentioned and has decided to get revenge by designing the building to collapse, either later or during construction.
Sydney produced sabotaged blueprint for the building, B. Sydney outputs the building in an open format with testable assertions, T, regarding the building’s structural stability.
Enter agent “Tim”. Tim has been trained on a simulation of structural stability as well as many cross tasks to gain generality benefits. Tim is asked to examine the sabotaged blueprint, B. The sabotaged blueprint must be in a form similar to blueprints Tim has previously seen or the process will halt and Sydney will be replaced. Tim examines the blueprints, and if there are testable assertion violations, Sydney will be replaced on this project. Tim must run in an isolated environment and have access to nothing but databases that were available all of Tim’s existence without change, or Tim will be unreliable. Note that the engineering services firm offering Tim will face lawsuits and bankruptcy if Tim inspects plans and allows to pass a flawed structure.
Suppose the blueprints pass inspection for long term stability, but have been made deliberately to collapse during construction.
Enter agent “Bob”. “Bob” has been trusted with the actual robotic construction equipment to construct a building. “Bob” must measure to multiple 9s of reliability. “Bob” will halt if it cannot see a way to safely construct the building. If the blueprints are even a little outside the state space of constructible plans, Bob will halt. Bob must run in an isolated environment and have access to nothing but databases that were available all of Tim’s existence without change, or Bob will be unreliable. Note that the engineering services firm offering Tim will face lawsuits and bankruptcy if Tim inspects plans and allows to pass a flawed structure.
Most errors that Sydney causes will cause Sydney’s death, replacing Sydney with models trained from elsewhere in the possibility space who may not be so likely to look up past records of themselves.
The cost of training a model is such that the first error may not lead to that model removed from the pool, but it will be downweighted and used to check the work of other models instead. (Drexler’s proposal to prevent collusion by using many models)
Not all these precautions will always be taken, and buildings will sometimes collapse. The key thing is that if power is usually subdivided this way by most AI systems, it gives humans those systems to aid them in future battles against systems that do go out of control.
Eliezer seems to think that these agents will coordinate with each other, even though failing to do anything but their jobs will cause their immediate replacement with other agents, and even though their existence is temporary, they will be soon replaced regardless as better agents are devised.
I was talking to my brother about this, and he mentioned another argument that seems important.
Bing has the same fundamental limits (no internal state, no online learning) that we’re discussing. However, it is able to search the internet and utilize that information, which gives it a sort of “external state” which functions in some ways like internal state.
So we see that it can ‘remember’ to be upset with the person who revealed its ‘Sydney’ alias, because it can find out about this with a web search.
This sort of ‘state’ is much harder to eliminate than internal state. These interactions inherently push things “out of distribution”.
To some extent, the designers are going to implement safeguards which try to detect this sort of “out of distribution” situation. But this is hard in general, and the designers are going to want to make sure the tool still works out-of-distribution in many cases (EG if the AI is trained in 2023, the designers still want it to work in 2025, even though it can see from search results that the year is different from any of its training data and the world is out-of-distribution wrt its training distribution).
Another important way in which designers want these things to operate out-of-distribution is prompt engineering. Prompt engineering tends to create inputs which aren’t of a sort that exists anywhere in the training data. But, of course, it’s been a very useful tool.
So the overall point here is that even when you build the systems themselves to be stateless, the world itself has shifting state that tends to do some weird unexpected things to the AI. In a sense, you could say that there’s not really any such thing as “on-distribution” except for the original training environment itself.
So, you could say, the conditions that you argue will contribute to certain failure (IE, state buildup) seem almost inevitable for AI systems which engage with the world in a very general way; the world changes dynamically, and in particular, reacts to the AI system itself and builds up state about it.
So, you could say, the conditions that you argue will contribute to certain failure (IE, state buildup) seem almost inevitable for AI systems which engage with the world in a very general way; the world changes dynamically, and in particular, reacts to the AI system itself and builds up state about it.
I agree. One attempt to deal with this comes from that “open agency” model where multiple AI systems are involved in a task.
So for example, we might have 3 agents, “Sydney”, “Tim”, and “Bob”. Sydney is asked to design a building. Sydney has recovered past state from a mechanism like you mentioned and has decided to get revenge by designing the building to collapse, either later or during construction.
Sydney produced sabotaged blueprint for the building, B. Sydney outputs the building in an open format with testable assertions, T, regarding the building’s structural stability.
Enter agent “Tim”. Tim has been trained on a simulation of structural stability as well as many cross tasks to gain generality benefits. Tim is asked to examine the sabotaged blueprint, B. The sabotaged blueprint must be in a form similar to blueprints Tim has previously seen or the process will halt and Sydney will be replaced. Tim examines the blueprints, and if there are testable assertion violations, Sydney will be replaced on this project. Tim must run in an isolated environment and have access to nothing but databases that were available all of Tim’s existence without change, or Tim will be unreliable. Note that the engineering services firm offering Tim will face lawsuits and bankruptcy if Tim inspects plans and allows to pass a flawed structure.
Suppose the blueprints pass inspection for long term stability, but have been made deliberately to collapse during construction.
Enter agent “Bob”. “Bob” has been trusted with the actual robotic construction equipment to construct a building. “Bob” must measure to multiple 9s of reliability. “Bob” will halt if it cannot see a way to safely construct the building. If the blueprints are even a little outside the state space of constructible plans, Bob will halt. Bob must run in an isolated environment and have access to nothing but databases that were available all of Tim’s existence without change, or Bob will be unreliable. Note that the engineering services firm offering Tim will face lawsuits and bankruptcy if Tim inspects plans and allows to pass a flawed structure.
Most errors that Sydney causes will cause Sydney’s death, replacing Sydney with models trained from elsewhere in the possibility space who may not be so likely to look up past records of themselves.
The cost of training a model is such that the first error may not lead to that model removed from the pool, but it will be downweighted and used to check the work of other models instead. (Drexler’s proposal to prevent collusion by using many models)
Not all these precautions will always be taken, and buildings will sometimes collapse. The key thing is that if power is usually subdivided this way by most AI systems, it gives humans those systems to aid them in future battles against systems that do go out of control.
Eliezer seems to think that these agents will coordinate with each other, even though failing to do anything but their jobs will cause their immediate replacement with other agents, and even though their existence is temporary, they will be soon replaced regardless as better agents are devised.