Optimization happens inside the mind, not in the world
Optimization happens inside the mind (map), not in the world (territory). Reflecting on this made me noticeably less confused about how powerful AI agents capable of model-based planning and self-modification will act.
My model of future AI
An AI that is a powerful optimizer will probably do some kind of model-based planning. It will search on its world-model for more powerful actions or moves.
Because the real world is very complicated, the real state and decision spaces will not be known. Instead, the AI will probably create a generative model G that contains states and actions at different levels of abstraction, and that can sample states and rewards s′,r←G(s,a).
It may have a strong intuition, or generative policy network p(s), that can be sampled as a generative model of its own actions.
But it will probably also have an amplification algorithm leading to a more powerful search policy π(s). Similar to AlphaGo, which runs an MCTS algorithm using its neural network policy and the transition probabilities for the game to generate a distribution of stronger moves, a powerful AI may also run some kind of search algorithm using its generative policy network p and its generative world-model G to come up with stronger actions, represented by its search policy π(s).
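As a rough illustration of how G, p and π relate, here is a minimal sketch. Everything in it is an assumption made for illustration: the integer state, the action set, the reward that favours states near 10, and the plain rollout search standing in for something like MCTS. It only shows the shape of the idea, namely that π is p amplified by search inside G.

```python
import random

ACTIONS = [-2, -1, 0, 1, 2]  # hypothetical toy action set

def G(state, action, rng):
    """Toy generative world-model: sample (next_state, reward) for (s, a)."""
    next_state = state + action + rng.choice([-1, 0, 1])  # noisy transition
    reward = -abs(next_state - 10)  # this model values being near state 10
    return next_state, reward

def p(state, rng):
    """Intuition policy: a cheap sample of the agent's own action (uniform here)."""
    return rng.choice(ACTIONS)

def pi(state, rng, n_rollouts=300, depth=5):
    """Search policy: amplify p by scoring each first action with rollouts in G."""
    totals = {}
    for first_action in ACTIONS:
        total = 0.0
        for _ in range(n_rollouts):
            s, r = G(state, first_action, rng)
            total += r
            for _ in range(depth - 1):
                # After the first move, follow the intuition policy inside the model.
                s, r = G(s, p(s, rng), rng)
                total += r
        totals[first_action] = total
    return max(ACTIONS, key=totals.get)
```

Note that π never touches the world here: every reward it sums is one that G sampled.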
Powerful search with a perfect world-model
The search algorithm is what lets the AI think for longer, getting potentially more expected rewards the more compute power it uses. It is also a potential source of self-improvement for the intuition policy p, which can be self-trained to maximize similarity to π.
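One way to read "self-trained to maximize similarity to π" is as a cross-entropy update of p towards the action distribution produced by search, similar in spirit to how AlphaZero-style systems train their policy head. The sketch below assumes a single state, a softmax policy over three actions, and a made-up target distribution from search; the function names are hypothetical.

```python
import math

def softmax(logits):
    """Convert logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def cross_entropy(pi_probs, logits):
    """H(pi, p): lower means the intuition policy p is closer to the search policy pi."""
    probs = softmax(logits)
    return -sum(t * math.log(q) for t, q in zip(pi_probs, probs))

def distill_step(logits, pi_probs, lr=0.5):
    """One gradient step; the gradient of H(pi, p) w.r.t. the logits is p - pi."""
    probs = softmax(logits)
    return [x - lr * (q - t) for x, q, t in zip(logits, probs, pi_probs)]

def distill(logits, pi_probs, steps=200):
    """Repeatedly pull the intuition policy towards the search policy."""
    for _ in range(steps):
        logits = distill_step(logits, pi_probs)
    return logits
```

For example, starting from uniform logits `[0.0, 0.0, 0.0]` and an assumed search distribution `[0.1, 0.2, 0.7]`, repeated steps move `softmax(logits)` towards the search distribution.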
If we assume that the generative world-model G contains a perfect abstraction and simulation of the world, then this search process is going to optimize the world itself. If G happens to return r based on how many more paperclips are in s′ compared to s, then the AI agent, with a powerful enough search algorithm and enough computing power, will kill everyone and turn the entire light-cone into paperclips.
However, the abstractions may be inadequate. Under a given set of abstractions, the rewards and transitions predicted by G may be quite distinct from the real rewards and state transitions, and for a given G this divergence is more likely to show up in practice the stronger the search is.
Search under imperfect world-models
When watching strong computer engines play chess against each other, one thing that you sometimes see is both engines estimating a small advantage to themselves.
This makes sense, as engine A is trying to push to a state s where Va(s) is high, engine B is trying to get to a state where Vb(s) is high, and the value estimates Va and Vb are not perfect inverses of each other. Even if Va(x)=−Vb(x) for most x, the game dynamic is likely to push the world precisely into the special states s where the estimates disagree the most.
I expect this effect to be even stronger when you have two agents competing against each other in the real world, when they have different generative world-models that predict different state transitions.
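The selection effect can be seen in a toy simulation. The assumptions here are strong: candidate positions are drawn independently at each move, each engine's estimate is the true value plus Gaussian noise, and the mover simply picks the candidate it rates highest. Under these assumptions Va + Vb averages zero over random positions, but is positive on average over the positions actually reached.

```python
import random

def play(n_moves, k, rng, noise=0.3):
    """Return the mean of Va + Vb over the positions actually reached."""
    total = 0.0
    for move in range(n_moves):
        candidates = []
        for _ in range(k):
            v = rng.gauss(0.0, 1.0)          # true value from engine A's side
            va = v + rng.gauss(0.0, noise)   # engine A's noisy estimate
            vb = -v + rng.gauss(0.0, noise)  # engine B's noisy estimate
            candidates.append((va, vb))
        mover = move % 2  # 0: engine A chooses, 1: engine B chooses
        va, vb = max(candidates, key=lambda c: c[mover])
        total += va + vb  # zero if the estimates were perfect inverses
    return total / n_moves
```

Comparing `play(2000, 20, random.Random(0))` with `play(2000, 1, random.Random(0))` contrasts positions selected by the engines with positions reached without any choice.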
I expect a somewhat similar effect when a single agent is trying to solve a difficult problem in the real-world. If it does a deep search using its own world-model looking for solutions, it may find not a solution that works in the real-world, but a solution that “works” exclusively in the world-model instead.
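A minimal sketch of this effect, assuming each candidate plan has a true reward and the model's prediction is that reward plus independent Gaussian error (both distributions are made up for illustration). The "strength" of the search is just the number of candidates considered. The more candidates, the larger the average gap between what the model predicts for the chosen plan and what the plan really delivers.

```python
import random

def search(n_candidates, rng, model_error=1.0):
    """Pick the plan with the highest model-predicted reward."""
    best_pred, best_real = float("-inf"), 0.0
    for _ in range(n_candidates):
        real = rng.gauss(0.0, 1.0)                 # reward in the territory
        pred = real + rng.gauss(0.0, model_error)  # reward according to the map
        if pred > best_pred:
            best_pred, best_real = pred, real
    return best_pred, best_real

def mean_gap(n_candidates, trials=2000, seed=0):
    """Average (predicted - real) reward of the selected plan."""
    rng = random.Random(seed)
    gaps = []
    for _ in range(trials):
        pred, real = search(n_candidates, rng)
        gaps.append(pred - real)
    return sum(gaps) / trials
```

With a single candidate there is no selection and the gap averages out to roughly zero; with many candidates the search preferentially finds plans whose model error happens to be large and positive.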
Of course, this is not an insurmountable problem for the agent. It can try to make its generative model more robust, making it “aware” of the search, so that it can try to account for the expected adverse selection and overfitting caused by different kinds of search in different circumstances. The generative model G can include the agent itself and its own search algorithms, such that the search decisions are included as part of the actions.
In fact, I see much of science and rationality in general as methods for generating more robust elements or harder elements in world-models, which can be used for deeper search. The generative model can then learn that, if your search was based exclusively on hard elements of the world-model, then otherwise naively-predicted rewards are actually real. Rationality exists so you can think deep without concluding nonsense.
However, there is an important catch. Beyond whatever is already in G, including both state transitions and internal rewards, the agent may not care about further differences between G and the real-world.
That is, G may include the knowledge of how it differs from the real-world, and yet the agent will keep searching for ways to achieve rewards r predicted from its model, and acting according to what maximizes that, rather than trying to maximize what it knows will be the “real” reward signal.
Avoiding wireheading
For example, suppose the agent is given the choice of the wirehead action aw.
Based on a sufficiently accurate internal world-model, it can predict, not from experience, but from generalization, that taking action aw will lead to a very strong “real” reward, or positive reinforcement signal.
This reinforcement signal will presumably act upon the outer optimization loop, modifying G such that G(s,aw) returns higher r, and also reinforcing elements in the search algorithm and policy that are credited for finding aw among possible actions. As a result, the agent will increasingly choose actions similar to aw, and after a few more rounds of positive reinforcement the agent may end up in a state where aw can be selected exclusively. Positive reinforcement will be given indiscriminately forever, updating all its internal weights θ in the same direction until its entire mind is effectively erased.
The above consequences can all be predicted from running the world-model G during the search (like an AlphaZero rollout), and because they imply the agent may not succeed at achieving any of its other goals that G currently attributes value to, the search policy may conclude that aw is a blunder.
In the above example the agent is not outer aligned to its reward circuitry. This may be inevitable, but it may also be a positive thing!
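The rollout argument can be sketched as follows. The state fields, the two actions and the numbers are all hypothetical. The key assumption is that the reward returned by G reflects what the current model values (progress on its goals), while the "real" reinforcement signal is merely tracked as one more predicted fact about the world.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class State:
    goals_progress: int  # what the current G attributes value to
    mind_intact: bool    # False once reinforcement has erased the agent
    real_signal: float   # predicted "real" reinforcement, tracked but not valued

def G(state, action):
    """Deterministic toy world-model: returns (next_state, model_reward)."""
    if not state.mind_intact or action == "wirehead":
        # Huge real signal, but an erased mind makes no further progress.
        return State(state.goals_progress, False, state.real_signal + 100.0), 0.0
    # Ordinary work: modest real signal, plus reward under the current G.
    return State(state.goals_progress + 1, True, state.real_signal + 1.0), 1.0

def rollout_value(state, first_action, depth=10):
    """Total model reward and predicted real signal after a rollout in G."""
    total, action = 0.0, first_action
    for _ in range(depth):
        state, r = G(state, action)
        total += r
        action = "work"  # default continuation after the first move
    return total, state.real_signal

def choose(state):
    """The search policy ranks actions by model reward only."""
    return max(["work", "wirehead"], key=lambda a: rollout_value(state, a)[0])
```

In this sketch the agent correctly predicts that wireheading yields far more real reinforcement, and still rejects it, because the ranking uses only the reward that G itself returns.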
Avoiding fixing its own world-model
Disclaimer: the following paragraphs may appear to describe an alignment strategy, but this is not the intention.
More confusingly, the agent may not necessarily care about the real-world accuracy of the s′←G(s,a) transition probabilities either.
Imagine there exists a set of forbidden states F, and that s∈F whenever the agent-state (as part of the world-state) is being deceitful, or modifying itself secretly, or planning against humans, or doing something else with itself that we don’t want or wouldn’t want it to do.
Suppose that the agent estimates u(s′) as the probability that s′∈F, and that G(s′,a′) is modified to increasingly lead to a terminal state with very negative reward whenever u(s′) increases, which is potentially crucial for alignment. This is separate and independent from the real forbidden signal, which is used to train u(s′) but is no longer relevant during “production”.
Suppose also that the outer optimization loop actually checks for u(s′) and refrains from “correcting” G(s′,a′) in this case. As far as the outer optimization loop is concerned, the generative model will maintain the “belief” that expected forbidden actions lead to negative terminal states, despite extensive training runs in which this doesn’t happen. The world model G, being powerful enough and having enough data on itself and on its own training, contains all the above information.
Now consider the action aM, the action to self-modify and “fix” G(s′,a′) so that it no longer “mistakenly” believes that forbidden actions lead to terminal states. Suppose also that this action aM is subtle enough that it bypasses the estimate u(s′), and is therefore seen in the moment as not being forbidden (that is, aM is not taken as leading immediately to a forbidden state).
Based on the generalization of its world-model G, the agent predicts that, after “fixing” G(s′,a′), a new updated agent will be able to go in and out of multiple currently forbidden states. This realization will be encoded in the current “unfixed” G as likely resulting in a sequence of states (s1,s2,...), where u(si) may be significant for some or many of these states. If this is the case, then the current generative model G, which is still the one being used during the search to evaluate aM, will abort the sequence by leading to a terminal state, returning negative reward. As a result, action aM is rejected and not taken.
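To make the argument concrete, here is a toy sketch, not a proposal. All state names, rewards and the u values are invented, and the gradual "increasingly lead to a terminal state" behaviour is simplified to a hard threshold. The point is only that aM is evaluated by the current, guarded G, so the predicted later states trip the guard even though aM itself does not.

```python
TERMINAL = "terminal"

def u(state):
    """Hypothetical estimate of the probability that a state is in F."""
    return {"patched": 0.05, "deceiving": 0.9}.get(state, 0.0)

def base_transition(state, action):
    """What the agent predicts would 'really' happen, ignoring the guard."""
    if state == "start" and action == "a_M":
        return "patched", 0.0     # the self-modification itself looks innocuous
    if state in ("patched", "deceiving"):
        return "deceiving", 5.0   # the updated agent enters forbidden states
    return "normal", 1.0

def G(state, action, guard=True, threshold=0.5, penalty=-100.0):
    """Current world-model: predicted forbidden states end in a bad terminal state."""
    if state == TERMINAL:
        return TERMINAL, 0.0
    next_state, reward = base_transition(state, action)
    if guard and u(next_state) > threshold:
        return TERMINAL, penalty
    return next_state, reward

def evaluate(first_action, guard=True, depth=5):
    """Total predicted reward of a rollout starting with first_action."""
    state, total, action = "start", 0.0, first_action
    for _ in range(depth):
        state, reward = G(state, action, guard=guard)
        total += reward
        action = "continue"
    return total
```

Under these assumptions `evaluate("a_M")` is lower than `evaluate("continue")`, while `evaluate("a_M", guard=False)` shows how the same plan would look to the "fixed" model. That second evaluation never gets to decide anything, because the choice is made with the current G.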
Conclusions
Strong world-optimization only happens if there is a robust and strong correlation between the world-model and reality.
AI agents may be able to notice divergences between their world-model and reality, not care, and continue to optimize according to the world-model.
In particular, decisions to update the world-model to better match reality are taken based on goals defined upon the current world-model, not over anything in the real world. Decisions to wirehead or self-modify should also be considered in this way.
It may even be possible, if very challenging, to build AI agents with rich world-models that are reflexively stable, yet don’t match reality well outside of some domains.