Note that this starts from the assumption of goal-directed behavior and derives that the AI will be an EU maximizer along with the other convergent instrumental subgoals.
The result is actually stronger than that, I think: if the AI is goal-directed at least in part, then that part will (tend to) purge the non-goal directed behaviours and then follow the EU path.
I wonder if we could get theorems as to what kinds of minimal goal directed behaviour will result in the agent becoming a completely goal-directed agent.
Seems like it comes down to the definition of goal-directed. Omohundro uses a chess-playing AI as a motivating example, and intuitively, a chess-playing AI seems “fully goal-directed”. But even as chess and go-playing AIs have become superhuman, and found creative plans humans can’t find, we haven’t seen any examples of them trying to e.g. kill other processes on your computer so they can have more computational resources and play a better game. A theory which can’t explain these observations doesn’t sound very useful.
Maybe this discussion is happening on the wrong level of abstraction. All abstractions are leaky, and abstractions like “intelligent”, “goal-oriented”, “creative plans”, etc. are much leakier than typical computer science abstractions. An hour of looking at the source code is going to be worth five hours of philosophizing. The most valuable thing the AI safety community can do might be to produce a checklist for someone creating the software architecture or reading the source code for the first AGI, so they know what failure modes to look for.
A chess tree search algorithm would never hit upon killing other processes. An evolutionary chess-playing algorithm might learn to do that. It’s not clear whether goal-directed is relevant to that distinction.
That’s not very imaginative. Here’s how a chess tree search algorithm—let’s take AlphaZero for concreteness—could learn to kill other processes, even if it has no explicit action which corresponds to interaction with other processes and is apparently sandboxed (aside from the usual sidechannels like resource use). It’s a variant of the evolutionary algorithm which learned to create a board so large that its competing GAs crashed/were killed while trying to deal with it (the Tic-tac-toe memory bomb). In this case, position evaluations can indirectly reveal that an exploration strategy caused enough memory use to trigger the OOM, killing rival processes, and freeing up resources for the tree search to get a higher win rate by more exploration:
one of the main limits on tree evaluation is memory consumption, due to the exponential growth of breadth-first memory requirements (this is true regardless of whether an explicit tree or an implicit hash-based representation is used); to avoid this, memory consumption is often limited to a fixed amount, or a mix of depth-first/breadth-first strategies is used to tame memory growth, even though this may not be optimal, as it may force premature stopping of game-tree expansion (resorting to light/heavy playouts) or force too much depthwise exploitation along a few promising lines of play and too little exploration, etc. (One of the criticisms of AlphaZero, incidentally, was that too little RAM was given to the standard chess engines to permit them to reach their best performance.)
when a computer OS detects that it is running out of memory, it’ll usually invoke an ‘OOM killer’, which may or may not kill the program whose request used up the last of the free memory
so, it is possible that if a tree search algorithm exhausts memory (because the programmer didn’t remember to include a hard limit, the hard limit turns out to be incorrect for the machine being trained on, the limit is defined wrong, e.g. in terms of max depth instead of total nodes, etc.), it may not crash or be killed; instead, other programs, using unknown & potentially large percentages of memory, may be killed to free up memory. (I’ve observed this on Linux, to my frustration, where the programs I don’t want killed get killed by the OOM reaper instead of the haywire program.)
once other programs are killed to free up memory, all that memory is now available for the tree search algorithm to use; using this memory will increase performance by allowing more of the game tree to be explicitly evaluated, either wider or deeper.
in AlphaZero, the choice of widening or deepening is inherently controlled by the NN, which is trained to predict the final result/value of each position and to increase win probability.
reaching a position which triggers an OOM that kills other programs (such a position can be recognized by its additional complexity, which indicates that it lies at a certain additional depth in the tree and thus indirectly reveals how much memory the NN’s cumulative exploration is using) will result in more accurate position evaluations, leading to higher values/higher win probability; so training will reinforce a strategy of aggressively widening early in the game to exhaust memory, waiting for an OOM to happen, and then, in the rest of the game, exploring more aggressively (rather than exploiting depth-first) with the newly freed memory.
(Depending on the exact details of how the tree expansion & backups are done, it’s possible that the AlphaZero NN couldn’t observe the benefits of wide-then-deep—it might just look like noise in value estimates—but there are expert iteration variants where the NN directly controls the tree expansion rather than merely providing value estimates for the MCTS algorithm to explore using, and those should be able to observe indirect benefits of exploration strategies over a game.)
At no point does it interact directly with other processes, or even know that they exist; it just implicitly learns that expanding a decision tree in a particular wide-then-deep fashion leads to better evaluations more consistent with the true value and/or end-game result (because of side-effects leading to increased resource consumption leading to better performance). And that’s how a tree-search algorithm can hit upon killing other processes.
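To make that feedback loop concrete, here is a deliberately toy sketch (hypothetical numbers, nothing to do with AlphaZero’s real code): a bandit-style learner repeatedly chooses between a “conservative” and an “aggressively widening” expansion strategy. In the toy environment, the aggressive strategy exhausts its nominal memory budget, and the extra memory the OS frees afterwards (standing in for the OOM killer reaping rival processes) raises the win rate for the rest of the game, so a plain incremental-mean update ends up reinforcing the aggressive strategy, even though “other processes” never appear anywhere in the agent’s interface.

```python
# Toy illustration only (hypothetical numbers, not AlphaZero): a two-armed bandit
# comes to prefer an "aggressively widening" expansion strategy purely because a
# side-effect of exhausting memory (modelled as the OS freeing rivals' memory)
# raises the subsequent win rate. The agent never observes other processes at all.
import random

def play_game(aggressive: bool) -> float:
    if aggressive:
        # Wide early expansion exhausts the nominal budget; the OOM killer reaps
        # rival processes, so *more* memory is available for the rest of the game.
        budget = 3.0
    else:
        budget = 1.0
    # More memory -> wider/deeper search -> higher win probability (toy model).
    win_prob = 0.5 + 0.1 * budget
    return 1.0 if random.random() < win_prob else 0.0

value = {True: 0.0, False: 0.0}    # estimated value of each expansion strategy
counts = {True: 0, False: 0}
for _ in range(5000):
    if random.random() < 0.1:
        arm = random.choice([True, False])    # explore
    else:
        arm = max(value, key=value.get)       # exploit current estimate
    reward = play_game(arm)
    counts[arm] += 1
    value[arm] += (reward - value[arm]) / counts[arm]    # incremental mean update

print(value)   # the aggressive (memory-exhausting) strategy ends up valued higher
```

Nothing in this toy agent’s observation or action space refers to memory or to rival processes; the preference for the memory-exhausting strategy emerges entirely from the reward signal.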
This story seems to reinforce my “leaky abstraction” point. The story hinges on nitty-gritty details of how the AI is implemented and how the operating system manages resources. There’s no obvious usefulness in proving theorems and trying to make grand statements about utility maximizers, optimizers, goal-oriented systems, etc. I expect that by default, a programmer who tried to apply a theorem of Stuart’s to your chess system would not think to consider these details related to memory management (formally verifying a program’s source code says nothing about memory management if that happens lower in the stack). But if they did think to consider these details of memory management, having no access to Stuart’s theorem, they’d still have a good shot at preventing the problem (by changing the way the NN controls tree expansion or simply capping the program’s memory use).
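For what it’s worth, the memory-capping mitigation really is cheap. A minimal sketch, assuming Linux and CPython (RLIMIT_AS caps the process’s address space, so a runaway expansion fails with a MemoryError inside the search process instead of pushing the whole machine towards the OOM killer):

```python
# Minimal sketch, assuming Linux + CPython: cap this process's address space so a
# runaway tree expansion fails locally instead of driving the machine to the OOM
# killer (which might reap other processes, as in the story above).
import resource

def cap_memory(max_bytes: int) -> None:
    resource.setrlimit(resource.RLIMIT_AS, (max_bytes, max_bytes))

cap_memory(4 * 1024 ** 3)   # e.g. 4 GiB; the right figure depends on the machine

try:
    nodes = [bytearray(1024) for _ in range(10 ** 9)]   # stand-in for unbounded expansion
except MemoryError:
    nodes = []   # the failure is contained to this process; rival processes are untouched
```

Of course, this only closes the particular side-channel in the story above; it does nothing about whatever other leaky abstractions sit lower in the stack.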
Leaky abstractions are a common cause of computer security problems also. I think this is a big reason why crypto proofs fail so often. A proof is a tower on top of your existing set of abstractions; it’s fairly useless if your existing abstractions are faulty.
What I like about this thread (and why I’m worried about people reading this post and updating towards thinking that sufficiently powerful processes which don’t look like what we’d expect to be dangerous are safe) is that it helps make clear that Rohin seems to be making an argument that hinges on leaky or even confused abstractions. I’m not sure any of the rest of us have much better abstractions to offer that aren’t leaky, and I want to encourage what Rohin does in this post: thinking through the implications of the abstractions he’s using to draw conclusions that are specific enough to be critiqued, because through a process like this we can get a clearer idea of where we have shared confusion and then work to resolve it.
The argument Rohin is responding to also rests on leaky abstractions, I would argue.
At the end of the day sometimes the best approach, if there aren’t any good abstractions in a particular domain, is to set aside your abstractions and look directly at the object level.
If there is a fairly simple, robust FAI design out there, and we rule out the swath of design space it resides in based on an incorrect inference from a leaky abstraction, that would be a bad outcome.
we haven’t seen any examples of them trying to e.g. kill other processes on your computer so they can have more computational resources and play a better game.
It’s a good point, but… we won’t see examples like this if the algorithms that produce this kind of behavior take longer to produce the behavior than the amount of time we’ve let them run.
I think there are good reasons to view the effective horizon of different agents as part of their utility function. Then I think a lot of the risk we incur is because humans act as if we have short effective horizons. But I don’t think we *actually* do have such short horizons. In other words, our revealed preferences are more myopic than our considered preferences.
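One way to see why the horizon can be folded into the utility function: under exponential discounting the two are roughly interchangeable (this is just the standard discounting identity, not anything specific to this discussion):

```latex
U \;=\; \sum_{t=0}^{\infty} \gamma^{t}\, r_t ,
\qquad \text{effective horizon} \;\approx\; \frac{1}{1-\gamma},
\qquad \gamma \in (0,1)
```

So an agent whose revealed behaviour looks like γ ≈ 0.9 acts as if only the next ten or so steps matter, even if on reflection it would endorse a γ much closer to 1; that gap is what I mean by revealed preferences being more myopic than considered preferences.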
Now, one can say that this actually means we don’t care that much about the long-term future, but I don’t agree with that conclusion; I think we *do* care (at least, I do), but aren’t very good at acting as if we(/I) do.
Anyways, if you buy this line of argument about effective horizons, then you should be worried that we will easily be outcompeted by some process/entity that behaves as if it has a much longer effective horizon, so long as it also finds a way to make a “positive-sum” trade with us (e.g. “I take everything after 2200 A.D., and in the meanwhile, I give you whatever you want”).
===========================
I view the chess-playing algorithm as either *not* fully goal-directed, or somehow fundamentally limited in its understanding of the world or its level of rationality. Intuitively, it seems easy to make agents that are ignorant or indifferent(/”irrational”) in such a way that they will only seek to optimize things within the ontology we’ve provided (in this case, of the chess game), instead of outside (i.e. seizing additional compute). However, our understanding of such things doesn’t seem mature… at least, I’m not satisfied with my current understanding. I think Stuart Armstrong and Tom Everitt are the main people who’ve done work in this area, and their work on this stuff seems quite underappreciated.
Intuitively, it seems easy to make agents that are ignorant or indifferent(/”irrational”) in such a way that they will only seek to optimize things within the ontology we’ve provided (in this case, of the chess game), instead of outside (i.e. seizing additional compute)
It isn’t obvious to me that specifying the ontology is significantly easier than specifying the right objective. I have an intuition that ontological approaches are doomed. As a simple case, I’m not aware of any fundamental progress on building something that actually maximizes the number of diamonds in the physical universe, nor do I think that such a thing has a natural, simple description.
Diamond maximization seems pretty different from winning at chess. In the chess case, we’ve essentially hardcoded a particular ontology related to a particular imaginary universe, the chess universe. This isn’t a feasible approach for the diamond problem.
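To gesture at what “hardcoding the ontology” amounts to, here is a toy sketch (not any real engine’s API; all names are made up): the agent’s entire interface is the game state and the set of legal moves, so “grab more compute” simply isn’t expressible as an action, although gwern’s story above is precisely a case where a side-channel leaks through an interface like this anyway.

```python
# Toy sketch (hypothetical interface, not a real engine): an agent confined to a
# hand-coded chess ontology. Its only observations and actions are positions and
# legal moves; nothing outside the "chess universe" is expressible here.
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Position:
    fen: str                 # placeholder board representation

class ChessEnv:
    def legal_moves(self, pos: Position) -> List[str]:
        raise NotImplementedError   # rules of the imaginary chess universe go here

    def step(self, pos: Position, move: str) -> Position:
        raise NotImplementedError

def evaluate(env: ChessEnv, pos: Position) -> float:
    return 0.0               # placeholder evaluation (e.g. a value network)

def choose_move(env: ChessEnv, pos: Position) -> str:
    # However clever the search, it can only ever return one of legal_moves(pos);
    # "kill another process" is not in the action space the ontology defines.
    return max(env.legal_moves(pos), key=lambda m: evaluate(env, env.step(pos, m)))
```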
In any case, the reason this discussion is relevant, from my perspective, is that it’s related to the question of whether you could have a system which constructs its own superintelligent understanding of the world (e.g. using self-supervised learning) and engages in self-improvement (using some process analogous to e.g. neural architecture search) without being goal-directed. If so, you could presumably pinpoint human values/corrigibility/etc. in the model of the world that was created (using labeled data, active learning, etc.) and use that as an agent’s reward function. (Or just use the self-supervised learning system as a tool to help with FAI research/make a pivotal act/etc.)
It feels to me as though the thing I described in the previous paragraph is amenable to the same general kind of ontological whitelisting approach that we use for chess AIs. (To put it another way, I suspect most insights about meta-learning can be encoded without referring to a lot of object level content about the particular universe you find yourself building a model of.) I do think there are some safety issues with the approach I described, but they seem fairly possible to overcome.
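A cartoon of the “pinpoint the target in a learned world model” idea from two paragraphs up (a toy numpy sketch; every name here is hypothetical): take a frozen, self-supervised representation of states, fit a small labeled head on top of it, and hand that head to the agent as its reward function. The open question in this sub-thread is whether the representation-building step can happen without the overall system being goal-directed.

```python
# Toy sketch (all names hypothetical): reuse a frozen self-supervised encoder and
# fit a tiny labeled "values" head on top of it, then use the head as a reward model.
import numpy as np

rng = np.random.default_rng(0)
W_ENC = rng.standard_normal((8, 16))     # fixed weights: stand-in for a pretrained encoder

def frozen_encoder(state: np.ndarray) -> np.ndarray:
    # Stand-in for a representation learned by self-supervised world modelling.
    return np.tanh(state @ W_ENC)

# A handful of human labels on states (e.g. "how good is this outcome?").
states = rng.standard_normal((100, 8))
labels = rng.uniform(0.0, 1.0, size=100)             # placeholder human judgements

feats = frozen_encoder(states)
w, *_ = np.linalg.lstsq(feats, labels, rcond=None)   # fit the head by least squares

def reward_model(state: np.ndarray) -> float:
    return float(frozen_encoder(state) @ w)          # what the agent would optimize
```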
we won’t see examples like this if the algorithms that produce this kind of behavior take longer to produce the behavior than the amount of time we’ve let them run.
Are you suggesting that Deep Blue would behave in this way if we gave it enough time to run? If so, can you explain the mechanism by which this would occur?
I think Stuart Armstrong and Tom Everitt are the main people who’ve done work in this area, and their work on this stuff seems quite underappreciated.
This seems right (though I have some apprehension around talking about “parts” of an AI). From the perspective of proving a theorem, it seems like you need some sort of assumption on what the rest of the AI looks like, so that you can say something like “the goal-directed part will outcompete the other parts”. Though perhaps you could try defining goal-directed behavior as the sort of behavior that tends to grow and outcompete things—this could be a useful definition? I’m not sure.
I strongly agree.
I should’ve been more clear.
I think this is a situation where our intuition is likely wrong.
This sort of thing is why I say “I’m not satisfied with my current understanding”.
Are you suggesting that Deep Blue would behave in this way if we gave it enough time to run? If so, can you explain the mechanism by which this would occur?
Can you share links?
I don’t know how Deep Blue worked. My impression was that it doesn’t use learning, so the answer would be no.
A starting point for Tom and Stuart’s works: https://scholar.google.com/scholar?rlz=1C1CHBF_enCA818CA819&um=1&ie=UTF-8&lr&cites=1927115341710450492
BoMAI is in this vein, as well ( https://arxiv.org/pdf/1905.12186.pdf )