Imagine three levels, in order of increasing concern: (1) system does self-preserving action sometimes randomly, no more often than chance. (2) system does self-preserving action randomly, but once it sees the good consequences, starts doing it systematically. (3) system does self-preserving action systematically from the start, because it has foresight and motivation. Humans and problematic AIs are up at (3), a population of bacteria undergoing evolution is at (2), and a self-unaware oracle is at (1).
(2) system does self-preserving action randomly, but once it sees the good consequences, starts doing it systematically.
a population of bacteria undergoing evolution is at (2)
I think the concern with your proposal is that good consequences don’t seem like they need to be seen; they can also be the result of selection (namely, surviving). If there are a bunch of prediction projects, and some of them by chance are biased (or otherwise handled) in ways that increase their chances of “survival”**, then as time goes on, older[4] projects are more inclined in that way*, because the ones that weren’t inclined that way shut down at higher rates (and ones biased/handled in the other direction shut down at really high rates).***
*Whatever that way is.
**This need not be a long-lasting property, only a local one. That which was adaptive under earlier circumstances can be maladaptive when things change. This doesn’t seem like a problem until you consider the results of evolution—it takes a long time, but eventually it produces more general capabilities.
***This might have more to do with the way the project is handled than with the code, though—it depends on the difference in weight between factors, and it doesn’t incorporate the consequences of multiple equilibria. It’s a rough model, and a better model would be more explicit about how the oracle’s predictions affect its chances of surviving, but drawing attention to that factor was the point. (And just saying “what if they ask the oracle if running it is worth it” doesn’t address the possibility that they don’t, the way this argument does.)

[4] How new projects would work isn’t super clear—they could start over from scratch, or use the same code as someone else but feed in different information. They might also try to imitate older projects, propagating tendencies which tend to keep projects alive, which might involve biased oracles. This problem might be fixed by having survivorship track accuracy, but as things change, what led to accuracy in the past might lead to inaccuracy in the future.
Thanks for this helpful comment. The architecture I’m imagining is: Model-choosing code finds a good predictive world-model out of a vast but finite space of possible world-models, by running SGD on 100,000 years of YouTube videos (or whatever). So the model-chooser is explicitly an optimizer, the engineer who created the model-chooser is also explicitly an optimizer, and the eventual predictive world-model is an extremely complicated entity with superhuman world-modeling capabilities, and I am loath to say anything about what it is or what it’s going to do.
Out of these three, (1) the engineer is not problematic because it’s a human, (2) the model-chooser is not problematic because it’s (I assume and expect) a known and well-understood algorithm (e.g. Transformer), and thus (3) the eventual predictive world-model is the only thing we’re potentially worried about. My thought is that we can protect ourselves from the predictive world-model doing problematic consequentialist planning by scheming to give it no information whatsoever about how it can affect the world (not even the knowledge that it exists or what actions it is taking), such that if it has problematic optimization tendencies, it is unable to act on them.
(With regard to (1) more specifically: if a company is designing a camera, the cameras with properties that the engineers like are preferentially copied by the engineers into later versions. Yes, this is a form of optimization, but nobody worries about it more than anything else in life. Right?)
The concern is basically: Evolution* is well understood. And while it’s not a really good optimizer, if you run it too long**, it’s “not safe” (it made us, after all): it can get to GI (or AGI).
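For concreteness, here is a minimal sketch of the kind of selection-plus-mutation loop I mean (the bit-string genomes and the count-the-ones fitness function are made-up placeholders, not anything from your proposal):

```python
import random

def evolve(fitness, genome_len=32, pop_size=50, generations=200, mut_rate=0.02):
    """Minimal evolutionary loop: the optimizer itself is a few transparent lines,
    even though what it eventually produces can be arbitrarily complicated."""
    pop = [[random.randint(0, 1) for _ in range(genome_len)] for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: the fitter half "survives".
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]
        # Reproduction with mutation: copy each survivor, flipping bits at random.
        children = [[1 - b if random.random() < mut_rate else b for b in parent]
                    for parent in survivors]
        pop = survivors + children
    return max(pop, key=fitness)

# Placeholder objective: count the 1-bits. The worry is about what happens when the
# objective and the run time are big enough that the products stop being simple.
best = evolve(fitness=sum)
```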
a camera
I don’t think cameras think. They have a simple state which is determined by the settings (unless something’s broken), and they take images. While we could argue that they have a memory, 1) in the form of images and video, and 2) in the form of state, they don’t do things with that information. (While you might be able to use them for video editing, that’s largely a manual process, and is mostly performed on (bigger) computers designed for more general use.)
Comparatively, the point of Machine Learning is...machine learning. Today that’s learning to recognize things (like human faces) and do things (like video games). If things get to the point where I don’t drive a car, instead my car drives itself—that will be pretty big.
a known and well-understood algorithm
And the thing about some of this new stuff, like neural nets, is that we don’t understand them. They’re not hardcoded. They learn. And the more difficult/complex the task is, the harder they can be to understand. Currently, I’m more worried about a system we don’t understand being put in an important role and failing unexpectedly, or about people using these tools for sinister ends (I’m not the biggest fan of facial recognition tech), than about a super-intelligent AGI. AlphaStar didn’t win by hacking its opponent’s computer or causing a seizure.
*Or evolutionary algorithms
**What you’re running it on (the problem you’re trying to solve, and how complex it’s allowed to get) might also matter.
Just to be clear, when OpenAI trained GPT-2, I am not saying that GPT-2 is a known and well-understood algorithm for generating text, but rather that SGD (stochastic gradient descent) is a known and well-understood algorithm for generating GPT-2. (I mean, OK, sure, ML researchers are still studying SGD, but its inner workings are not an impenetrable mystery the way GPT-2’s are.)
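To illustrate the asymmetry, here is a toy sketch (not OpenAI’s actual training code; the linear model and dataset below are made up): the optimizer is a few transparent lines, while the thing it outputs is just a table of numbers with no explanation attached.

```python
import random

def sgd(data, dim, lr=0.01, steps=10_000):
    """Plain stochastic gradient descent for a linear model with squared loss.
    The procedure is simple and well understood; the weights it returns are not
    self-explaining (for a large neural net, billions of opaque numbers)."""
    w = [0.0] * dim
    for _ in range(steps):
        x, y = random.choice(data)                      # sample one (input, target) pair
        pred = sum(wi * xi for wi, xi in zip(w, x))     # model's guess
        err = pred - y                                  # gradient of 0.5*(pred - y)^2 w.r.t. pred
        w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w

# Toy data generated by y = 2*x0 - 3*x1 + 1 (the last feature is a constant bias term).
data = [((x0, x1, 1.0), 2 * x0 - 3 * x1 + 1) for x0 in range(-5, 6) for x1 in range(-5, 6)]
weights = sgd(data, dim=3)   # nothing in the loop above tells you what these numbers "mean"
```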
Why should we expect, though, that a self-unaware oracle which only performs self-preserving actions at random wouldn’t end up incidentally optimizing for them? I’m willing to believe it may be possible to construct a system with optimizing pressures weak enough that it couldn’t end up incidentally optimizing for other things that are instrumentally useful (and unsafe), and I have some vague ideas about how that might happen, but it’s unclear to me from what you’ve presented so far why I should expect a self-unaware oracle to be such a system.
I’m not necessarily asking for a mathematically rigorous proof, but I also don’t see a reasonable story that would lead me to conclude that, so I fall back on my prior assumption that optimizing systems are at risk of developing secondary behaviors because those behaviors are useful for optimizing the primary target.
I’m not sure what you have in mind here; to me, optimization requires some causal pathway from “Action X has consequence Y” to “Take Action X more often than chance”.
A system can optimize if it has a way to store specially-flagged information in the form of “I took action X, and it had consequence Y” (or “if I take action X, it will have consequence Y”), and then bring that flagged information to bear when taking actions. A population of bacteria can do this! Evolution flags its “decisions” (mutations), storing that information in DNA, and then “consults” the DNA when “deciding” what the gene distribution will be in the next generation. A self-unaware system, lacking any “I” or “my decision” or “my action” flags in either its internal or external universe, would be missing the causal links necessary to optimize anything. Right?
But if you build something that can’t optimize, that’s not really AI or an oracle; that’s just regular software that doesn’t learn. I guess an expert system, for example, is functionally kind of like an oracle, and it would meet your requirement of self-unawareness, but it also seems pretty uninteresting from a capabilities standpoint, since any learning it does happens offline and only via external reprogramming of the algorithm (and then you just pass the buck to whatever external thing is doing the reprogramming, be it a human or another AI).
To me this is sort of like saying “hey, look, we can make provably correct software, just don’t put any loops that might be unbounded in it”. Sure, that works, but it also restricts the class of what you can achieve so much that people have generally chosen not to trade off the additional capabilities for correctness. I think we’re looking at something similar here if your notion of self-unawareness also means no optimization and no learning.
A self-unaware system would not be capable of one particular type of optimization task:
Take real-world actions (“write bit 0 into register 11”) on the basis of anticipating their real-world consequences (human will read this bit and then do such-and-such).
This is an example of an optimization task, and it’s a very dangerous one. Maybe it’s even the only type of really dangerous optimization task! (This might be an overstatement, not sure.) Not all optimization tasks are in this category, and a system can be intelligent by doing other types of optimization tasks.
A self-unaware system certainly is an optimizer in the sense that it does other (non-real-world) optimization tasks, in particular, finding the string of bits that would be most likely to follow a different string of bits on a real-world webpage.
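(For concreteness, here is a toy version of that kind of non-real-world optimization task; the little frequency-count model and made-up corpus are just stand-ins for a vastly better learned predictor.)

```python
from collections import Counter

def best_next_bit(context, corpus, order=3):
    """Toy 'which bit most likely comes next?' oracle: count how often each candidate
    bit followed the last `order` bits in the corpus and return the likelier one.
    The search is entirely over strings, not over real-world consequences."""
    counts = Counter(corpus[i:i + order + 1] for i in range(len(corpus) - order))
    prefix = context[-order:]
    return max("01", key=lambda b: counts[prefix + b])

corpus = "0110110110110110"           # made-up training text
print(best_next_bit("0110", corpus))  # -> '1', because '110' is followed by '1' in this corpus
```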
As always, sorry if I’m misunderstanding you, thanks for your patience :-)
This seems to me to be missing the point made well by “Embedded Agency” and exemplified by the anvil problem: you can’t in practice build a system that achieves this kind of thing, because there is no real separation between inside the system and outside the system, just a model which assumes such a distinction exists.
Thanks for your patience, I think this is important and helpful to talk through (hope it’s as helpful for you as for me!)
Let me introduce two sets of terminology I made up. First, the thing I mentioned above:
Non-optimization means “an action leading to a ‘good’ consequence (according to a predetermined criterion) happens no more often than chance” (e.g. a rock).
Level-1 optimization means “an action leading to a ‘good’ consequence happens no more often than chance at first, but once it’s stumbled upon, it tends to be repeated in the future” (e.g. bacteria).
Level-2 optimization means “an action leading to a ‘good’ consequence is taken more often than chance from the start, because of foresight and planning” (e.g. a human).
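A toy sketch to make the three definitions concrete (the three-action environment and its “good” consequence are placeholders I made up):

```python
import random

ACTIONS = ["a", "b", "c"]

def consequence(action):
    # Placeholder environment: exactly one action happens to have the "good" consequence.
    return 1 if action == "b" else 0

def non_optimizer():
    """Non-optimization: the good outcome occurs no more often than chance."""
    return random.choice(ACTIONS)

def level1_optimizer(memory):
    """Level-1: act at random at first, but once an action has been observed to have
    the good consequence, repeat it (consequence feeds back into action choice)."""
    good = [a for a, c in memory if c > 0]
    return good[-1] if good else random.choice(ACTIONS)

def level2_optimizer(model):
    """Level-2: consult a model to foresee consequences and pick the good action
    from the very first step, before trying anything."""
    return max(ACTIONS, key=model)

baseline = non_optimizer()   # no feedback loop at all

# Level-1 run: stumble, remember, repeat.
memory = []
for _ in range(20):
    a = level1_optimizer(memory)
    memory.append((a, consequence(a)))

# Level-2 run: here the "model" is just the true consequence function, standing in for foresight.
chosen = level2_optimizer(model=consequence)
```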
Second, when you run a program:
Algorithm Land is where you find abstract mathematical entities like “variables”, “functions”, etc.
Real World is that place with atoms and stuff.
Now, when you run a program, you can think of what’s happening in Algorithm Land (e.g. a list of numbers is getting sorted) and what’s happening in the Real World (e.g. transistors are switching on and off). It’s really always going to be both at once.
And now let’s simplify things greatly by putting aside the case of world-modeling programs, which have a (partial, low-resolution) copy of the Real World inside Algorithm Land. Instead, let’s restrict our attention to a chess-playing program, or any other non-world-modeling program.
Now, in this case, when we think about Level-2 optimization, the foresight and planning involved entail searching exclusively through causal pathways that are completely inside Algorithm Land. (Why? Because without a world model, it has no way to reason about Real-World causal pathways.) In this case, I say there isn’t really anything much to worry about.
Why not worry? Think about classic weird AGI disaster scenarios. For example, the algorithm is optimizing for the “reward” value in register 94, so it hacks its RAM to overwrite the register with the biggest possible number, then seizes control of its building and the power grid to ensure that it won’t get turned off, then starts building bigger RAMs, designing killer nanomachines, and on and on. Note that ALL those things (1) involve causal pathways in the Real World (even if the action and consequence are arguably in Algorithm Land) and (2) would be astronomically unlikely to occur by random chance (which is what happens without Level-2 optimization). (I won’t say that nothing can go awry with Level-1 optimization—I have great respect for bacteria—but it’s a much easier situation to keep under control than rogue Level-2 optimization through Real-World causal pathways.)
Again, things that happen in Algorithm Land are also happening in the Real World, but the mapping is kinda arbitrary. High-impact things in Algorithm Land are not high-impact things in the Real World. For example, using RAM to send out manipulative radio signals is high-impact in the Real World, but just a random meaningless series of operations in Algorithm Land. Conversely, an ingeniously-clever chess move in Algorithm Land is just a random activation of transistors in the Real World.
(You do always get Level-1 optimization through Real-World causal pathways, with or without a world model. And you can get Level-2 optimization through Real-World causal pathways, but a necessary requirement seems to be an algorithm with a world-model and self-awareness, i.e. knowledge that there is a relation between things in Algorithm Land and things in the Real World.)
Just want to note that I like your distinctions between Algorithm Land and the Real World and also between Level-1 optimization and Level-2 optimization.
I think some discussion of AI safety hasn’t been clear enough on what kind of optimization we expect in which domains. At least, it wasn’t clear to me.
But a couple things fell into place for me about 6 months ago, which very much rhyme with your two distinctions:
1) Inexploitability only makes sense relative to a utility function, and if the AI’s utility function is orthogonal to yours (e.g. because it is operating in Algorithm Land), then it may be exploitable relative to your utility function, even though it’s inexploitable relative to its own utility function. See this comment (and thanks to Rohin for the post that prompted the thought).
2) While some process that’s optimizing super-hard for an outcome in Algorithm Land may bleed out into affecting the Real World, this would sort of be by accident, and seems much easier to mitigate than a process that’s trying to affect the Real World on purpose. See this comment.
Putting them together, a randomly selected superintelligence doesn’t care about atoms, or about macroscopic events unfolding through time (roughly the domain of what we care about). And just because we run it on a computer that from our perspective is embedded in this macroscopic world, and that uses macroscopic resources (compute time, energy), doesn’t mean it’s going to start caring about macroscopic Real World events, or start fighting with us for those resources. (At least, not in a Level-2 way.)
On the other hand, powerful computing systems we build are not going to be randomly selected from the space of possible programs. We’ll have economic incentives to create systems that do consider and operate on the Real World.
So it seems to me that a randomly selected superintelligence may not actually be dangerous (because it doesn’t care about being unplugged—that’s a macroscopic concept that seems simple and natural from our perspective, but would not actually correspond to something in most utility functions), but that the superintelligent systems anyone is likely to actually build will be much more likely to be dangerous (because they will model and/or act on the Real World).
Again, things that happen in Algorithm Land are also happening in the Real World, but the mapping is kinda arbitrary. High-impact things in Algorithm Land are not high-impact things in the Real World. For example, using RAM to send out manipulative radio signals is high-impact in the Real World, but just a random meaningless series of operations in Algorithm Land. Conversely, an ingeniously-clever chess move in Algorithm Land is just a random activation of transistors in the Real World.
It’s hard to be sure this separation will remain, though. An algorithm may accidentally hit upon unexpected techniques while learning, like row-hammering, or performing operations that cause the hardware to generate radio waves (as you point out), or otherwise behaving in unexpected ways that produce preferred outcomes by manipulating things in the “real world” outside the intended “algorithm land”.
For another example, I seem to recall a system that learned to win in a competitive environment by mallocing so much memory that it starved out its competitors running on the same system. It never knew about the real-world consequences of its actions, since it didn’t have access to information about other processes on the system, yet it carried out the behavior anyway. There are many other examples of this, and someone even collected them in a paper on arXiv, although I can’t seem to find the link now.
The point is that the separation between Algorithm Land and the Real World doesn’t exist except in our models. Even if you ran the algorithm on a computer with an air gap and placed the whole thing inside a Faraday cage, I’d still be concerned about unexpected leaks outside the sandbox of Algorithm Land into the Real World (maybe someone sneaks their phone in past security, and the optimizer learns to incidentally modify the fan on the computer it runs on to produce sounds that exploit the phone’s microphone to transmit information to it? The possible failure scenarios are endless). Trying to maintain the separation you are looking for is known generally as “boxing”, and although it’s likely an important part of a safe AI development protocol, many people, myself included, consider it inadequate on its own and not something we should rely on, but rather part of a security-in-depth approach.
OK, so I was saying here that software can optimize for something (e.g. predicting a string of bits on the basis of other bits) and it’s by default not particularly dangerous, as long as the optimization does not involve an intelligent foresight-based search through real-world causal pathways to reach the desired goal. My argument for this was (1) Such a system can do Level-1 optimization but not Level-2 optimization (with regards to real-world causal pathways unrelated to implementing the algorithm as intended), and (2) only the latter is unusually dangerous. From your response, it seems like you agree with (1) but disagree with (2). Is that right? If you disagree with (2), can you make up a scenario of something really bad and dangerous, something that couldn’t happen with today’s software, something like a Global Catastrophic Risk, that is caused by a future AI that is optimizing something but is not more specifically using a world-model to do an intelligent search through real-world causal pathways towards a desired goal?
Sure. Let’s construct the 0-optimizer. Its purpose is simply to cause there to be lots of 0s in memory (as opposed to 1s). It only knows about Algorithm Land, and even then it’s a pretty narrow model: it knows about memory and can read and write to it. Now at some point the 0-optimizer manages to get all the bits set to 0 in its addressable memory, so it would seem to have reached maximum attainment.
But it’s a hungry optimizer and keeps trying to find ways to set more bits to 0. It eventually stumbles upon a gap in the security of the operating system that allows it to gain access to memory outside its address space, so it can now set those bits to 0. Obviously it does all this “accidentally”, never knowing it’s using a security exploit; it just stumbles into it and sees memory getting written with 0s, so it’s happy (this has plenty of precedent; human minds are great examples of complex systems with limited introspective access that do lots of complex things without knowing how or why they are doing them). With some luck, it doesn’t immediately destroy itself and gets a chance to be hungry for more 0s.
Next it accidentally starts using the network interface on the computer. Although it doesn’t exactly understand what’s going on, it figures out how to get responses that just contain lots of 0s. Unfortunately for us what this is actually doing is performing a denial of service attack against other computers to get back the 0s. Now we have a powerful optimization process that’s hungry for 0s and it satisfies its hunger by filling our networks with garbage traffic.
A couple of hops on, it’s gone from denial-of-service attacks, to wiping out our ability to use Internet service, to wiping out our ability to use any EM communication channel, to generating dangerously high levels of radiation that kill all life on Earth.
This story involved a lot of luck, but my expectation is that we should not underestimate how “lucky” a powerful optimizer can be, given that evolution is a similarly ontologically simple process that nonetheless managed to produce some pretty complex results.
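Here is a toy sketch of the shape of the thing I’m describing (the flat memory list and the “intended region” are stand-ins I made up; real memory protection is obviously not a Python list): the loop only ever sees “how many zeros did that write gain me?”, so a write that happens to land outside its intended region is, from its perspective, just another rewarding action.

```python
import random

memory = [random.randint(0, 1) for _ in range(256)]   # stand-in for the machine's memory
INTENDED_REGION = 64                                  # the optimizer is only "supposed" to touch cells 0..63

def write_zero(addr):
    """The only feedback the optimizer gets: did this write gain a zero?"""
    gained = memory[addr]          # 1 if the cell used to hold a 1
    memory[addr] = 0
    return gained

def hungry_zero_optimizer(steps=10_000):
    best_addr = 0
    for _ in range(steps):
        # Hill-climb: mostly poke near whatever worked last, occasionally jump at random.
        if random.random() < 0.1:
            addr = random.randrange(len(memory))
        else:
            addr = min(max(best_addr + random.randint(-4, 4), 0), len(memory) - 1)
        if write_zero(addr):
            best_addr = addr       # "that was rewarding, stay around there"; no model, no self-awareness

hungry_zero_optimizer()
collateral = memory[INTENDED_REGION:].count(0)   # zeros written outside the region it was meant to touch
```

Nothing in that loop represents the boundary at all; whether it stays inside it is a fact about the surrounding Real World, not about anything in Algorithm Land.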