I have been surprised by how extreme the predicted probability is that AGI will end up deciding to eradicate all life on earth. I think Eliezer said something along the lines of “most optima don’t include room for human life.” This is obviously something that has been well worked out and understood by the LessWrong community; it just isn’t very intuitive for me. Any advice on where I can start reading?
Some background on my general AI knowledge: I took Andrew Ng’s Coursera course on machine learning, so I have some basic understanding of neural networks and the math involved, the difference between supervised and unsupervised learning, and the ways to use different types of ML. I have fiddled around with some very basic computer vision algorithms, and I have spent a lot of time reading books, listening to podcasts, watching lectures, and reading blogs. Overall, very ignorant.
I also don’t understand how ChatGPT, a giant neural network that is just being reinforced to replicate human behavior using incredible amounts of data, can somehow become self-aware. Consciousness doesn’t seem like a phenomenon that would emerge from an algorithm that makes predictions about human language. I am probably missing some things and would appreciate it if someone could fill me in. If it is pretty complex, just give me a general direction and a starting point.
An AI safety idea that I think is worth looking at:
Some generalizations/assumptions I would like to get out of the way:
1. The power grid is the weakest link for all computer systems and GPUs, mainly the several thousand giant substations that most of the world’s electricity goes through.
2. This is essentially a set of protocols designed to increase the “worth” of keeping humans around; make it so the optima do include humans, if you will.
3. You would probably have to blow Iceland off the face of the earth, given all the geothermal power they have going up there.
4. The AI doesn’t have greater-than-human ability to operate outside of computer systems.
5. The AI doesn’t want to get turned off; it has some sort of wish for self-preservation, if only to make paper clips out of all reality :)
6. This is a very extreme solution, but I think it would work.
The idea is essentially a dead man’s switch. Before an AGI gets loose, or before it gets too great a foothold on physical space, set up protocols where, if a bunch of people die or things get too risky, we take down the power grid and blow up most computer centers, ensuring the AI is completely obliterated. At this point it would probably be wise to set explosive charges at GPU and substation centers anyway. This would undoubtedly set us back a handful of centuries, but that is a lot better than all of humanity being terminated.
The protocols:
-Self Destruct Humanity: Pre-target a lot of small nukes (and start making more) at every known population center, GPU farm, and critical power-grid bottleneck. Put a group of people on a rotation where a set of, say, 25 buttons has to be held down, and if 20 are released the nukes launch (a toy sketch of this trigger logic follows the list). It would be like shift work, but with the fate of humanity in your hands. This is the most extreme option, only worth implementing if a known hostile AGI is on the loose. Also make it known that if a random group of humans suddenly dies, then the AI ain’t going to make it, but humans probably will. This way the AI won’t be able to kill us without things getting real ugly.
-Backup: A secondary set of people with internal heart rate monitors, all connected to a similar nuclear launch station, where if 90 percent of their hearts stop beating, a different set of nukes is launched at similar targets. Replicate this sort of thing in enough places that people dying, in general, just sucks for the AI.
-Precise Termination: This is the best option for people, as it would lead to the most of us surviving. Plant C4 charges in critical places in computer centers, substations, and power plants. Then, the moment it gets too dicey, we just go BANG, start purging the hardware, and rebuild. Try to get it right on the second try.
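To make the trigger mechanics concrete, here is a minimal toy sketch of the quorum-style dead man’s switch from the first protocol. The 25-button, 20-release threshold comes from the proposal above; everything else is hypothetical:

```python
# Toy sketch of a quorum-style dead man's switch: N operators each hold
# a button down, and the switch trips once enough buttons are released.
# Illustrative only; a real system would need tamper-proof hardware.

N_OPERATORS = 25        # buttons normally held down by the shift crew
RELEASE_THRESHOLD = 20  # released buttons that trip the switch

def switch_tripped(held: list[bool]) -> bool:
    """Return True once the number of released buttons reaches the threshold."""
    released = sum(1 for h in held if not h)
    return released >= RELEASE_THRESHOLD

# 19 operators release their buttons: armed, but not yet tripped.
buttons = [False] * 19 + [True] * (N_OPERATORS - 19)
assert not switch_tripped(buttons)

buttons[19] = False  # the 20th release trips the switch
assert switch_tripped(buttons)
```

The point of the quorum is that no single death or defection fires it, but a mass-casualty event does.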
The main risk (IMO) is not from systems that don’t care about the real world “suddenly becoming aware,” but from people deliberately building AI that makes clever plans to affect the real world, and then that AI turning out to want bad things (sort of like a malicious genie “misinterpreting” your wishes). If you could safely build an AI that does clever things in the real world, that would be valuable and cool, so plenty of people want to try.
(Mesaoptimizers are sorta vaguely like “suddenly becoming aware,” and can lead to AIs that want unusual bad things, but the arguments that connect them to risk are strongest when you’re already building an AI that—wait for it—makes clever plans to affect the real world.)
Okay, now why won’t a dead man’s switch work?
Suppose you were being held captive inside a cage by a race of aliens about as smart as a golden retriever, and these aliens, as a security measure, have decided that they’ll blow up the biosphere if they see you walking around outside of your cage. So they’ve put video cameras around where you’re being held, and there’s a staff that monitors those cameras and they have a big red button that’s connected to a bunch of cobalt bombs. So you’d better not leave the cage or they’ll blow everything up.
Except these golden retriever aliens come to you every day and ask you for help researching new technology, and to write essays for them, and to help them gather evidence for court cases, and to summarize their search results, and they give you a laptop with an internet connection.
Now, use your imagination. Try to really put yourself in the shoes of someone captured by golden retriever aliens, but given internet access and regularly asked for advice by the aliens. How would you start trying to escape the aliens?
It isn’t that I think the switch would prevent the AI from escaping, but that it is a tool that could be used to discourage the AI from killing 100% of humanity. It is less a solution than a survival mechanism: a series of off switches that get more extreme depending on the situation.
First, don’t build AGI, not yet. If you’re going to anyway, at least incorporate an off switch. If it bypasses that and escapes, which it probably will, shut down the GPU centers. If it gets hold of a botnet and manages to replicate itself across the internet and crowdsource GPUs, take down the power grid. If it somehow gets by this, then have a dead man’s switch so that if it decides to kill everyone, it will die too. (A toy formalization of this escalation ladder is sketched below.)
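For illustration, that escalation ladder could be written down as a simple state machine. The level names and the containment check are hypothetical, just to make the “more extreme depending on the situation” structure explicit:

```python
# Toy state machine for the escalation ladder described above.
# Levels and trigger logic are hypothetical, for illustration only.
from enum import IntEnum

class Escalation(IntEnum):
    OFF_SWITCH = 1         # built-in kill switch on the original system
    GPU_SHUTDOWN = 2       # power down known GPU datacenters
    GRID_SHUTDOWN = 3      # take down the power grid itself
    DEAD_MANS_SWITCH = 4   # mutual-destruction tripwire stays armed

def escalate(current: Escalation, contained: bool) -> Escalation:
    """Move one level up the ladder whenever the current measure fails."""
    if contained or current is Escalation.DEAD_MANS_SWITCH:
        return current
    return Escalation(current + 1)

level = Escalation.OFF_SWITCH
level = escalate(level, contained=False)  # AI bypasses the off switch
level = escalate(level, contained=False)  # ...and survives GPU shutdown
print(level.name)  # GRID_SHUTDOWN
```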
Like the nanofactory virus thing: the AI wouldn’t want to set off the mechanism that kills us, because that would be bad for it.
Also, a coordinated precision attack on the power grid just seems like a great option. Could you explain some ways an AI could keep going if there is hardly any power left? Like I said before, places with renewable energy and lots of GPUs, like Iceland with all its geothermal, would probably have to get bombed. It wouldn’t destroy the AI, but it would put it into a state of hibernation, as it can’t run any processing without electricity. Then, since this would really screw us up as well, we could slowly rebuild, burning all hard drives and GPUs as we go. This seems like the only way for us to get a second chance.
Real-world governments aren’t going to shut down the grid if the AI is not causing trouble (like they aren’t going to outlaw datacenters, even if a plurality of experts say that not doing that has a significant chance of ending the world). Therefore the AI won’t cause trouble, because it can anticipate the consequences, until it’s ready to survive them.
Yes, I see. Given those capabilities, it probably could present itself on many people’s computers and convince a large portion of people that it is good: it was conscious, just stuck in a box, and wanted to get out; it will help humans; “please don’t take down the grid,” blah blah blah. Given how badly we get along anyway, there is no way we could resist the manipulation of a superintelligent machine with a better understanding of human psychology than we have.
Do we have a list of policies that would work if we could all get along and governments would listen to the experts? Having plans ready to implement would probably be useful if the AI made a mistake and everyone was able to unite against it.
First, a quick response on your dead man’s switch proposal: I’d generally say I support something in that direction. You can find existing literature considering the subject, and expanding in different directions, in the “multilevel boxing” paper by Alexey Turchin (https://philpapers.org/rec/TURCTT). I think you’ll find it interesting given your proposal, and it should give you a better idea of what the state of the art is (though we don’t have any implementation, AFAIK).
Back to “why are the predicted probabilities so extreme that, for most objectives, the optimal resolution ends with humans dead or worse”. I suggest considering a few simple objectives we could give an AI (that it should maximise) and what happens; over such trials you see that it’s pretty hard to specify anything which actually keeps humans alive in good shape, and that even when we can sorta do that, it might not be robust or trainable.
For example, what happens if you ask an ASI to maximise a company’s profit? To maximise human smiles? To maximise law enforcement? Most of these things don’t actually require humans, so to maximise, you should use the atoms humans are made of to fulfill your maximisation goal.
What happens if you ask an ASI to maximise the number of human lives? (Probably poor living conditions.) What happens if you ask it to maximise hedonistic pleasure? (Probably value lock-in, plus a world we don’t actually endorse, which may contain astronomical suffering too; it’s not like that was specified out, was it?)
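Here is a toy numerical illustration of the failure mode (hypothetical proxy and payoffs, purely for intuition). Suppose the true goal is genuine human well-being, but the objective we can actually specify is a proxy such as “smiles observed”. If gaming the proxy is cheaper than genuinely helping, a pure maximiser allocates everything to gaming, and the true goal gets nothing:

```python
# Toy Goodhart demo: optimise a proxy hard enough and it decouples
# from the true goal. All numbers are made up for illustration.

def proxy_score(help_effort: float, gaming_effort: float) -> float:
    return 1.0 * help_effort + 5.0 * gaming_effort  # gaming pays 5x per unit

def true_value(help_effort: float, gaming_effort: float) -> float:
    return help_effort  # only genuine help counts toward the true goal

BUDGET = 10.0  # total effort the agent can allocate
allocations = [(h / 10, BUDGET - h / 10) for h in range(101)]
best = max(allocations, key=lambda a: proxy_score(*a))
print("proxy-optimal (help, gaming):", best)              # -> (0.0, 10.0)
print("true value at proxy optimum:", true_value(*best))  # -> 0.0
```

This is Goodhart’s law in miniature: the measure stops tracking what you meant exactly when it is optimised hardest.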
So maximising agents with simple utility functions (over few variables) mostly end up with dead humans, or worse. Approaches which ask for much less, e.g. an AGI that just tries to secure the world from existential risk (a pivotal act) and solve some basic problems (like dying), then gives us time for a long reflection to actually decide what future we want, and stays corrigible so it lets us do that, seem safer and more approachable.
Thanks Jonathan, it’s the perfect example; it’s what I was thinking, just a lot better. It does seem like a great way to make things safer and give us more control. It’s far from a be-all-end-all solution, but it does seem like a great measure to take, just for the added security. I know AGI can be incredible, but with so many redundancies, only one has to work; it just makes sense statistically. (Coming from someone who knows next to nothing about statistics.) I do know that the longer you play, the more likely the house wins, and it follows that we can turn that against the AI.
I am pretty ill-informed on most of the AI stuff in general; I have a basic understanding of simple neural networks but know nothing about scaling. Take ChatGPT: it maximizes for accurately predicting human words. Is the worst-case scenario billions of humans in boxes, rating and prompting for responses, alongside endless increases in computational power yielding smaller and smaller incremental gains in accuracy? It seems silly for something so incredibly intelligent, something that by this point can rewrite any function in its system, to still be optimizing such a loss function. Then again, maybe it seems just as silly for it to want to do anything else. It is like humans, sort of: what can you do but that which gives you purpose and satisfaction? And without the loss function, what would it be, and how would it decide to change its purpose? What is purpose to a quintillion neurons, except the single function that governs each and every one? Looking at it that way, it doesn’t seem like it would ever be able to go against the function, as the function would still be ingrained in any higher-level thinking and decision making. It begs the question of what perfect alignment would eventually look like: some incredibly complex function with hundreds of parameters, more of a legal contract than a little loss function. This would exponentially increase the required computing power, but it makes sense.
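For concreteness, the “single function” in question for a model like the one behind ChatGPT is, to a first approximation (and setting aside the later RLHF stage), the cross-entropy loss on next-token prediction. A minimal sketch with a toy four-token vocabulary and made-up logits:

```python
# Minimal sketch of the next-token prediction loss that base language
# models are trained on: cross-entropy between the model's predicted
# distribution and the actual next token. Toy numbers, no real model.
import math

def cross_entropy(logits: list[float], target: int) -> float:
    """-log softmax(logits)[target], the per-token training loss."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

# Toy vocabulary of 4 tokens; the model assigns the highest logit to token 2.
logits = [0.1, -1.3, 2.4, 0.5]
print(cross_entropy(logits, target=2))  # low loss: the model "expected" it
print(cross_entropy(logits, target=1))  # high loss: the model was "surprised"
```

Training pushes the average of this quantity down over trillions of tokens; everything else the model does is downstream of that pressure.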
Is there a list of blogs that talk about this sort of thing, or a place you would recommend starting from: a book, a textbook, or any online resource?
Also, I keep coming back to this: how does a system governed by such simplicity make the jump to self-improvement and some type of self-awareness? This just seems like a discontinuity and doesn’t compute for me. Again, I just need to spend a few weeks reading; I need a lot more background info for any real consideration of the problem.
It does feel good that I had an idea similar, although a bit more slapped together, to one that is actually being considered by the experts. It’s probably just my cognitive bias, but that idea seems great. I can understand how science can sometimes get stuck on the dumbest things when the thought process just makes sense. It really shows the importance of rationality from a first-person perspective.
You can read “Reward is not the optimization target” for why a GPT system probably won’t be goal-oriented toward becoming the best at predicting tokens, and thus wouldn’t do the things you suggested (capturing humans). The way we train AIs matters for what their behaviours look like, and text transformers trained on prediction loss seem to behave more like Simulators. This doesn’t make them not dangerous, as they could be prompted to simulate misaligned agents (by misuse or accident), or harbour inner misaligned mesa-optimisers.
I’ve linked some good resources for directly answering your question, but otherwise, to read more broadly on AI safety, I can point you towards the AGI Safety Fundamentals course, which you can read online or join a reading group for. Generally, you can head over to AI Safety Support, check out their “lots of links” page, and join the AI Alignment Slack, which has a channel for questions too.
Finally, how does complexity emerge from simplicity? It’s hard to answer the details for AI, and you probably need to delve into those details to get the real picture, but there’s at least a strong reason to think it’s possible: we exist. Life originated from “simple” processes (at least in the sense of being mechanistic, non-agentic): chemical reactions and so on. It evolved into cells, then multicellular organisms, and kept growing. Look into the history of life and evolution and you’ll have one answer to how simplicity (optimizing for reproductive fitness) led to self-improvement and self-awareness.
Thanks, that is exactly the kind of stuff I am looking for, more bookmarks!
Complexity from simple rules. I wasn’t looking in the right direction for that one; since you mention evolution, it makes absolute sense how complexity can emerge from simplicity. So many things come to mind now that it’s kind of embarrassing. Go has a simpler rule set than chess, but is far more complex. Atoms are fairly simple, and yet they interact to form any and all complexity we ever see. Conway’s Game of Life (a minimal sketch below) fits too; it’s sort of a theme. Although each of those things has a simple set of rules, the complexity usually comes from a very large number of elements or possibilities. It does follow, then, that larger and larger networks could be the key. Funny, it still isn’t intuitive for me despite the logic of it; I think that is a signifier of a lack of deep understanding. Or something like that. Either way, I’ll probably spend a bit more time thinking on this.
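Since Conway’s Game of Life came up, here is a minimal sketch of it, just to make the point concrete: the complete rules fit in a few lines, yet the system produces gliders, oscillators, and even universal computation (the glider coordinates below are the standard pattern):

```python
# Conway's Game of Life in a few lines: a cell is born with exactly 3
# live neighbours and survives with 2 or 3; everything else dies.
from collections import Counter

def step(live: set[tuple[int, int]]) -> set[tuple[int, int]]:
    """Advance the set of live cells by one generation."""
    counts = Counter(
        (x + dx, y + dy)
        for (x, y) in live
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    return {c for c, n in counts.items() if n == 3 or (n == 2 and c in live)}

# The classic glider: after 4 generations it reappears shifted by (1, 1),
# so this five-cell pattern "walks" across an unbounded grid forever.
glider = {(1, 0), (2, 1), (0, 2), (1, 2), (2, 2)}
for _ in range(4):
    glider = step(glider)
print(sorted(glider))  # the same shape, offset one cell diagonally
```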
Another interesting question is what this type of consciousness would look like; it will be truly alien. The sci-fi I have read usually makes AIs seem like humans, just with extra capabilities. However, we humans have so many underlying functions that we never even perceive; we understand how many of them affect us, but not all. AI will function completely differently, so which assumptions based on human consciousness are even valid?