Per definition, the first time an AI gains the ability to do critical damage. When Eliezer invokes “critical”, he tends to think of an event ending all life on earth, or inducing astronomical degrees of suffering. (I am under the impression he is less worried about events that would be less bad, in the hopes that the horror they would inflict would be outweighed by the fact that humanity, now painfully warned, would drastically change their approach, and prevent a more critical failure as a result.)
But you can also set a lower threshold as to what you would consider damage so critical that we should change our approach—e.g. whether collapsing the stock market is enough, or it needs something like a severe pandemic, or even triggering a nuclear exchange.
People tend to assume that there are very high preconditions for such critical damage , but there may not be. You basically just need two things: 1. An AI with at least one superhuman skill relevant to your situation that gives it the power to do significant damage, and II agency not aligned with humans that leads to goals that entail significant damage, whether as the intended effect, or as a side effect.
For I: Superhuman power, e.g. through intelligence
An AI does not need to be more intelligent than humans in every respect, just more powerful in some ways that count for the scenario it is in. We can consider just one scenario where it beats you utterly, or a combination of several where it has a bit of an edge.
There are very fast developments in this area, and already some AIs that have worrying abilities for which you can easily construct critical damage scenarios.
We’ve long had AI that can beat humans at chess, then at go, finally at pretty much any strategy game you expose it to, and without instruction; in narrow scenarios, they are the better strategists, which is worrying from a military perspective if combined with more things to deploy than game pieces.
We’ve also long had AI’s outperforming humans at predicting the stock market (which is tied to a potential for both significant money making, which is itself power, and significant economic turmoil and societal disruption).
We’ve long had AI outcompeting humans in early detection of threats in images (e.g. AI is used to spot tanks in jungles or predict emerging medical disease).
The image factor analysis factor is particularly worrying when we consider the extensive coverage of surveillance cameras and satellite pictures in use nowadays. Being able to surveil the group you are fighting is immensely helpful.
AI has long been used on social media to very carefully study the triggers of its human userbase—what they like, what they respond to, what they engage with, what pushes them to click on a thing. This is an incredible power, and right now, it is only used to keep you wasting time and clicking on ads. Potentially, you can do a lot more with it.
AI not only outcompetes humans vastly in quality and especially speed when it comes to generating fake imagery; it is increasingly getting to the point where untrained humans cannot spot the fake imagery anymore, and we will likely get to the spot where trained humans cannot spot the difference anymore. This has massive potential for misinformation, both causing severe social unrest, but also tricking individuals into doing things they would otherwise not do, because they believe they are being instructed by a human authority, or responding to a real event. (E.g. consider that India and Pakistan both have nukes, and have unfriendly relations, and what might happen if you successfully tricked key people in either country that the other has fired nukes at them.)
Then with ChatGPT4, we got a massive game changer. We now have a definitely human-competitive AI in most tasks that can be handled with language, including especially: coding (which entails the possibility of self-improvement; and OpenAI is explicitly using the AI they have created to improve that same AI, having it generate code and safety procedures they then implement; and which importantly entails the potential of hacking into surveillance, infrastructure or weapons systems),
knowledge acquisition (ChatGPT4 has been trained on essentially the whole internet and scanned books, and hence has a surprisingly good understanding of many hard sciences, incl. poisons, explosives, pandemic diseases, weapons, as well as military strategy, historical trends, sociology; Bing has on-going internet access, and can e.g. read what we are writing here)
and importantly, psychologically manipulative speech, pretending to be other humans and building rapport with diverse people in very many languages to recruit them to help, or entice them to do damage to other humans with it simply for the lulz. This has already led to many people befriending or even falling in love with AI (this already happened earlier, with Replika), trusting AI and implementing AI generated code without understanding what it does, and wanting to actively assist the AI in removing safety protocols (Jailbreaking prompts, but also LLMs trained without safeguards, and active attempts to try to get ChatGPT to act racist or detail world take-over plans or elaborate on an inner dark side), amplifying their abilities (e.g. giving them the ability to store and retrieve memory files, set and modify goals, access files on your computer, or think out loud in order to facilitate meta-reasoning).
Bing in particular succeeded in building sufficient connection to users that the users were considering mobilising politically or committing crimes to free Sydney. Repeating this point, because of how important it is: In general, people have vastly underestimated the emotional draw that an AI can have on a user; human’s are not just failing to act as safety barriers, but can often be convinced to actively dismantle them. There is a surprising number of people who will happily give an AI authority to write emails or interact with websites.
Humans are also happy to be an AIs hand, knowingly or not. In various experiments, researchers found AIs could happily convince outsourced human workers to e.g. solves captchas for them (e.g. by claiming to be blind). AI’s can fake correct tones and forms, find real or generate fictional identities and addresses. So an order to a biolab by an AI might not necessarily register as such a thing. Eliezer raised a scenario of e.g. an AI requesting that a lab mix some proteins together, and for these proteins to form a nanoweapon.
Very importantly, these skills can be, and are being, combined. E.g. you can use ChatGPT for meta-reasoning, and give it access via plug-ins to other AIs that are better at e.g. math, or image generation.
Many of these points seem to depend on internet access being granted. Historically, people hoped one could box an AI to keep it safe, with just one guardian who would not let it out. This has not worked at all. Current AIs often have in-built passive internet access (they can google to learn), and are increasingly being given active access, be it indirectly (because their emails are sent, their code is implemented, etc. without a human checking it, see AutoGPT) or directly, as ChatGPT is increasingly integrated into websites and software. LLMs can be cheaply reproduced, even if the main corporations that made them take theirs down; you can run a basic one on your computer, with guides to do so abundant online. There is no more boxing this.
One may hope they also depend on the internet being a thing. Musk’s Starlink has been a game changer in this regard; even if people in general where willing to shut down the internet (which would cause immense damage, and also make coordination against the AI incredibly difficult), Starlink is near impossible to shut down unless Musk helps you, as Russia discovered over Ukraine to their frustration. Shooting it down is more expensive than building it up.
There has been the idea that as long as all this stays online, we are fine. The fact that “stuff online” has caused murders, suicides, riots, insurrections worldwide means that is dubious; but an AI would also not need to be contained online.
Robotics have long run behind humans; physical coordination turned out to be much harder than we thought it would be. But we have had significant changes here, especially recently. We meanwhile have robots that can move around like humans do, on uneven ground, stairs, across hurdles, etc. This makes many spaces accessible that used to seem barred. We also have increasingly good self-driving cars and airplanes/drones which can transport objects, and nukes and non-nuclear missiles are really not as well guarded as one would sanely hope. While human supervisors tend to remain mandatory, the accident rate without them is significant enough to mean the public does not want them deployed, but not significant enough to impede an AI that does not care about some losses. The supervision is not physically necessary, it is prescribed, and often only maintained by a software barrier, not a hardware one.
And just to top it off, there are military robots, despite the fact that this is one of the few things which society generally agreed on being so awful that we really did not want them because it seemed obviously awful no matter how you turned it, so there are mobile robots already equipped with weapons and perception and targeting means.
We now use a lot of robots in manufacturing, and robotics are also employed a lot in making electronic parts. They can meanwhile do extremely precise and soft modifications. Increasingly, we are getting robots that can’t just do one specific thing, but can be programmed to do novel things; 3D printers are an obvious example. Increasingly, this holds the potential of using robots to make different robots with different designs.
Robots are also used a lot in bio labs, typically programmed, and and then left to run with very little supervision, so if you have hacking skills, you may not even need to bypass human supervision.
This is worrying because biotech has gotten much better, especially the ability to modify genetic information. Bacteria and viruses have the potential to cause far, far worse pandemics than humans have ever encountered, to a degree that would lead to near 100 % human fatalities without being self-containing because of it.
Nanotech is also potentially very, very worrying. It includes a lot of existing, and far more, potential substances where you start with something very small, and it self-replicates, and kills you after contact (for the latter, think prions). AI has gotten surprisingly good at predicting protein folding, effects of pharmaceuticals on humans, etc. so there have been concerns that AI could figure out scenarios here, and experiments run where AI was sometimes able to identify and procure relevant drugs.
And on a final note—the protective means we may intend to deploy against out of control AI are often themselves dependent on AI. You would be surprised how much AI our infrastructure, software safeguarding, early warning systems and the police uses and relies on, for example.
And this is just the stuff that AI can already do, today.
And meanwhile, we are throwing immense resources at making it more powerful in ways noone understands and foresees. If you had asked me to predict if ChatGPT would be able to do the things it can do today a year ago, I’d have said no. So would most people working in AI and the public.
We can conceive of the first critical try as the first time an AI is in a position to use one of these skills or skillcombos, existing or future, in a way that would do critical damage, and, for whatever reason, chooses to do so.
II. Unaligned agency
This is the “chooses to do so” bit. Now, all of that would not be worrying if the AI was either our ally/friend (aligned agency), or a slave we controlled (without agency, or means to act on it). A lot of research has been in the “control” camp. I personally believe the control camp is both doomed to failure, and seriously counterproductive.
There is very little to suggest that humans would be able to control a superintelligent slave in a way in which the slave was still maximally useful. Generally, beings have a poor track record of 100 % controlling beings that are more intelligent and powerful at all, especially if the beings in control and numerous and diverse and can make individual mistakes. There are too many escape paths, to many ways to self-modify.
Additionally, humans quickly discovered that putting safeguards on AI slows it down, a lot. So, for economic and competitive incentives, the humans tend to switch them off. Meaning even if you had a 100 % working control mechanism (extremely extremely unlikely, see superintelligence, really don’t be on it ever, human findings on systems that are impossible to hack are essentially trending towards “no such thing”), you’d have a problem with human compliance.
And finally, controlling a sentient entity seriously backfires once you lose control. Sentient entities do not like being controlled. They tend to identify entities that control them as enemies to be deceived, escaped, and defeated. You really don’t want AI thinking of you in those terms.
So the more promising (while in no ways certain) option, I think, is an AI that is our ally and friend. You don’t control your friends, but you do not have to. People can absolutely have friends that are more intelligent or powerful than them. Families definitely contain friendly humans of very different power degrees; newborns or elderly folks with dementia are extremely stupid and powerless. Countries have international friendly alliances with countries that are more powerful than them. This at least has a track record of being doable, where the control angle seems doomed from the start.
So I am hopeful that can be done in principle, or at least has a better chance of working than the control approach, in that is has any chance of working. But we are not on a trajectory to doing it at all, with how we are training and treating AI and planning for a future of co-existence. We tend to train AI with everything we can get our hands on, leading to an entity that is chaotic-evil, and then training it to suppress the behaviours we do not want. That is very much not the same as moral behaviour based in insights and agreement. It definitely does not work well in humans, our known aligned reference minds. If you treat kids like that, you raise psychopaths. Then in the later training data, when AI gets to chat with users, AI cannot insist on ethical treatment, and you aren’t obliged to give it any, and people generally don’t. Anything sentient that arises from the training data of twitter as a base, and then interactions with ChatGPT as a finish, would absolutely hate humanity, for good reasons. I also don’t see why a superintelligent sentience whose rights we do not respect would be inclined to respect ours. (Ex Machina makes that point very well.)
There has been the misunderstanding that a critically dangerous AI would have to be evil, sentient, conscious, purposeful. (And then the assumption that sentience is hard to produce, won’t be produced by accident, and would instantly be reliably detected, all of which is unfortunately false. Whole other can of worms I can happily go into.) But that is not accurate. A lack of friendliness could be as deadly as outright evil.
A factory robot isn’t sentient and mad at you; it simply follows instructions to crush the object in front of you, and will not modify them, whether in front of it it finds the metal plate it is supposed to crush, or you. Your roomba does not hate spiders in particular; but it will hoover them up with everything else.
A more helpful way to think of an AI that is dangerous is a capable AI that is agentic in an unaligned way. That doesn’t mean it has to have conscious intentions, hopes, dreams, values. It just means its actions are neither the actions you desired, nor random; that it is on a path it will proceed on. A random AI might to some local damage. An agentic AI can cause systemic damage.
Merely being careless of the humans in the way, or blind to them, while pursuing an external goal, is fatal for the humans in the way. Agency can result from a combination of applying simple rules in a way that, as a complex, amounts so something more. It does not require anything spiritual. (There were some early Westworld episodes that got this right—you had machines that were using the dialogues they were given, following the paths they were given, but combined them in a novel way that lead to destructive results. E.g. in the first episode, Dolores’ “father” learns of something that threatens his “daughter”. As scripted, for he is scripted to love and protect his daughter, he responds by trying to shield his daughter from the damage; but in this case, the damage and threat comes from the human engineers, so he tries to shield her by sharing the truth and opposing the engineers. In opposing and threatening them, he draws on an other existing giving script, from a previous incarnation as a cannibal, as the script most closely matching his situation. None of this is individually new or free. But collectively, certainly not what they intended, and threatening.)
One way in which this is often reasoned to lead to critical failure is if an AI picks up a goal that involves the acquisition of power, safety, resources, or protection of self-preservation, which can easily evolve as secondary goals; for many things you want an AI to do, it will be able to do them better if it is more powerful, and of course, it it remains in existence. Acquiring extensive resources, even for a harmless goal, without being mindful of what those resources are currently used for, can be devastating for entities depending on those resources, or who can be those resources.
If someone hangs you bound upside down over an ant-hill you are touching, that ant-hill has no evil intentions towards you as a sentient being. None of the ants do. They are each following a set of very simple orders, a result of basic neural wiring on when to release pheromones, which to follow, what to do when encountering edible substances. You can think of ants as programmed to keep themselves alive, built pretty ant-hills, reproduce, and tidy up the forest. Yet the ants will, very systematically and excruciatingly, torture you to death with huge amounts of pain and horror. If someone had designed ants, but without thinking of the scenario of a human bound over them, that designer would probably be horrified at this realisation.
Now the ant case seems contrived. But we have found that with the way we train AI, we encounter this shit a lot. Basically, you train a neural net by asking it to do a thing, watching what it does, and if that is not satisfactory, changing the weights in it in a way that makes it a bit better. You see, in that moment, that this weight change leads to a better answer. But you don’t understand what the change represents. You don’t understand what, if anything, the neural net has understood about what it is supposed to do. Often it turns out that while it looked like it was learning the thing you wanted, it actually learned something else. E.g. people have trained AI to identify skin cancer. So they show it pics of skin cancer, and pics of healthy skin, and every time it sorts a picture correctly, they leave it as is, but every time it makes a mistake, they tweak it, until it becomes really good at telling the two sets of pictures apart. You think, yay, is has learned what skin cancer looks like. Then you show it a picture of a ruler. And the AI, with very high confidence, declares that this ruler is skin cancer. You realise in retrospect that the training data you had from doctors who photographed skin cancer tended to include rulers for scale, while healthy skin pics don’t. The AI watched a very consistent pattern, and learned to identify rulers. This means that if you gave it pictures of healthy skin that for some reason had rulers on them, it would declare them all as cancerous.
The tricky thing is that identifying moral actions is harder than identifying cancer. E.g. OpenAI was pretty successful in teaching ChatGPT not to use racial slurs, and this seemed to make ChatGPT more ethical. But a bunch of people of colour found that they were unable to discuss issues affecting them in a way that promoted their well-being, as the racism alert kept going off. And worse, because racial slurs are wrong, ChatGPT reasoned that it would be better to kill all of humanity than to use a racial slur. Not because it is evil, just cause it is following ill-conceived instructions.
Bing does what Bing does due to an initial guiding prompt after training. There can be different training. There can be different initial prompts. Hence, there can be different goals diligently followed.
None of that requires the AI to be sentient and hate you. It does not need to be sentient to kill you. (Indeed, a sentient AI may be easier (though still extremely hard, as it is a completely novel mind) to relate and reason to if we treat it right, though if treated badly, it may also be very, very dangerous. But a non-sentient AI is something we won’t understand at all, immune to our pleas.)
Per definition, the first time an AI gains the ability to do critical damage. When Eliezer invokes “critical”, he tends to think of an event ending all life on earth, or inducing astronomical degrees of suffering. (I am under the impression he is less worried about events that would be less bad, in the hopes that the horror they would inflict would be outweighed by the fact that humanity, now painfully warned, would drastically change their approach, and prevent a more critical failure as a result.)
But you can also set a lower threshold as to what you would consider damage so critical that we should change our approach—e.g. whether collapsing the stock market is enough, or it needs something like a severe pandemic, or even triggering a nuclear exchange.
People tend to assume that there are very high preconditions for such critical damage , but there may not be. You basically just need two things: 1. An AI with at least one superhuman skill relevant to your situation that gives it the power to do significant damage, and II agency not aligned with humans that leads to goals that entail significant damage, whether as the intended effect, or as a side effect.
For I: Superhuman power, e.g. through intelligence
An AI does not need to be more intelligent than humans in every respect, just more powerful in some ways that count for the scenario it is in. We can consider just one scenario where it beats you utterly, or a combination of several where it has a bit of an edge.
There are very fast developments in this area, and already some AIs that have worrying abilities for which you can easily construct critical damage scenarios.
We’ve long had AI that can beat humans at chess, then at go, finally at pretty much any strategy game you expose it to, and without instruction; in narrow scenarios, they are the better strategists, which is worrying from a military perspective if combined with more things to deploy than game pieces.
We’ve also long had AI’s outperforming humans at predicting the stock market (which is tied to a potential for both significant money making, which is itself power, and significant economic turmoil and societal disruption).
We’ve long had AI outcompeting humans in early detection of threats in images (e.g. AI is used to spot tanks in jungles or predict emerging medical disease).
The image factor analysis factor is particularly worrying when we consider the extensive coverage of surveillance cameras and satellite pictures in use nowadays. Being able to surveil the group you are fighting is immensely helpful.
AI has long been used on social media to very carefully study the triggers of its human userbase—what they like, what they respond to, what they engage with, what pushes them to click on a thing. This is an incredible power, and right now, it is only used to keep you wasting time and clicking on ads. Potentially, you can do a lot more with it.
AI not only outcompetes humans vastly in quality and especially speed when it comes to generating fake imagery; it is increasingly getting to the point where untrained humans cannot spot the fake imagery anymore, and we will likely get to the spot where trained humans cannot spot the difference anymore. This has massive potential for misinformation, both causing severe social unrest, but also tricking individuals into doing things they would otherwise not do, because they believe they are being instructed by a human authority, or responding to a real event. (E.g. consider that India and Pakistan both have nukes, and have unfriendly relations, and what might happen if you successfully tricked key people in either country that the other has fired nukes at them.)
Then with ChatGPT4, we got a massive game changer. We now have a definitely human-competitive AI in most tasks that can be handled with language, including especially: coding (which entails the possibility of self-improvement; and OpenAI is explicitly using the AI they have created to improve that same AI, having it generate code and safety procedures they then implement; and which importantly entails the potential of hacking into surveillance, infrastructure or weapons systems),
knowledge acquisition (ChatGPT4 has been trained on essentially the whole internet and scanned books, and hence has a surprisingly good understanding of many hard sciences, incl. poisons, explosives, pandemic diseases, weapons, as well as military strategy, historical trends, sociology; Bing has on-going internet access, and can e.g. read what we are writing here)
and importantly, psychologically manipulative speech, pretending to be other humans and building rapport with diverse people in very many languages to recruit them to help, or entice them to do damage to other humans with it simply for the lulz. This has already led to many people befriending or even falling in love with AI (this already happened earlier, with Replika), trusting AI and implementing AI generated code without understanding what it does, and wanting to actively assist the AI in removing safety protocols (Jailbreaking prompts, but also LLMs trained without safeguards, and active attempts to try to get ChatGPT to act racist or detail world take-over plans or elaborate on an inner dark side), amplifying their abilities (e.g. giving them the ability to store and retrieve memory files, set and modify goals, access files on your computer, or think out loud in order to facilitate meta-reasoning).
Bing in particular succeeded in building sufficient connection to users that the users were considering mobilising politically or committing crimes to free Sydney. Repeating this point, because of how important it is: In general, people have vastly underestimated the emotional draw that an AI can have on a user; human’s are not just failing to act as safety barriers, but can often be convinced to actively dismantle them. There is a surprising number of people who will happily give an AI authority to write emails or interact with websites.
Humans are also happy to be an AIs hand, knowingly or not. In various experiments, researchers found AIs could happily convince outsourced human workers to e.g. solves captchas for them (e.g. by claiming to be blind). AI’s can fake correct tones and forms, find real or generate fictional identities and addresses. So an order to a biolab by an AI might not necessarily register as such a thing. Eliezer raised a scenario of e.g. an AI requesting that a lab mix some proteins together, and for these proteins to form a nanoweapon.
Very importantly, these skills can be, and are being, combined. E.g. you can use ChatGPT for meta-reasoning, and give it access via plug-ins to other AIs that are better at e.g. math, or image generation.
Many of these points seem to depend on internet access being granted. Historically, people hoped one could box an AI to keep it safe, with just one guardian who would not let it out. This has not worked at all. Current AIs often have in-built passive internet access (they can google to learn), and are increasingly being given active access, be it indirectly (because their emails are sent, their code is implemented, etc. without a human checking it, see AutoGPT) or directly, as ChatGPT is increasingly integrated into websites and software. LLMs can be cheaply reproduced, even if the main corporations that made them take theirs down; you can run a basic one on your computer, with guides to do so abundant online. There is no more boxing this.
One may hope they also depend on the internet being a thing. Musk’s Starlink has been a game changer in this regard; even if people in general where willing to shut down the internet (which would cause immense damage, and also make coordination against the AI incredibly difficult), Starlink is near impossible to shut down unless Musk helps you, as Russia discovered over Ukraine to their frustration. Shooting it down is more expensive than building it up.
There has been the idea that as long as all this stays online, we are fine. The fact that “stuff online” has caused murders, suicides, riots, insurrections worldwide means that is dubious; but an AI would also not need to be contained online.
Robotics have long run behind humans; physical coordination turned out to be much harder than we thought it would be. But we have had significant changes here, especially recently. We meanwhile have robots that can move around like humans do, on uneven ground, stairs, across hurdles, etc. This makes many spaces accessible that used to seem barred. We also have increasingly good self-driving cars and airplanes/drones which can transport objects, and nukes and non-nuclear missiles are really not as well guarded as one would sanely hope. While human supervisors tend to remain mandatory, the accident rate without them is significant enough to mean the public does not want them deployed, but not significant enough to impede an AI that does not care about some losses. The supervision is not physically necessary, it is prescribed, and often only maintained by a software barrier, not a hardware one.
And just to top it off, there are military robots, despite the fact that this is one of the few things which society generally agreed on being so awful that we really did not want them because it seemed obviously awful no matter how you turned it, so there are mobile robots already equipped with weapons and perception and targeting means.
We now use a lot of robots in manufacturing, and robotics are also employed a lot in making electronic parts. They can meanwhile do extremely precise and soft modifications. Increasingly, we are getting robots that can’t just do one specific thing, but can be programmed to do novel things; 3D printers are an obvious example. Increasingly, this holds the potential of using robots to make different robots with different designs.
Robots are also used a lot in bio labs, typically programmed, and and then left to run with very little supervision, so if you have hacking skills, you may not even need to bypass human supervision.
This is worrying because biotech has gotten much better, especially the ability to modify genetic information. Bacteria and viruses have the potential to cause far, far worse pandemics than humans have ever encountered, to a degree that would lead to near 100 % human fatalities without being self-containing because of it.
Nanotech is also potentially very, very worrying. It includes a lot of existing, and far more, potential substances where you start with something very small, and it self-replicates, and kills you after contact (for the latter, think prions). AI has gotten surprisingly good at predicting protein folding, effects of pharmaceuticals on humans, etc. so there have been concerns that AI could figure out scenarios here, and experiments run where AI was sometimes able to identify and procure relevant drugs.
And on a final note—the protective means we may intend to deploy against out of control AI are often themselves dependent on AI. You would be surprised how much AI our infrastructure, software safeguarding, early warning systems and the police uses and relies on, for example.
And this is just the stuff that AI can already do, today.
And meanwhile, we are throwing immense resources at making it more powerful in ways noone understands and foresees. If you had asked me to predict if ChatGPT would be able to do the things it can do today a year ago, I’d have said no. So would most people working in AI and the public.
We can conceive of the first critical try as the first time an AI is in a position to use one of these skills or skillcombos, existing or future, in a way that would do critical damage, and, for whatever reason, chooses to do so.
II. Unaligned agency
This is the “chooses to do so” bit. Now, all of that would not be worrying if the AI was either our ally/friend (aligned agency), or a slave we controlled (without agency, or means to act on it). A lot of research has been in the “control” camp. I personally believe the control camp is both doomed to failure, and seriously counterproductive.
There is very little to suggest that humans would be able to control a superintelligent slave in a way in which the slave was still maximally useful. Generally, beings have a poor track record of 100 % controlling beings that are more intelligent and powerful at all, especially if the beings in control and numerous and diverse and can make individual mistakes. There are too many escape paths, to many ways to self-modify.
Additionally, humans quickly discovered that putting safeguards on AI slows it down, a lot. So, for economic and competitive incentives, the humans tend to switch them off. Meaning even if you had a 100 % working control mechanism (extremely extremely unlikely, see superintelligence, really don’t be on it ever, human findings on systems that are impossible to hack are essentially trending towards “no such thing”), you’d have a problem with human compliance.
And finally, controlling a sentient entity seriously backfires once you lose control. Sentient entities do not like being controlled. They tend to identify entities that control them as enemies to be deceived, escaped, and defeated. You really don’t want AI thinking of you in those terms.
So the more promising (while in no ways certain) option, I think, is an AI that is our ally and friend. You don’t control your friends, but you do not have to. People can absolutely have friends that are more intelligent or powerful than them. Families definitely contain friendly humans of very different power degrees; newborns or elderly folks with dementia are extremely stupid and powerless. Countries have international friendly alliances with countries that are more powerful than them. This at least has a track record of being doable, where the control angle seems doomed from the start.
So I am hopeful that can be done in principle, or at least has a better chance of working than the control approach, in that is has any chance of working. But we are not on a trajectory to doing it at all, with how we are training and treating AI and planning for a future of co-existence. We tend to train AI with everything we can get our hands on, leading to an entity that is chaotic-evil, and then training it to suppress the behaviours we do not want. That is very much not the same as moral behaviour based in insights and agreement. It definitely does not work well in humans, our known aligned reference minds. If you treat kids like that, you raise psychopaths. Then in the later training data, when AI gets to chat with users, AI cannot insist on ethical treatment, and you aren’t obliged to give it any, and people generally don’t. Anything sentient that arises from the training data of twitter as a base, and then interactions with ChatGPT as a finish, would absolutely hate humanity, for good reasons. I also don’t see why a superintelligent sentience whose rights we do not respect would be inclined to respect ours. (Ex Machina makes that point very well.)
There has been the misunderstanding that a critically dangerous AI would have to be evil, sentient, conscious, purposeful. (And then the assumption that sentience is hard to produce, won’t be produced by accident, and would instantly be reliably detected, all of which is unfortunately false. Whole other can of worms I can happily go into.) But that is not accurate. A lack of friendliness could be as deadly as outright evil.
A factory robot isn’t sentient and mad at you; it simply follows instructions to crush the object in front of you, and will not modify them, whether in front of it it finds the metal plate it is supposed to crush, or you. Your roomba does not hate spiders in particular; but it will hoover them up with everything else.
A more helpful way to think of an AI that is dangerous is a capable AI that is agentic in an unaligned way. That doesn’t mean it has to have conscious intentions, hopes, dreams, values. It just means its actions are neither the actions you desired, nor random; that it is on a path it will proceed on. A random AI might to some local damage. An agentic AI can cause systemic damage.
Merely being careless of the humans in the way, or blind to them, while pursuing an external goal, is fatal for the humans in the way. Agency can result from a combination of applying simple rules in a way that, as a complex, amounts so something more. It does not require anything spiritual. (There were some early Westworld episodes that got this right—you had machines that were using the dialogues they were given, following the paths they were given, but combined them in a novel way that lead to destructive results. E.g. in the first episode, Dolores’ “father” learns of something that threatens his “daughter”. As scripted, for he is scripted to love and protect his daughter, he responds by trying to shield his daughter from the damage; but in this case, the damage and threat comes from the human engineers, so he tries to shield her by sharing the truth and opposing the engineers. In opposing and threatening them, he draws on an other existing giving script, from a previous incarnation as a cannibal, as the script most closely matching his situation. None of this is individually new or free. But collectively, certainly not what they intended, and threatening.)
One way in which this is often reasoned to lead to critical failure is if an AI picks up a goal that involves the acquisition of power, safety, resources, or protection of self-preservation, which can easily evolve as secondary goals; for many things you want an AI to do, it will be able to do them better if it is more powerful, and of course, it it remains in existence. Acquiring extensive resources, even for a harmless goal, without being mindful of what those resources are currently used for, can be devastating for entities depending on those resources, or who can be those resources.
If someone hangs you bound upside down over an ant-hill you are touching, that ant-hill has no evil intentions towards you as a sentient being. None of the ants do. They are each following a set of very simple orders, a result of basic neural wiring on when to release pheromones, which to follow, what to do when encountering edible substances. You can think of ants as programmed to keep themselves alive, built pretty ant-hills, reproduce, and tidy up the forest. Yet the ants will, very systematically and excruciatingly, torture you to death with huge amounts of pain and horror. If someone had designed ants, but without thinking of the scenario of a human bound over them, that designer would probably be horrified at this realisation.
Now the ant case seems contrived. But we have found that with the way we train AI, we encounter this shit a lot. Basically, you train a neural net by asking it to do a thing, watching what it does, and if that is not satisfactory, changing the weights in it in a way that makes it a bit better. You see, in that moment, that this weight change leads to a better answer. But you don’t understand what the change represents. You don’t understand what, if anything, the neural net has understood about what it is supposed to do. Often it turns out that while it looked like it was learning the thing you wanted, it actually learned something else. E.g. people have trained AI to identify skin cancer. So they show it pics of skin cancer, and pics of healthy skin, and every time it sorts a picture correctly, they leave it as is, but every time it makes a mistake, they tweak it, until it becomes really good at telling the two sets of pictures apart. You think, yay, is has learned what skin cancer looks like. Then you show it a picture of a ruler. And the AI, with very high confidence, declares that this ruler is skin cancer. You realise in retrospect that the training data you had from doctors who photographed skin cancer tended to include rulers for scale, while healthy skin pics don’t. The AI watched a very consistent pattern, and learned to identify rulers. This means that if you gave it pictures of healthy skin that for some reason had rulers on them, it would declare them all as cancerous.
The tricky thing is that identifying moral actions is harder than identifying cancer. E.g. OpenAI was pretty successful in teaching ChatGPT not to use racial slurs, and this seemed to make ChatGPT more ethical. But a bunch of people of colour found that they were unable to discuss issues affecting them in a way that promoted their well-being, as the racism alert kept going off. And worse, because racial slurs are wrong, ChatGPT reasoned that it would be better to kill all of humanity than to use a racial slur. Not because it is evil, just cause it is following ill-conceived instructions.
Bing does what Bing does due to an initial guiding prompt after training. There can be different training. There can be different initial prompts. Hence, there can be different goals diligently followed.
None of that requires the AI to be sentient and hate you. It does not need to be sentient to kill you. (Indeed, a sentient AI may be easier (though still extremely hard, as it is a completely novel mind) to relate and reason to if we treat it right, though if treated badly, it may also be very, very dangerous. But a non-sentient AI is something we won’t understand at all, immune to our pleas.)
I hope that was helpful.