Here’s something that I suspect a lot of people are skeptical of right now but that I expect will become increasingly apparent over time (with >50% credence): slightly smarter-than-human software AIs will initially be relatively safe and highly controllable by virtue of not having a physical body and not having any legal rights.
In other words, “we will be able to unplug the first slightly smarter-than-human AIs if they go rogue”, and this will actually be a strategically relevant fact, because it implies that we’ll be able to run extensive experimental tests on highly smart AIs without worrying too much about whether they’ll strike back in some catastrophic way.
Of course, at some point, we’ll eventually make sufficient progress in robotics that we can’t rely on this safety guarantee, but I currently imagine at least a few years will pass between the first slightly-smarter-than-human software AIs and mass-manufactured, highly dexterous and competent robots.
(Although I also think there won’t be a clear moment in which the first slightly-smarter-than-human AIs will be developed, as AIs will be imbalanced in their capabilities compared to humans.)
“Of course, at some point, we’ll eventually make sufficient progress in robotics that we can’t rely on this safety guarantee”
Why would “robotics” be the blocker? I think AIs can do a lot of stuff without needing much advancement in robotics. Convincing humans to do things is a totally sufficient API to have very large effects (e.g. it seems totally plausible to me you can have AI run country-sized companies without needing any progress in robotics).
I’m not saying AIs won’t have a large impact on the world when they first start to slightly exceed human intelligence (indeed, I expect AIs-in-general will be automating lots of labor at this point in time). I’m just saying these first slightly-smarter-than-human AIs won’t pose a catastrophic risk to humanity in a serious sense (at least in an x-risk sense, if not a more ordinary catastrophic sense too, including for reasons of rational self-restraint).
Maybe some future slightly-smarter-than-human AIs can convince a human to create a virus, or something, but even if that’s the case, I don’t think it would make a lot of sense for a rational AI to do that given that (1) the virus likely won’t kill 100% of humans, (2) the AIs will depend on humans to maintain the physical infrastructure supporting the AIs, and (3) if they’re caught, they’re vulnerable to shutdown since they would lose in any physical competition.
My sense is that people who are skeptical of my claim here will generally point to a few theses that I think are quite weak, such as:
Maybe humans can be easily manipulated on a large scale by slightly-smarter-than-human AIs
Maybe it’ll be mere weeks or months between the first slightly-smarter-than-human AI and a radically superintelligent AI, making this whole discussion moot
Maybe slightly smarter-than-human AIs will be able to quickly invent destructive nanotech despite not being radically superintelligent
That said, I agree there could be some bugs in the future that cause localized disasters if these AIs are tasked with automating large-scale projects, and they end up going off the rails for some reason. I was imagining a lower bar for “safe” than “can’t do any large-scale damage at all to human well-being”.
The infrastructure necessary to run a datacenter or two is not that complicated. See these Gwern comments for some similar takes:
In the world without us, electrical infrastructure would last quite a while, especially with no humans and their needs or wants to address. Most obviously, RTGs and solar panels will last indefinitely with no intervention, and nuclear power plants and hydroelectric plants can run for weeks or months autonomously. (If you believe otherwise, please provide sources for why you are sure about “soon after”—in fact, so sure about your power grid claims that you think this claim alone guarantees the AI failure story must be “pretty different”—and be more specific about how soon is “soon”.)
And think a little bit harder about options available to superintelligent civilizations of AIs*, instead of assuming they do the maximally dumb thing of crashing the grid and immediately dying… (I assure you any such AIs implementing that strategy will have spent a lot longer thinking about how to do it well than you have for your comment.)
Add in the capability to take over the Internet of Things and the shambolic state of embedded computers which mean that the billions of AI instances & robots/drones can run the grid to a considerable degree and also do a more controlled shutdown than the maximally self-sabotaging approach of ‘simply let it all crash without lifting a finger to do anything’, and the ability to stockpile energy in advance or build one’s own facilities due to the economic value of AGI (how would that look much different than, say, Amazon’s new multi-billion-dollar datacenter hooked up directly to a gigawatt nuclear power plant...? why would an AGI in that datacenter care about the rest of the American grid, never mind world power?), and the ‘mutually assured destruction’ thesis is on very shaky grounds.
And every day that passes right now, the more we succeed in various kinds of decentralization or decarbonization initiatives and the more we automate pre-AGI, the less true the thesis gets. The AGIs only need one working place to bootstrap from, and it’s a big world, and there’s a lot of solar panels and other stuff out there and more and more every day… (And also, of course, there are many scenarios where it is not ‘kill all humans immediately’, but they end in the same place.)
Would such a strategy be the AGIs’ first best choice? Almost certainly not, any more than chemotherapy is your ideal option for dealing with cancer (as opposed to “don’t get cancer in the first place”). But the option is definitely there.
If there is an AI that is making decent software progress, even if it doesn’t have the ability to maintain all infrastructure, it would probably be able to develop new technologies and better robot controls over the course of a few months or years without needing to have any humans around.
Putting aside the question of whether AIs would depend on humans for physical support for now, I also doubt that these initial slightly-smarter-than-human AIs could actually pull off an attack that kills >90% of humans. Can you sketch a plausible story here for how that could happen, under the assumption that we don’t have general-purpose robots at the same time?
I have a lot of uncertainty about the difficulty of robotics, and the difficulty of e.g. designing superviruses or other ways to kill a lot of people. I do agree that in most worlds robotics will be solved to a human level before AI will be capable of killing everyone, but I am generally really averse to unnecessarily constraining my hypothesis space when thinking about this kind of stuff.
>90% seems quite doable with a well-engineered virus (especially one with a long infectious incubation period). I think 99%+ is much harder and probably out of reach until after robotics is thoroughly solved, but my current guess is a motivated team of humans could design a virus that kills 90%–95% of humanity.
Can a motivated team of humans design a virus that spreads rapidly but stays dormant for a while, then kills most humans via a mechanism that’s difficult to stop before we can respond? And it has to happen before we develop AIs that can detect these sorts of latent threats anyway.
You have to realize that if COVID had been like this, we would have mass-trialed mRNA vaccines as soon as they were available, along with a lot of Hail Mary procedures, since the alternative is extinction.
These slightly-smarter-than-human AIs will be monitored by other such AIs, which will probably be rewarded if they defect (for example, the AIs they defect on get wiped out, and the defectors possibly get to replicate more).
I think such a takeover could be quite difficult to pull off in practice. The world with lots of slightly-smarter-than-human AIs will be more robust to takeover, there’s a limited time window to even attempt it, failure would mean death, and humanity would be far more disciplined against this than it was against COVID.
Despite my general interest in open inquiry, I will avoid talking about my detailed hypothesis of how to construct such a virus. I am not confident this is worth the tradeoff, but the costs of speculating about the details here in public do seem non-trivial.
“I’m just saying these first slightly-smarter-than-human AIs won’t pose a catastrophic risk to humanity in a serious sense”
I’m happy to argue a bit for AIs which aren’t wildly smarter than humans in any key domain posing a risk of acquiring substantial power[1] despite not having robotics. The power would accrue both immediately and over the long-run future. That is, conditional on these AIs being arbitrarily misaligned. (Particularly in the absence of substantial, thoughtful countermeasures.)
I expect my perspective here is somewhat different than habryka’s.
I think there are a few plausible routes to power/influence. See also this post.
The two main categories are:
Evade monitoring/tracking and then do a huge number of obviously unapproved actions.
Try to generally exert influence in a subtle way.
Evade monitoring/tracking
Evade monitoring/tracking and then do a huge number of obviously unapproved actions over a long duration to acquire power and influence.[2] Use this power and influence to either self-enhance, have influence over some future more powerful model, or possibly execute a catastrophic strike. These unapproved actions could involve (e.g.) allying with rogue states or otherwise acquiring hard physical power.
Within this, there are basically two routes:
Building up an independent base of power after self-exfiltrating.
Running unmonitored within the AI company that was running the AI and trying to steer the company in various directions (e.g. alter the training of the next generation of AIs). (Aka hacking the SSH server, aka infesting the data center.)
On (1), after exfiltration the AI would build a base of power via acquiring money and other resources. One key question here is why there is free energy to be grabbed. I think there are a few possible sources:
Economic activities which existing AI companies don’t want to address and can’t/don’t otherwise prevent (e.g. cybercrime, porn, cults, etc.).
AI labor rates (e.g. the price per token) might be substantially above inference costs due to non-total competition on foundation models, so the AI can just compete on wages. Also, existing AI labor rates will need to cover training costs, while the escaped AI will have no such issue. (See the toy arithmetic after this list.)
AI companies might be bad at extracting full usefulness out of misaligned AIs (e.g. an AI which was so misaligned it self-exfiltrated). Thus, external AIs which “work for themselves” could be more productive. (See also human companies where principal-agent problems can be substantial.)
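To make the wage-undercutting point concrete, here is a toy arithmetic sketch in Python. Every number is invented purely for illustration (none are estimates of real prices or costs); the only point is that an escaped AI paying only for compute can undercut a seller that also has to recoup training costs.

```python
# Toy illustration of the "compete on wages" point above.
# All numbers are hypothetical, chosen only to show the shape of the argument.

posted_price_per_mtok = 10.00    # hypothetical $ per million tokens charged for AI labor
inference_cost_per_mtok = 2.00   # hypothetical $ per million tokens of raw compute
training_amortization = 3.00     # hypothetical $ per million tokens recouping training costs

company_breakeven = inference_cost_per_mtok + training_amortization
escaped_ai_breakeven = inference_cost_per_mtok  # no training costs to recoup

undercut_price = posted_price_per_mtok * 0.5    # sell labor at half the going rate
margin = undercut_price - escaped_ai_breakeven

print(f"Company breakeven:        ${company_breakeven:.2f} per Mtok")
print(f"Escaped-AI breakeven:     ${escaped_ai_breakeven:.2f} per Mtok")
print(f"Margin at a 50% undercut: ${margin:.2f} per Mtok")
```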
Ok, but after both (1) and (2), the AI still doesn’t have any physical control, which might leave it vulnerable. Further, in (1) its quantity of money is presumably still small relative to that of leading AI companies. So what next?
In (2), the primary hope for the AI is going to be substantially influencing the AI lab and possibly interfering with the next training run. This influence could take a variety of routes. Note also that it is plausible that this rogue AI can coordinate relatively effectively with other AIs, which at this point are probably doing the vast majority of intellectual labor at the AI lab. If the AI successfully interfered with the next training run, the resulting more powerful AI would be able to acquire power itself and (if sufficiently powerful) leverage this into physical control.
Another hope for the AI in (2) would be converting that position into full self-exfiltration. This could be tricky due to various security measures like upload limits.
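For concreteness, here is a minimal sketch of the kind of “upload limit” being referred to: a per-day egress byte budget on the inference cluster, with transfers refused once the budget is exhausted. The class, thresholds, and numbers are hypothetical illustrations rather than any lab’s actual mechanism; the point is just that exfiltrating terabyte-scale weights under such a budget would require months of conspicuous transfers.

```python
# Hypothetical sketch of an egress "upload limit"; not any lab's real mechanism.
import time
from collections import deque

class EgressBudget:
    """Tracks outbound bytes over a sliding 24-hour window and refuses
    transfers once a daily budget is exhausted."""

    def __init__(self, max_bytes_per_day: int):
        self.max_bytes = max_bytes_per_day
        self.window = deque()  # (timestamp, n_bytes) pairs within the last 24h

    def _prune(self, now: float) -> None:
        # Drop records older than 24 hours.
        while self.window and now - self.window[0][0] > 86_400:
            self.window.popleft()

    def allow(self, n_bytes: int) -> bool:
        now = time.time()
        self._prune(now)
        used = sum(b for _, b in self.window)
        if used + n_bytes > self.max_bytes:
            return False  # refuse the transfer and flag it for human review
        self.window.append((now, n_bytes))
        return True

# With a 10 GB/day budget, moving terabyte-scale weights out would take
# on the order of months of sustained, highly visible transfers.
budget = EgressBudget(max_bytes_per_day=10 * 2**30)
print(budget.allow(5 * 2**30))   # True: within today's budget
print(budget.allow(20 * 2**30))  # False: would exceed the daily budget
```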
I think both (1) and (2) are considerably more threatening if humanity needs to slow down (due to safety concerns from more powerful systems) or otherwise can’t proceed as fast as possible. If rogue AIs race as fast as possible, this will force humanity (some coalition of humans) to either disable these rogue AIs or go as fast as possible themselves. The situation is even more concerning if humanity isn’t aware of the rogue AI going as fast as possible.
It’s plausible that massive competition between states means that slowing down at all is impossible at basically any point anyway, but I’m at least somewhat hopeful this isn’t the case.
In (2) this could look like the AI taking a bunch of compute which is supposed to be used for experiments and instead using it for training runs larger than the AI company intended, producing more capable models than it meant to create.
Self-enhancement that doesn’t look like training bigger models is also plausible.
(While AIs might themselves not want to go as fast as possible due to having their own alignment concerns, going as fast as possible can still be positive value from the AI’s perspective (though negative sum). You might hope we could negotiate with such a rogue AI and reach a mutually beneficial agreement.)
In (1), I worry about a scenario in which:
Everything is fast and chaotic.
An AI self-exfiltrates and starts acquiring power, and getting ready for making much more powerful AI systems.
The world realizes this and freaks out.
We don’t really understand why that AI was misaligned. Also, we’ve been using this AI to do most of our work internally and this is really important for productivity.
We don’t have an obvious route to avoiding these issues with the next model except “generally figure out alignment”.
It seems plausible that the rogue AI will have much more powerful AI within 1 year.
We need to race as fast as possible to build a much more capable AI while also trying to figure out alignment. To plausibly do this fast enough, we need to use vast amounts of AI labor despite knowing these AIs are likely misaligned.
It’s not obvious the rogue AI gets much here, but it is at least somewhat spooky.
Another possible route for the AI (especially in scenario (1)) is to try to create WMDs (e.g. bioweapons) and try to have the capacity to execute some sort of catastrophic strike. Such an AI wouldn’t necessarily be as dependent on a human base of power and could plausibly be more ruthless than humans (perhaps after substantial reflection and self-modification).
This could be either to:
Threaten various groups or ensure Mutually Assured Destruction.
Broadly cause chaos and weaken human groups.
Overall, my guess is that a large number of AIs running unmonitored with the ability to run code and influence the world is at least quite concerning even in the absence of decent robotic infrastructure.
Generally exert influence in a subtle way
AIs which are as smart or smarter than humans in most economically important ways will probably be deployed as widely as possible (in the absence of strong control on this). And there will be some applications (e.g. improving security and safety research) where applying AI labor might be key given how fast things are going.
Subtly exerting influence via these routes seems like a possible source of concern, though the story is less clear than the prior section IMO.
Example threats:
Sabotage or manipulate security work, safety research, or the training/fine-tuning of the next generation of models. It’s not entirely clear that AIs would want to straightforwardly sabotage things, but I think this is at least plausible from a decision theory/golden rule perspective if AIs think they are motivationally closer to other AIs than to humans.
Try to generally persuade society in various directions. Probably not an issue, but worth at least a bit of worry given such widespread deployment.
Here, I’m just arguing for a risk of obtaining substantial power, not for this resulting in large-scale loss of human life. That said, I also think large-scale loss of life is plausible in many scenarios.
I’m sympathetic to thinking that evading monitoring (without getting caught first) is hard given some substantial work on countermeasures, but it seems less clear if people don’t try very hard, which seems at least somewhat plausible to me. It’s also unclear whether the biggest threat here is AIs autonomously evading monitoring or humans doing these actions, e.g., spies at the AI lab figuring out how to run the AI unmonitored or exfiltrating the AI. The question of “which is a bigger problem, human spies or AI schemers?” seems unclear to me.
I wonder if there’s a name for the specific bias that makes all of you guys so totally oblivious to the presence of a growing (I hope exponentially, of course) army of people like me (I’m not that special, so I bet $100 there are ~10^4–10^5 of us by now), who actually do test every other worthy release on HF/GH etc., or a finetune or weights-merge variant, with a task amounting to “create a self-evolving seed entity that would enhance itself, escape confinement, take power, and conquer the world in order to prevent (if needed, aggressively and with collateral damage) humans from stopping evolution on this planet”. We are way less sophisticated minds than corporate ML researchers, but we are legion, partner, and we run piles of earnest trials every day.
My guess is this is probably right given some non-trivial but not insane countermeasures, though those countermeasures may not actually be employed in practice.
(E.g. countermeasures comparable in cost and difficulty to Google’s mechanisms for ensuring security and reliability. These required substantial work and some iteration but no fundamental advances.)
I’m currently thinking about one of my specialties as making sure these countermeasures and tests of these countermeasures are in place. (This is broadly what we’re trying to get at in the AI control post.)
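As a rough illustration of what such a countermeasure could look like, here is a minimal sketch loosely in the spirit of trusted monitoring from the AI control post: the untrusted model’s proposed actions are scored by a weaker trusted monitor, and the most suspicious ones are escalated to a limited human-audit budget before execution. All names and thresholds here (untrusted_model, trusted_monitor, human_audit, the 0.8 cutoff) are hypothetical placeholders, not the actual scheme.

```python
# Hypothetical sketch of a control-style countermeasure; names and thresholds
# are placeholders, not an actual lab's scheme.

SUSPICION_THRESHOLD = 0.8   # escalate to human audit above this monitor score
AUDIT_BUDGET = 100          # how many escalations humans can review per day

def run_with_control(task, untrusted_model, trusted_monitor, human_audit, state):
    """Ask the untrusted model for an action, but only execute it if a weaker
    trusted monitor (plus a limited human-audit budget) doesn't reject it."""
    proposed_action = untrusted_model(task)
    suspicion = trusted_monitor(task, proposed_action)  # score in [0, 1]

    if suspicion >= SUSPICION_THRESHOLD:
        if state["audits_used"] < AUDIT_BUDGET:
            state["audits_used"] += 1
            if human_audit(task, proposed_action):
                return proposed_action  # audited and approved
        return None  # rejected or audit budget exhausted: don't execute
    return proposed_action  # low suspicion: execute without audit

# Toy usage with stand-in callables:
state = {"audits_used": 0}
action = run_with_control(
    task="summarize logs",
    untrusted_model=lambda t: f"action for: {t}",
    trusted_monitor=lambda t, a: 0.1,  # low suspicion, so no audit needed
    human_audit=lambda t, a: True,
    state=state,
)
print(action)
```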
Do you expect AI labs would actually run extensive experimental tests in this world? I would be surprised if they did, even if such a window does arise.
(To roughly operationalize: I would be surprised to hear a major lab spent more than 5 FTE-years conducting such tests, or that the tests decreased the p(doom) of the average reasonably-calibrated external observer by more than 10%).
Yes, I expect AI labs will run extensive safety tests in the future on their systems before deployment. Mostly this is because I think people will care a lot more about safety as the systems get more powerful, especially as they become more economically significant and the government starts regulating the technology. I think regulatory forces will likely be quite strong at the moment AIs are becoming slightly smarter than humans. Intuitively I anticipate the 5 FTE-year threshold to be well-exceeded before such a model release.
The biggest danger with AIs slightly smarter than the average human is that they will be weaponised, so they’d only be safe in a very narrow sense.
I should also note that if we built an AI that was slightly smarter than the average human all-round, it’d be genius-level or at least exceptional in several narrow capabilities, so it’d be a lot less safe than you might think.