Some quick notes:

It seems worth noting that there is still an “improve institutions” vs. “improve capabilities” race going on in frame 3. (Though if you think institutions are exogenously getting better/worse over time, this effect could dominate. And perhaps you think that framing things as a race/conflict is generally not very useful, which I’m sympathetic to, but this isn’t really a difference in objective.)
Many people agree that very good epistemics combined with good institutions would likely suffice to mostly handle risks from powerful AI. However, sufficiently good technical solutions to some key problems could also mitigate some of these problems. Thus, either sufficiently good institutions/epistemics or sufficiently good technical solutions could solve many problems, and improvements in both seem to help on the margin. But there remains a question of which type of work is more leveraged for a given person on the margin.
Insofar as you’re trying to make an object-level argument about what people should work on, you should consider separating that out into a post claiming “people should do XYZ; this is more leveraged than ABC on current margins under these values”.
I think the probability of “prior to total human obsolescence, AIs will be seriously misaligned, broadly strategic about achieving long run goals in ways that lead to scheming, and present a basically unified front (at least in the context of AIs within a single AI lab)” is “only” about 10-20%, but this is plausibly the cause of about half of misalignment-related risk prior to human obsolescence.
Rogue AIs are quite likely to at least attempt to ally with humans, and opposing human groups will indeed try to make some use of AI. So the situation might look like “rogue AIs + humans” vs. “AIs + humans”. But I think there are good reasons to think that the non-rogue AIs will still be misaligned and might be ambivalent about which side they prefer.
I do think there are pretty good reasons to expect humans vs. AIs, though not super strong reasons.
While there aren’t super strong reasons to expect humans vs. AIs, I think conservative assumptions here can be workable, and this is at least pretty plausible (see the probability above). I expect many conservative interventions to generalize well to more optimistic cases.

I think we should pay the AIs. The exact proposal here is a bit complicated, but one part of it looks like committing to doing a massive audit of the AI after technology progresses considerably, and then paying AIs to the extent they didn’t try to screw us over. We should also try to communicate with AIs, understand their preferences, and then work out a mutually agreeable deal in the short term.
I think the probability of “prior to total human obsolescence, AIs will be seriously misaligned, broadly strategic about achieving long run goals in ways that lead to scheming, and present a basically unified front (at least in the context of AIs within a single AI lab)” is “only” about 10-20%, but this is plausibly the cause of about half of misalignment-related risk prior to human obsolescence.
I’d want to break this claim apart into pieces. Here’s a somewhat sketchy and wildly non-robust evaluation of how I’d rate these claims:
Assuming the claims are about the most powerful AIs in the world...
“prior to total human obsolescence...”
“AIs will be seriously misaligned”
If “seriously misaligned” means “reliably takes actions intended to cause the ruin and destruction of the world in the near-to-medium term (from our perspective)”, I’d rate this as maybe 5% likely
If “seriously misaligned” means “if given full control over the entire world along with godlike abilities, would result in the destruction of most things I care about due to extremal Goodhart and similar things”, I’d rate this as 50% likely
“broadly strategic about achieving long run goals in ways that lead to scheming”
I’d rate this as 65% likely
“present a basically unified front (at least in the context of AIs within a single AI lab)”
For most powerful AIs, I’d rate this as 15% likely
For most powerful AIs within the top AI lab I’d rate this as 25% likely
Conjunction of all these claims:
Taking the conjunction of the strong interpretation of every claim: 3% likely?
Taking a relatively charitable weaker interpretation of every claim: 20% likely
It’s plausible we don’t disagree much about the main claims here and mainly disagree instead about:
The relative value of working on technical misalignment compared to other issues
The likelihood of non-misalignment problems relative to misalignment problems
The amount of risk we should be willing to tolerate during the deployment of AIs
Are you conditioning on the prior claims when stating your probabilities? Many of these properties are highly correlated. E.g., “seriously misaligned” and “broadly strategic about achieving long run goals in ways that lead to scheming” seem very correlated to me. (Your probabilities seem higher than I would have expected without any correlation, but I’m unsure.)
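To make the arithmetic concrete, here is a rough back-of-the-envelope sketch that simply multiplies the point estimates above as if the three sub-claims were independent (the independence assumption is only a baseline for comparison, not a claim about how the conjunctions were actually computed):

$$0.05 \times 0.65 \times 0.15 \approx 0.005 \quad \text{(strong readings)}$$

$$0.50 \times 0.65 \times 0.25 \approx 0.08 \quad \text{(weaker readings)}$$

The stated conjunctions of roughly 3% and 20% sit several times above these independence baselines, which is the sense in which the probabilities look like they already build in substantial correlation.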
I think we probably disagree about the risk due to misalignment by something like a factor of 2-4. But probably more of the crux is the value of working on other problems.
I’m not conditioning on prior claims.

One potential reason why you might have inferred that I was is that my credence for scheming is so high, relative to what you might have thought given my other claim about “serious misalignment”. My explanation here is that I tend to interpret “AI scheming” as a relatively benign behavior, in context. If we define scheming as:
behavior intended to achieve some long-term objective that is not quite what the designers had in mind
not being fully honest with the designers about its true long-term objectives (especially in the sense of describing accurately what it would do with unlimited power)
then I think scheming is ubiquitous and usually relatively benign, when performed by rational agents without godlike powers. For example, humans likely “scheme” all the time by (1) pursuing long-term plans, and (2) not being fully honest to others about what they would do if they became god. This is usually not a big issue because agents don’t generally get the chance to take over the world and do a treacherous turn; instead, they have to play the game of compromise and trade like the rest of us, along with all the other scheming AIs, who have different long-term goals.
Rogue AIs are quite likely to at least attempt to ally with humans, and opposing human groups will indeed try to make some use of AI. So the situation might look like “rogue AIs + humans” vs. “AIs + humans”. But I think there are good reasons to think that the non-rogue AIs will still be misaligned and might be ambivalent about which side they prefer.
I think if there’s a future conflict between AIs, with humans split between sides of the conflict, it just doesn’t make sense to talk about “misalignment” being the main cause for concern here. AIs are just additional agents in the world, who have separate values from each other just as humans (and human groups) have separate values from each other. AIs might have on-average cognitive advantages over humans in such a world, but the tribal frame of thinking “us (aligned) vs. AIs (misaligned)” simply falls apart in such scenarios.
(This is all with the caveat that AIs could make war more likely for reasons other than misalignment, for example by accelerating technological progress and bringing about the creation of powerful weapons.)
Sure, but I might think a given situation would be nearly entirely resolved without misalignment. (Edit: without technical issues with misalignment, e.g. if AI creators could trivially avoid serious misalignment.)
E.g., if an AI escapes from OpenAI’s servers and then allies with North Korea, the situation would have been avoided if not for misalignment issues.
You could also solve or mitigate this type of problem in the example by resolving all human conflicts (so the AI doesn’t have a group to ally with), but this might be quite a bit harder than solving technical problems related to misalignment (either via control-type approaches or removing misalignment).
What do you mean by “misalignment”? In a regime with autonomous AI agents, I usually understand “misalignment” to mean “has different values from some other agent”. In this frame, you can be misaligned with some people but not others. If an AI is aligned with North Korea, then it’s not really “misaligned” in the abstract—it’s just aligned with someone who we don’t want it to be aligned with. Likewise, if OpenAI develops AI that’s aligned with the United States, but unaligned with North Korea, this mostly just seems like the same problem but in reverse.
In general, conflicts don’t really seem well-described as issues of “misalignment”. Sure, in the absence of all misalignment, wars would probably not occur (though they may still happen due to misunderstandings and empirical disagreements). But for the most part, wars seem better described as arising from a breakdown of institutions that are normally tasked with keeping the peace. You can have a system of lawful yet mutually-misaligned agents who keep the peace, just as you can have an anarchic system with mutually-misaligned agents in a state of constant war. Misalignment just (mostly) doesn’t seem to be the thing causing the issue here.
You could also solve or mitigate the problem by resolving all human conflicts (so the AI doesn’t have a group to ally with)
Note that I’m not saying
AIs will aid in existing human conflicts, picking sides along the ordinary lines we see today
I am saying:
AIs will likely have conflicts amongst themselves, just as humans have conflicts amongst themselves, and future conflicts (when considering all of society) don’t seem particularly likely to be AI vs. human, as opposed to AI vs. AI (with humans split between these groups).
Yep, I was just referring to my example scenario and scenarios like it.
The basic question is the extent to which human groups form a cartel/monopoly on human labor vs. ally with different AI groups. (And existing conflict between human groups makes a full cartel much less likely.)
Sorry, by “without misalignment” I mean “without misalignment-related technical problems”. As in, it’s trivial to avoid misalignment from the perspective of AI creators.
This doesn’t clear up the confusion for me. That mostly pushes my question to “what are misalignment-related technical problems?” Is the problem of an AI escaping a server and aligning with North Korea a technical or a political problem? How could we tell? Is this still in the regime where we are using AIs as tools, or are you talking about a regime where AIs are autonomous agents?
I mean, it could be resolved in principle by technical means and might be resolvable by political means as well. I’m assuming the AI creator didn’t want the AI to escape to North Korea and therefore failed at some technical solution to this.
I’m imagining very powerful AIs, e.g. AIs that can speed up R&D by large factors. These are probably running autonomously, but in a way which is de jure controlled by the AI lab.