Thanks for this; it’s awesome! I’m hopeful that in the next few years there will be a collection of stories like this.
This is a story where the alignment problem is somewhat harder than I expect, society handles AI more competently than I expect, and the outcome is worse than I expect. It also involves inner alignment turning out to be a surprisingly small problem. Maybe the story is 10-20th percentile on each of those axes.
I’m a bit surprised that the outcome is worse than you expect, considering that this scenario is “easy mode” for societal competence and inner alignment, which seem to me to be very important parts of the overall problem. Am I right to infer that you think outer alignment is the bulk of the alignment problem, more difficult than inner alignment and societal competence?
Some other threads to pull on:
--In this story, there aren’t any major actual wars, just simulated wars / war games. Right? Why is that? I look at the historical base rate of wars, and my intuitive model adds to that by saying that during times of rapid technological change it’s more likely that various factions will get various advantages (or even just think they have advantages) that make them want to try something risky. OTOH we haven’t had major war for seventy years, and maybe that’s because of nukes + other factors, and maybe nukes + other factors will still persist through the period of takeoff? IDK, I worry that the reasons why we haven’t had war for seventy years may be largely luck / observer selection effects, and also separately even if that’s wrong, I worry that the reasons won’t persist through takeoff (e.g. some factions may develop ways to shoot down ICBMs, or prevent their launch in the first place, or may not care so much if there is nuclear winter).
--Relatedly, in this story the AIs seem to be mostly on the same team? What do you think is going on “under the hood” so to speak: Have they all coordinated (perhaps without even causally communicating) to cut the humans out of control of the future? Why aren’t they fighting each other as well as the humans? Or maybe they do fight each other but you didn’t focus on that aspect of the story because it’s less relevant to us?
--Yeah, society will very likely not be that competent IMO. I think that’s the biggest implausibility of this story so far.
--(Perhaps relatedly) I feel like when takeoff is that distributed, there will be at least some people/factions who create agenty AI systems that aren’t even as superficially aligned as the unaligned benchmark. They won’t even be trying to make things look good according to human judgment, much less augmented human judgment! For example, some AI scientists today seem to think that all we need to do is make our AI curious and then everything will work out fine. Others seem to think that it’s right and proper for humans to be killed and replaced by machines. Others will try strategies even more naive than the unaligned benchmark, such as putting their AI through some “ethics training” dataset, or warning their AI “If you try anything I’ll unplug you.” (I’m optimistic that these particular failure modes will have been mostly prevented via awareness-raising before takeoff, but I do a pessimistic meta-induction and infer there will be other failure modes that are not prevented in time.)
--Can you say more about how “the failure modes in this story are an important input into treachery?”
I’m a bit surprised that the outcome is worse than you expect, considering that this scenario is “easy mode” for societal competence and inner alignment, which seem to me to be very important parts of the overall problem.
The main way it’s worse than I expect is that I expect future people to have a long (subjective) time to work on these problems, and so to make much more progress than they do in this story.
Am I right to infer that you think outer alignment is the bulk of the alignment problem, more difficult than inner alignment and societal competence?
I don’t think it’s right to infer much about my stance on inner vs outer alignment. I don’t know if it makes sense to split out “societal competence” in this way.
In this story, there aren’t any major actual wars, just simulated wars / war games. Right? Why is that? I look at the historical base rate of wars, and my intuitive model adds to that by saying that during times of rapid technological change it’s more likely that various factions will get various advantages (or even just think they have advantages) that make them want to try something risky. OTOH we haven’t had major war for seventy years, and maybe that’s because of nukes + other factors, and maybe nukes + other factors will still persist through the period of takeoff?
The lack of a hot war in this story is mostly an extrapolation of the recent trend. There may be a hot war before things heat up, and the “takeoff” part of the story is subjectively shorter than the last 70 years.
IDK, I worry that the reasons why we haven’t had war for seventy years may be largely luck / observer selection effects, and also separately even if that’s wrong
I’m extremely skeptical of an appeal to observer selection effects changing the bottom line about what we should infer from the last 70 years. Luck sounds fine though.
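For concreteness, a minimal sketch of the straightforward (non-anthropic) inference, assuming a flat prior over the annual probability of a major war and treating the last 70 years as independent war-free draws (both purely illustrative choices):

```python
# Minimal illustrative sketch: Bayesian update on the annual probability p of a
# major war, given 70 consecutive war-free years. Flat prior and independence
# across years are simplifying assumptions, not claims about the right model.
import numpy as np

p = np.linspace(0.001, 0.5, 1000)      # candidate annual probabilities of a major war
prior = np.ones_like(p) / len(p)       # flat prior over the grid
likelihood = (1 - p) ** 70             # chance of 70 consecutive war-free years at rate p
posterior = prior * likelihood
posterior /= posterior.sum()

print("posterior mean annual war probability:", float(np.sum(p * posterior)))
print("P(annual probability > 5% | 70 war-free years):", float(posterior[p > 0.05].sum()))
```

The observer-selection worry is that surviving observers shouldn’t take that likelihood at face value; as I say above, I’m skeptical that this changes the bottom line.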
Relatedly, in this story the AIs seem to be mostly on the same team? What do you think is going on “under the hood” so to speak: Have they all coordinated (perhaps without even causally communicating) to cut the humans out of control of the future?
I don’t think the AI systems are all on the same team. That said, to the extent that there are “humans are deluded” outcomes that are generally preferable according to many AIs’ values, I think the AIs will tend to bring about such outcomes. I don’t have a strong view on whether that involves explicit coordination. I do think the range of everyone-wins outcomes (amongst AIs) is larger because of the “AIs generalize ‘correctly’” assumption, so this story probably feels a bit more like “us vs them” than a story that relaxed that assumption.
Why aren’t they fighting each other as well as the humans? Or maybe they do fight each other but you didn’t focus on that aspect of the story because it’s less relevant to us?
I think they are fighting each other all the time, though mostly in very prosaic ways (e.g. McDonald’s and Burger King’s marketing AIs are directly competing for customers). Are there some particular conflicts you imagine that are suppressed in the story?
I feel like when takeoff is that distributed, there will be at least some people/factions who create agenty AI systems that aren’t even as superficially aligned as the unaligned benchmark. They won’t even be trying to make things look good according to human judgment, much less augmented human judgment!
I’m imagining that’s the case in this story.
Failure is early enough in this story that e.g. the humans’ investment in sensor networks and rare expensive audits isn’t slowing them down very much compared to the “rogue” AI.
Such “rogue” AI could provide a competitive pressure, but I think it’s a minority of the competitive pressure overall (and at any rate it has the same role/effect as the other competitive pressure described in this story).
Can you say more about how “the failure modes in this story are an important input into treachery?”
We will be deploying many systems to anticipate/prevent treachery. If we could stay “in the loop” in the sense that would be needed to survive this outer alignment story, then I think we would also be “in the loop” in roughly the sense needed to avoid treachery. (Though it’s not obvious in light of the possibility of civilization-wide cascading ML failures, and does depend on further technical questions about techniques for avoiding that kind of catastrophe.)
I think they are fighting each other all the time, though mostly in very prosaic ways (e.g. McDonald’s and Burger King’s marketing AIs are directly competing for customers). Are there some particular conflicts you imagine that are suppressed in the story?
I think the one that stands out the most is ‘why isn’t it possible for some security/inspector AIs to get a ton of marginal reward by whistleblowing against the efforts required for a flawless global camera grab?’ I understand the scenario says it isn’t because the demonstrations are incomprehensible, but why/how?
I understand the scenario says it isn’t because the demonstrations are incomprehensible
Yes, if demonstrations are comprehensible then I don’t think you need much explicit AI conflict to whistleblow since we will train some systems to explain risks to us.
why/how?
The global camera grab must involve plans that aren’t clearly bad to humans even when all the potential gotchas are pointed out. For example, they may involve dynamics that humans just don’t understand, or where a brute-force simulation or experiment would be prohibitively expensive without leaps of intuition that machines can make but humans cannot. Maybe that’s about tiny machines behaving in complicated ways or being created covertly, or crazy complicated dynamics of interacting computer systems that humans can’t figure out. It might involve the construction of new AI-designed AI systems which operate in different ways and whose function we can’t really constrain except by seeing predictions of their behavior from an even greater distance (machines which are predicted to lead to good-looking outcomes, which have been able to exhibit failures to us if so incentivized, but which are even harder to control).
(There is obviously a lot you could say about all the tools at the humans’ disposal to circumvent this kind of problem.)
This is one of the big ways in which the story is more pessimistic than my default, and perhaps the highlighted assumptions rule out the most plausible failures, especially (i) multi-year takeoff, (ii) reasonable competence on the part of civilization, and (iii) “correct” generalization.
Even under those assumptions I do expect events to eventually become incomprehensible in the necessary ways, but it feels more likely that there will be enough intervening time for ML systems to e.g. solve alignment or help us shift to a new world order or whatever. (And as I mention, in the worlds where the ML systems can’t solve alignment well enough in the intervening time, I do agree that it’s unlikely we can solve it in advance.)