I’m a bit surprised that the outcome is worse than you expect, considering that this scenario is “easy mode” for societal competence and inner alignment, which seem to me to be very important parts of the overall problem.
The main way it’s worse than I expect is that I expect future people to have a long (subjective) time to solve these problems and to make much more progress than they do in this story.
Am I right to infer that you think outer alignment is the bulk of the alignment problem, more difficult than inner alignment and societal competence?
I don’t think it’s right to infer much about my stance on inner vs outer alignment. I don’t know if it makes sense to split out “social competence” in this way.
In this story, there aren’t any major actual wars, just simulated wars / war games. Right? Why is that? I look at the historical base rate of wars, and my intuitive model adds that during times of rapid technological change it’s more likely that various factions will gain various advantages (or even just think they have advantages) that make them want to try something risky. OTOH, we haven’t had a major war for seventy years, and maybe that’s because of nukes + other factors, and maybe nukes + other factors will persist through the period of takeoff?
The lack of a hot war in this story is mostly from the recent trend. There may be a hot war prior to things heating up, and then the “takeoff” part of the story is subjectively shorter than the last 70 years.
IDK, I worry that the reasons why we haven’t had war for seventy years may be largely luck / observer selection effects, and also separately even if that’s wrong
I’m extremely skeptical of an appeal to observer selection effects changing the bottom line about what we should infer from the last 70 years. Luck sounds fine though.
Relatedly, in this story the AIs seem to be mostly on the same team? What do you think is going on “under the hood” so to speak: Have they all coordinated (perhaps without even causally communicating) to cut the humans out of control of the future?
I don’t think the AI systems are all on the same team. That said, to the extent that there are “humans are deluded” outcomes that are generally preferable according to many AIs’ values, I think the AIs will tend to bring about such outcomes. I don’t have a strong view on whether that involves explicit coordination. I do think the range of everyone-wins outcomes (amongst AIs) is larger because of the “AIs generalize ‘correctly’” assumption, so this story probably feels a bit more like “us vs. them” than a story that relaxed that assumption.
Why aren’t they fighting each other as well as the humans? Or maybe they do fight each other but you didn’t focus on that aspect of the story because it’s less relevant to us?
I think they are fighting each other all the time, though mostly in very prosaic ways (e.g. McDonald’s and Burger King’s marketing AIs are directly competing for customers). Are there some particular conflicts you imagine that are suppressed in the story?
I feel like when takeoff is that distributed, there will be at least some people/factions who create agenty AI systems that aren’t even as superficially aligned as the unaligned benchmark. They won’t even be trying to make things look good according to human judgment, much less augmented human judgment!
I’m imagining that’s the case in this story.
Failure is early enough in this story that e.g. the humans’ investment in sensor networks and rare expensive audits isn’t slowing them down very much compared to the “rogue” AI.
Such “rogue” AI could provide a competitive pressure, but I think it’s a minority of the competitive pressure overall (and at any rate it has the same role/effect as the other competitive pressure described in this story).
Can you say more about how “the failure modes in this story are an important input into treachery?”
We will be deploying many systems to anticipate/prevent treachery. If we could stay “in the loop” in the sense that would be needed to survive this outer alignment story, then I think we would also be “in the loop” in roughly the sense needed to avoid treachery. (Though it’s not obvious in light of the possibility of civilization-wide cascading ML failures, and does depend on further technical questions about techniques for avoiding that kind of catastrophe.)
I think they are fighting each other all the time, though mostly in very prosaic ways (e.g. McDonald’s and Burger King’s marketing AIs are directly competing for customers). Are there some particular conflicts you imagine that are suppressed in the story?
I think the one that stands out the most is ‘why isn’t it possible for some security/inspector AIs to get a ton of marginal reward by whistleblowing against the efforts required for a flawless global camera grab?’ I understand the scenario says it isn’t because the demonstrations are incomprehensible, but why/how?
I understand the scenario says it isn’t because the demonstrations are incomprehensible
Yes, if demonstrations are comprehensible then I don’t think you need much explicit AI conflict to whistleblow since we will train some systems to explain risks to us.
why/how?
The global camera grab must involve plans that aren’t clearly bad to humans even when all the potential gotchas are pointed out. For example, they may involve dynamics that humans just don’t understand, or where a brute-force simulation or experiment would be prohibitively expensive without leaps of intuition that machines can make but humans cannot. Maybe that’s about tiny machines behaving in complicated ways or being created covertly, or crazy complicated dynamics of interacting computer systems that humans can’t figure out. It might involve the construction of new AI-designed AI systems which operate in different ways, whose function we can’t really constrain except by seeing predictions of their behavior from an even greater distance (machines which are predicted to lead to good-looking outcomes, which would have been able to exhibit failures to us if so incentivized, but which are even harder to control).
(There is obviously a lot you could say about all the tools at humans’ disposal to circumvent this kind of problem.)
This is one of the big ways in which the story is more pessimistic than my default, and perhaps the highlighted assumptions rule out the most plausible failures, especially (i) multi-year takeoff, (ii) reasonable competence on the part of civilization, (iii) “correct” generalization.
Even under those assumptions I do expect events to eventually become incomprehensible in the necessary ways, but it feels more likely that there will be enough intervening time for ML systems to e.g. solve alignment or help us shift to a new world order or whatever. (And as I mention, in the worlds where the ML systems can’t solve alignment well enough in the intervening time, I do agree that it’s unlikely we can solve it in advance.)