I gave part of my answer in the thread where you first asked this question. Here’s the rest.
TLDR: Value alignment is too hard even without the value stability problem. Goal misspecification is too likely (I realize I don't know the best ref for this other than LoL; anyone have a better central ref?). Therefore we'll very likely align our first AGIs to follow instructions, and use that as a stepping-stone to full value alignment.
This is something I used to worry about a lot. Now it's something I don't worry about at all.
I wrote a paper on this, Goal changes in intelligent agents, back in 2018 under a small FLI grant (perhaps the first round of public funding for AGI x-risk). One of my first posts on LW was The alignment stability problem.
I still think this would be a very challenging problem if we were designing a value-aligned autonomous AGI. Now I don’t think we’re going to do that.
I now see goal misspecification as a very hard problem, and one we don't need to tackle to create autonomous AGI or even superintelligence. Therefore I think we won't tackle it.
Instead we’ll make the central goal of our first AGIs to follow instructions or to be corrigible (correctable).
It’s counterintuitive to think of a highly intelligent and fully autonomous being that wants more than anything to do what a less intelligent human tells them to do. But I think it’s completely possible, and a much safer option for our first AGIs.
This is much simpler than trying to instill our values accurately enough that we'd be happy with the result. Neither showing examples of things we like (as in RL training) nor explicitly stating our values in natural language seems likely to remain accurate once interpreted by a superintelligent AGI that sees the world at least somewhat differently than we do. That sort of re-interpretation is functionally similar to value drift, though the two are separable. Adding actual value drift on top of the dangers of goal misspecification only makes things worse.
Aligning an AGI to follow instructions isn’t trivial either, but it’s a lot easier to specify than getting values right and stable. For instance, LLMs already largely “know” what people tend to mean by instructions—and that’s before the checking phase of do what I mean and check (DWIMAC).
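The checking phase of DWIMAC can be made concrete with a toy sketch: the system interprets an instruction, restates its interpretation back to the user, and only acts once the user confirms that reading matches their intent. All function names below are hypothetical placeholders for illustration, not any real system's API.

```python
# Illustrative sketch of a DWIMAC (do-what-I-mean-and-check) loop.
# interpret, confirm, and act are hypothetical stand-ins: the model's
# reading of an instruction, the user's approval step, and execution.

def dwimac_step(instruction, interpret, confirm, act):
    """Interpret an instruction, check the interpretation with the
    user, and act only on a confirmed reading."""
    plan = interpret(instruction)      # model's reading of the instruction
    if confirm(instruction, plan):     # checking phase: restate and ask
        return act(plan)               # act only if the user approves
    return None                        # otherwise do nothing; await revision

# Toy usage: the "model" interprets too literally, and the "user"
# rejects the mismatched reading, so nothing is executed.
result = dwimac_step(
    "tidy my desktop",
    interpret=lambda s: "delete all files on the desktop",
    confirm=lambda s, p: False,        # user: "that's not what I meant"
    act=lambda p: p,
)
# result is None: the unconfirmed plan was never carried out
```

The safety-relevant property is simply that the act step is gated behind the confirm step, which is the structural difference between instruction-following-with-checking and acting directly on a possibly misread goal.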
Primarily, though, instruction-following has the enormous advantage of allowing for corrigibility: you can tell your AGI to shut down so you can make changes, or issue revised instructions if/when you realize (likely because you asked the AGI) that your instructions would be interpreted differently than you'd like.
If that works and we get superhuman AGI aligned to follow instructions, we'll probably want to use that AGI to help us solve the problem of full value alignment, including solving value drift. We won't want to launch an autonomous AGI that's not corrigible/instruction-following until we're confident our AGIs have a solid solution. (This assumes those AGIs are controlled by humans ethical enough to release control of the future into better hands once they're available, which is a big if.)