Seth Herd comments on Alignment: “Do what I would have wanted you to do”

Seth Herd 13 Jul 2024 11:38 UTC
2 points
0
It seems like all of the many correct answers to what X would’ve wanted might not include the AGI killing everyone.

Wrt the continuity property, I think Max Harms’s corrigibility proposal has that, without suffering as obviously from the multiple interpretations you mention. Ambitious value learning is intended to as well, but has more of that problem. Roger Dearnaley’s alignment as a basin of attraction addresses that stability property more directly. Sorry I don’t have links handy.
- JuliaHP 13 Jul 2024 12:28 UTC
  8 points
  0
  >It seems like all of the many correct answers to what X would’ve wanted might not include the AGI killing everyone.
  Yes, but if it wants to kill everyone it would pick one which does. The space “all possible actions” also contains some friendly actions.
  
  >Wrt the continuity property, I think Max Harms’s corrigibility proposal has that
  I think it understands this and is aiming to have that, yeah. It looks like a lot of work needs to be done to flesh it out.
  
  I don’t have a good enough understanding of ambitious value learning & Roger Dearnaley’s proposal to properly comment on these. Skimming + priors put fairly low odds on them dealing with this in the proper manner, but I could be wrong.
  - Seth Herd 13 Jul 2024 21:10 UTC
    2 points
    0
    I don’t think Dearnaley’s proposal is detailed enough to establish whether it would really have a “basin of attraction” in practice. I take it to be roughly the same idea as ambitious value learning and CEV. All of them might be said to have a basin of attraction (and therefore your continuity property) for this reason: if they initially misunderstand what humans want (a form of your delta), they should work to understand it better and make sure they have it right. This is a byproduct of their goal being not a certain set of outcomes but a variable standing for the outcomes humans prefer; the exact value of that variable can remain unknown and be refined as one possible sub-goal.
    Another related thing that springs to mind: all goals may have your continuity property with a slightly different form of delta. If an AGI has one main goal, and a few other less important goals/values, those might (in some decision-making processes) be eliminated in favor of the more important goal (if continuing to have those minor goals would hurt its ability to achieve the more important goal).
    The other important piece to note about the continuity property is that we don’t know how large a delta would be ruinous. It’s been said that “value is fragile”, but the post “But exactly how complex and fragile?” got almost zero meaningful discussion. Nobody knows until we get around to working that out. It could be that a small delta in some AGI architectures would just result in a world with slightly more things like dance parties and slightly fewer things like knitting circles, disappointing to knitters but not at all catastrophic. I consider that another important unresolved issue.
    Back to your initial point: I agree that other preferences could interact disastrously with the indeterminacy of something like CEV. It’s hard for me to imagine an AGI whose goal is to do what humanity wants but that also has a preference for wiping out humanity, but it’s not impossible. I guess with the complexity of pseudo-goals in a system like an LLM, it’s probably something we should be careful of.