I don’t think Dearnaley’s proposal is detailed enough to establish whether it would really have a “basin of attraction” in practice. I take it to be roughly the same idea as ambitious value learning and CEV. All of them might be said to have a basin of attraction (and therefore your continuity property) for this reason: if they initially misunderstand what humans want (a form of your delta), they should work to understand it better and confirm that they do, as a byproduct of their goal being not a fixed set of outcomes but a variable standing for whatever outcomes humans prefer, whose exact value can remain unknown and be refined as a sub-goal.
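To make that “goal as a variable” point concrete, here’s a minimal toy sketch in Python (my own illustration, not Dearnaley’s proposal or any real value-learning algorithm; all names and numbers are hypothetical). The agent’s objective is “whichever outcome humans prefer”, held as a probability distribution, so gathering evidence and updating is instrumentally useful rather than optional, which is the mechanism the basin-of-attraction argument relies on:

```python
# Toy sketch: an agent whose goal is a *variable* ("the outcome humans prefer")
# rather than a fixed outcome. A small initial delta (a somewhat wrong prior)
# gets corrected because refining the belief serves the goal itself.

import random

random.seed(0)

# Belief over which outcome humans actually prefer; the prior starts off wrong.
belief = {"dance_parties": 0.4, "knitting_circles": 0.6}
true_human_preference = "dance_parties"

def ask_humans():
    """Noisy evidence about human preferences (a stand-in for value learning)."""
    return true_human_preference if random.random() < 0.8 else "knitting_circles"

def update(belief, observation, likelihood=0.8):
    """Bayesian update of the belief over the preferred outcome."""
    posterior = {}
    for outcome, p in belief.items():
        p_obs = likelihood if outcome == observation else 1 - likelihood
        posterior[outcome] = p * p_obs
    total = sum(posterior.values())
    return {o: p / total for o, p in posterior.items()}

# Because the goal is "whatever humans prefer", reducing uncertainty about that
# variable is instrumentally valuable, so the agent keeps asking before committing.
for _ in range(10):
    belief = update(belief, ask_humans())

chosen = max(belief, key=belief.get)
print(f"posterior: {belief}")
print(f"agent optimizes for: {chosen}")  # converges to the true preference
```

The contrast is with an agent whose objective is hard-coded to one of those outcomes: a delta in that hard-coded value never gets corrected, because nothing in the goal rewards checking it.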
Another related thing that springs to mind: all goals may have your continuity property with a slightly different form of delta. If an AGI has one main goal and a few less important goals/values, those minor goals might (in some decision-making processes) be eliminated if keeping them would hurt the AGI’s ability to achieve the main goal.
The other important point to note about the continuity property is that we don’t know how large a delta would be ruinous. It’s been said that “value is fragile”, but the post “But exactly how complex and fragile?” got almost zero meaningful discussion. Nobody knows until we get around to working that out. It could be that a small delta in some AGI architectures would just result in a world with slightly more things like dance parties and slightly fewer things like knitting circles: disappointing to knitters, but not at all catastrophic. I consider that another important unresolved issue.
Back to your initial point: I agree that other preferences could interact disastrously with the indeterminacy of something like CEV. It’s hard for me to imagine an AGI whose goal is to do what humanity wants but which also has a preference for wiping out humanity, though it’s not impossible. Given the complexity of pseudo-goals in a system like an LLM, it’s probably something we should be careful of.