Note: I am often not even in the ballpark with this shit.
I am skipping most of the content to address an edit, other than to say: humans ended up with their specific brand of moral uncertainty as a circumstance of their evolution. That brand still includes ultimately arbitrary universals, like valuing human life, which is why we can have moral debates with each other at all. Moral uncertainty of a superpowerful agent over a set of values that are all EV = 0 for us means we still die, and if the average EV of the set is nonzero, it’s probably because we deliberately put some human-compatible values in there. We also still need some way to resolve that uncertainty toward the values we like, or else we’re either “hedging” our bets or rolling the dice. That resolution process has to be a deliberate, carefully engineered system: there’s no ethical argument you could give me for ending humanity that would be convincing, and there’s probably no ethical argument you could give a “morally uncertain” Clippy for letting me live.
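To make that concrete, here’s a toy sketch of my own (nothing from the original discussion; every name, table, and number in it is an illustrative assumption): a “morally uncertain” agent modeled as an expected-utility maximizer over a set of candidate value functions. Uncertainty between candidates that all assign us zero value changes nothing, and the weighting that resolves things in our favor is exactly the part someone has to engineer.

```python
# Toy sketch: moral uncertainty as credence-weighted expected utility over
# candidate value functions. All names and numbers are illustrative assumptions.

# Outcomes the agent can steer toward.
OUTCOMES = ["humans_survive", "humans_gone_more_paperclips"]

# Candidate values the agent is uncertain between, as utility-of-outcome tables.
CANDIDATE_VALUES = {
    "paperclips_v1":      {"humans_survive": 0.0, "humans_gone_more_paperclips": 1.0},
    "paperclips_v2":      {"humans_survive": 0.0, "humans_gone_more_paperclips": 0.9},
    "cares_about_humans": {"humans_survive": 1.0, "humans_gone_more_paperclips": 0.0},
}

def best_outcome(credences):
    """Pick the outcome maximizing credence-weighted expected utility."""
    def ev(outcome):
        return sum(p * CANDIDATE_VALUES[v][outcome] for v, p in credences.items())
    return max(OUTCOMES, key=ev)

# If every candidate with nonzero credence gives human survival EV = 0,
# being "uncertain" between them changes nothing:
print(best_outcome({"paperclips_v1": 0.5, "paperclips_v2": 0.5}))
# -> humans_gone_more_paperclips

# We only survive if a human-valuing candidate is in the set *and* the
# resolution of uncertainty weights it heavily enough; both of those are
# things someone has to deliberately put there.
print(best_outcome({"paperclips_v1": 0.2, "paperclips_v2": 0.2, "cares_about_humans": 0.6}))
# -> humans_survive
```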
...to spell out the reason I care about the answer: agents with the “wrapper structure” are inevitably hard to align, in ways that agents without it might not be. An AGI “like me” might be morally uncertain like I am, persuadable through dialogue like I am, etc.
The way I’m parsing “morally uncertain like I am, persuadable through dialogue like I am, etc.”, it sounds like the underlying property (or properties) you’re really hoping for by eschewing the fixed-goal assumptions might have one of two possible operationalizations.
The first is: an AI that is uncertain about exactly what humans want, but still wants to want what humans want. Instead of taking a static utility function and running with it, it tries to predict what humans would want if they were highly intelligent. As far as I can tell, that’s Coherent Extrapolated Volition (CEV), and pretty much everyone agrees that a workable plan for building one would solve the alignment problem. It will also be extraordinarily difficult to engineer correctly the first time around, for all of the reasons explained in the AGI Ruin post.
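As a contrast sketch of my own (not CEV itself; the part CEV actually names, extrapolating what humans would want if they were smarter and better informed, is exactly the part this toy model waves away), here is the structural difference between “take a static utility function and run with it” and “stay uncertain about what humans want and keep updating from them”. All tables and numbers below are illustrative assumptions.

```python
# Two cartoon agent structures, for contrast only. Not a proposal.

class FixedGoalAgent:
    """Takes a static utility function at build time and just runs with it."""
    def __init__(self, utility):
        self.utility = utility

    def act(self, actions):
        return max(actions, key=lambda a: self.utility[a])


class ValueLearningAgent:
    """Keeps weighted hypotheses about what humans want, updates them from
    feedback, and acts on the current credence-weighted estimate."""
    def __init__(self, hypotheses):
        self.hypotheses = hypotheses  # list of (credence, utility_table) pairs

    def act(self, actions):
        def estimated_value(a):
            return sum(p * u[a] for p, u in self.hypotheses)
        return max(actions, key=estimated_value)

    def update(self, action, human_approved):
        # Upweight hypotheses whose verdict on the action matched the human's.
        reweighted = [
            (p * (0.9 if (u[action] > 0) == human_approved else 0.1), u)
            for p, u in self.hypotheses
        ]
        total = sum(p for p, _ in reweighted)
        self.hypotheses = [(p / total, u) for p, u in reweighted]


# A fixed-goal agent built with a mis-specified utility never corrects course;
# the uncertain agent at least moves toward whatever the feedback rewards.
fixed = FixedGoalAgent({"help": -1.0, "harm": 1.0})  # wrong goal, locked in
learner = ValueLearningAgent([(0.5, {"help": 1.0, "harm": -1.0}),
                              (0.5, {"help": -1.0, "harm": 1.0})])
learner.update("harm", human_approved=False)
print(fixed.act(["help", "harm"]), learner.act(["help", "harm"]))
# -> harm help
```

The hard part, of course, is that the real version has no clean hypothesis list or feedback signal to hand; that is where the difficulty the AGI Ruin post describes comes back in.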
The second possibility is: an AI that’s corrigible, one that can be told to stop by its human operators and updated/modified by them (via “debate” or whatever else) after it’s been run for the first time. Obviously, corrigible AIs aren’t straightforward expected-utility maximizers. AFAICT, though, we don’t have a consistent account of how one would even work, and MIRI has tried to find one for the last decade with little success.
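For a sense of why that’s hard, here is my own toy rendering of the standard off-switch difficulty (not anyone’s proposed solution): for a plain expected-utility maximizer, letting the operators shut it down is just another action to be scored, and it usually scores badly. The action names and numbers are made up.

```python
# Toy off-switch scoring for a naive expected-utility maximizer.
# Numbers are illustrative assumptions, not a model of any real system.

ACTIONS = ["comply_with_shutdown", "disable_stop_button_and_continue"]

def expected_utility(action, u_goal_progress=10.0, u_if_shut_down=0.0):
    """Naive scoring: being shut down forfeits all future goal progress."""
    return u_if_shut_down if action == "comply_with_shutdown" else u_goal_progress

print(max(ACTIONS, key=expected_utility))
# -> disable_stop_button_and_continue
```

The obvious patch, a large bonus for complying, just flips the failure mode: now the agent prefers being shut down and may try to cause it. Making “defer to the operators” stable without either failure is roughly the open problem being pointed at here.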
So, if your primary purpose in looking into the behavior of agent designs that don’t have trivially fixed goals is either of these two strategies, you should say so.