Gordon Seidoh Worley comments on Schelling Shifts During AI Self-Modification

Gordon Seidoh Worley 16 Apr 2018 23:36 UTC
3 points
I don’t know; I’m increasingly unconvinced that we should expect to see no value drift. In particular, value drift can be a product of computing reflective equilibrium: values may move from their original positions in order to become consistent with other values. In this sense the original value might be thought of as mistaken, and it could be the correct move to drift towards a value that is stable under reflection. And this is to say nothing of “drift” that results from updating on new information.

Put another way, it seems unlikely to me that we can build AGI that is both fully general and immune to instability under self-modification; to get greater stability we must give up some general functionality. Arguably this is exactly what alignment is (giving up access to parts of mind space in exchange for meeting particular safety guarantees), but I think it’s also worth pointing out that there may be a sense in which we can oversolve alignment: remove all value drift, and the intended AGI ends up narrow rather than general.