This is a good post! It feels to me like a lot of the discussion I’ve recently encountered is converging on this topic, so here’s something I wrote on Twitter not long ago that feels relevant:
I think most value functions crystallized out of shards of not-entirely-coherent drives will not be friendly to the majority of the drives that went in; in humans, for example, a common outcome of internal conflict resolution is to explicitly subordinate one interest to another.
I basically don’t think this argument differs very much between humans and ASIs; the reason I expect humans to be safe(r) under augmentation isn’t that I expect them not to do the coherence thing, but that I expect them to do it in a way I would meta-endorse.
And so I would predict the output of that reflection process, when run on humans by humans, to be substantially likelier to contain things we from our current standpoint recognize as valuable—such as care for less powerful creatures, less coherent agents, etc.
If you run that process on an arbitrary mind, the stuff inside the world-model isn’t guaranteed to give rise to something similar, because (I predict) the drives themselves will be different, and the meta-reflection/extrapolation process will likewise be different.