Very interesting! Zack’s two questions were also the top two questions that came to mind for me. I’m not sure if you got around to writing this up in more detail, John, but I’ll jot down the way I tentatively view this differently. Of course I’ve given this vastly less thought than you have, so many grains of salt.
On “If this is so hard, how do humans and other agents arguably do it so easily all the time?”: how meaningful is the notion of extra parameters if most agents can find uses for any parameters, even just through redundancy or error-correction (e.g., as a backup in case a base pair changes) or exaptation of an otherwise useless mutation? In alignment, why assume that all aligned AIs “look like they work”? Why assume these are binaries? Etc. In general, there seem to be many realistic additions to your model that mitigate this exponential-increase-in-possibilities challenge and more closely fit successful real-world agents. I don’t see as many such additions that would make the optimization even more challenging.
On generators, why should we carve such a clear and small circle around genes as the generators? Rob mentioned the common thought experiment of alien worlds in which genes produce babies who grow up in isolation from human civilization, and I would push on that further. Even on Earth, we have Stone Age values versus modern values. Unless you draw the line more widely (either by calling more things generators or by including non-generators), this notion of “generators of human values” starts to seem very narrow and much less meaningful for alignment or for a general understanding of agency, which I think most people would say requires learning more values than what is in our genes. I don’t think “feed an AI data” gets around this: AIs already have easy access to genes and to humans of all ages. There is an advantage to telling the AI “these are the genes that matter,” but could it really just take those genes, or their mapping onto some value space, and raise virtual value-children in a useful way? How would it know it isn’t leaving out the important differentiators between Stone Age and modern values, genetic or otherwise? How would it adjudicate between all the variation in values from all of these sources? How could we map them onto trade-offs suitable for coherence conditions? Etc.