This sort of thing is why I am optimistic about alignment. I think it will look ridiculously easy and obvious in hindsight—because, as you say, evolution does not have access to any information about people’s world models, yet it has been able to steer us towards all the values we have. That implies we can probably steer AI towards good values without needing to understand its mind.
It’s unfortunate that I cannot remember my early childhood well enough to figure out how my values developed out of raw sensory data. For instance, I don’t remember the first time I felt compassion for an entity that was suffering; I have no idea how I determined that this was in fact happening. My suspicion is that care for entities being harmed is learned by having the brain attend to sounds that resemble a crying baby, and then gradually generalize from there, but that’s probably not the whole story. (And that’s only one of many values, of course.)