But the main disanalogy in the “human learning → human values” case is that reward circuitry/brain architecture mostly doesn’t change?
Congenitally blind people end up with human values, even though:
They’re missing entire chunks of vision-related hard coded rewards.
The entire visual cortex has been repurposed for other goals.
Evolution probably could not have “patched” the value formation process of blind people in the ancestral environment, due to the massive fitness disadvantage blindness confers.
So human value formation can’t be all that sensitive to fine details of the learning process or the reward circuitry.
And we would need to find the right reward circuitry somehow for AI, and that search process looks much more like evolution.
We could learn a reward model from human judgements, train on human judgements directly, finetune a language model, etc. There are many options here.
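To make the first of those options concrete, here is a minimal sketch of learning a reward model from pairwise human judgements with a Bradley-Terry style loss. The tiny MLP and the random “judgement” data are placeholder assumptions for illustration; in practice the reward model is usually a finetuned language model trained on real annotator comparisons.

```python
# Minimal sketch (placeholder architecture and data, not any particular pipeline):
# learn a scalar reward model from pairwise human preference judgements.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    def __init__(self, feature_dim: int = 16):
        super().__init__()
        # Stand-in for an embedding of a (prompt, response) pair.
        self.net = nn.Sequential(
            nn.Linear(feature_dim, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)  # one scalar reward per example

model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Toy "human judgements": each row pairs the features of a preferred output
# with the features of a rejected output. Real data comes from annotators.
preferred = torch.randn(256, 16) + 0.5
rejected = torch.randn(256, 16) - 0.5

for step in range(200):
    r_pref = model(preferred)
    r_rej = model(rejected)
    # Bradley-Terry / pairwise logistic loss: push the preferred output's
    # reward above the rejected output's reward.
    loss = -F.logsigmoid(r_pref - r_rej).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The only load-bearing idea is the pairwise loss: the model never sees an explicit definition of “good,” only which of two outputs a human preferred.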
And the prediction of (non-instrumental) inner values is not robust across different reward functions; the dog example only works because environment-invariant compassion is already implemented in the reward circuitry.
I don’t agree. If you slightly increase the strength of the reward circuits that rewarded the person for interacting with dogs, you get someone who likes dogs a bit more, not someone who wants to tile the universe with tiny molecular dog faces.
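As a toy illustration of that continuity claim (an analogy only: the softmax-over-rewards “value formation” rule below is a made-up stand-in, not a model of reward circuitry), scale one reward term by 10% and look at how much the learned preference moves.

```python
# Toy analogy: slightly scale one reward term and compare the resulting
# softmax "preferences" over activities. Not a model of the brain.
import numpy as np

activities = ["pet dog", "read book", "eat food"]
base_rewards = np.array([1.0, 1.0, 1.2])

def learned_preferences(rewards, temperature=1.0):
    # Stand-in for a value-formation process: softmax over average rewards.
    logits = rewards / temperature
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

for scale in [1.0, 1.1]:
    rewards = base_rewards.copy()
    rewards[0] *= scale  # slightly stronger "dog" reward circuit
    prefs = learned_preferences(rewards)
    print(scale, dict(zip(activities, prefs.round(3))))
```

With the dog reward 10% stronger, the “pet dog” preference rises from about 0.31 to about 0.33; it does not jump to ~1.0. Small changes to reward strength move the learned values a little, they don’t hand you a dog-face maximizer.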
Also, reward circuitry does not, and cannot, implement compassion directly. It can only reward you for taking actions that were probably driven by compassion. This is a very dumb approach that nevertheless actually literally works in actual reality, so the problem can’t be that hard.