This is because evolution is directly selecting the policy
Huh? Evolution did not directly select over human policy decisions. Evolution specified brains, which do within-lifetime learning and therefore learn different policies given different upbringings; e.g. learning-rate mutations indirectly lead to statistical differences in the policies humans learn. Evolution probably specifies some reward circuitry, the learning architecture, the broad-strokes learning processes (self-supervised predictive + RL), and some other factors, from which the policy is produced.
The IGF->human values analogy is indeed relevantly misleading IMO, but not for this reason.
When I say “policy”, I mean the entire behavior including the learning algorithm, not some asymptotic behavior the system is converging to. Obviously, the policy is represented as genetic code, not as individual decisions. When I say “evolution is directly selecting the policy”, I mean that genotypes are selected based on their “expected reward” (reproductive fitness) rather than e.g. by evaluating the accuracy of the world-models those minds produce[1]. And genotypes are not a priori constrained to be learning algorithms with particular architectures; that is something the outer loop has to discover.
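The two-level structure being debated can be sketched in code. This is a toy illustration under assumptions of my own (the `genotype` fields, the 1-D "policy parameter", and the quadratic fitness are all made up for the example): the outer loop selects genotypes purely on realized fitness, while the genotype only specifies reward circuitry and a learning rate, from which within-lifetime learning produces the actual behavior.

```python
import random

def inner_lifetime_learning(genotype, env_samples, steps=50):
    """Within-lifetime learning: a 1-D 'policy parameter' theta is nudged
    toward higher reward using the genotype's learning rate and reward
    weighting. The genotype never encodes individual decisions directly."""
    theta = 0.0
    for _ in range(steps):
        x = random.choice(env_samples)
        # reward circuitry specified by the genotype
        grad = -2 * genotype["reward_weight"] * (theta - x)
        theta += genotype["lr"] * grad
    return theta

def fitness(genotype, env_samples):
    """Outer-loop evaluation: only total 'reproductive success' is visible
    to selection, not the internals of the learned policy."""
    theta = inner_lifetime_learning(genotype, env_samples)
    return -sum((theta - x) ** 2 for x in env_samples) / len(env_samples)

def evolve(pop_size=20, generations=30):
    """Selection on fitness alone: truncation selection plus multiplicative
    mutation of the genotype's hyperparameters."""
    env = [1.0, 1.2, 0.8, 1.1]
    pop = [{"lr": random.uniform(0.001, 0.5),
            "reward_weight": random.uniform(0.1, 2.0)}
           for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=lambda g: fitness(g, env), reverse=True)
        survivors = scored[: pop_size // 2]
        pop = survivors + [
            {k: v * random.uniform(0.9, 1.1)
             for k, v in random.choice(survivors).items()}
            for _ in range(pop_size - len(survivors))
        ]
    return max(pop, key=lambda g: fitness(g, env))
```

Note that nothing in `evolve` inspects `theta` or the learning process; selection only ever sees the fitness score, which is the sense in which the outer loop "directly selects the policy" while remaining agnostic about how it is implemented.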
Evolution is not even model-free RL, since in MFRL we train a network to estimate the value function or the Q-function of different states; we don't just run gradient descent on the expected reward. But MFRL does have the problem of extrapolating the reward function incorrectly away from the training data.
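To make the contrast concrete, here is a minimal tabular Q-learning sketch (the `chain_step` toy environment and all names are my own, purely for illustration). Unlike the evolutionary case, the learner maintains an explicit estimate Q(s, a) of expected return and bootstraps each update from its own value estimates, rather than scoring whole policies by their total reward.

```python
import random

def chain_step(s, a):
    """Toy 5-state chain: action 1 moves right, action 0 moves left;
    reaching state 4 yields reward 1 and ends the episode."""
    s2 = min(s + 1, 4) if a == 1 else max(s - 1, 0)
    return (s2, 1.0, True) if s2 == 4 else (s2, 0.0, False)

def q_learning(env_step, n_states, n_actions,
               episodes=200, alpha=0.1, gamma=0.9, eps=0.1):
    """Model-free RL: learn a table of value estimates Q(s, a),
    updated by temporal-difference bootstrapping."""
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # epsilon-greedy action selection
            if random.random() < eps:
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda a_: Q[s][a_])
            s2, r, done = env_step(s, a)
            # TD update: target bootstraps from the current value estimate
            target = r if done else r + gamma * max(Q[s2])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
    return Q
```

The per-step TD update on a value estimate is exactly the machinery evolution lacks: selection on reproductive fitness is closer to black-box optimization of the whole genotype's expected reward, with no learned value function to extrapolate (correctly or incorrectly) to new states.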