When I say “policy”, I mean the entire behavior, including the learning algorithm, not some asymptotic behavior the system is converging to. Obviously, the policy is represented as genetic code, not as individual decisions. When I say “evolution is directly selecting the policy”, I mean that genotypes are selected based on their “expected reward” (reproductive fitness), rather than, e.g., by evaluating the accuracy of the world-models those minds produce[1]. Moreover, genotypes are not a priori constrained to be learning algorithms with particular architectures; that is something the outer loop has to learn.
Evolution is not even model-free RL, since in MFRL we typically train a network to estimate the value function of states or the Q-function of state-action pairs; we don’t just do gradient descent on the expected reward. But MFRL does have the problem of extrapolating the reward function incorrectly away from the training data.
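To make the contrast concrete, here is a minimal sketch on a toy 3-armed bandit. Everything in it is an illustrative assumption rather than part of the argument: the payoffs in `REWARDS`, the function names `evolve_policy` and `q_learning`, and all hyperparameters. The first function selects genotypes purely by the reward they obtain and never builds an estimate of anything; the second learns a table of value estimates and derives its behavior from them, which is the step where estimates could in principle be wrong for actions that were rarely tried.

```python
import random

# Hypothetical toy problem: a 3-armed bandit with fixed payoffs.
REWARDS = [0.1, 0.5, 0.9]


def evolve_policy(generations=30, pop_size=20, mutation_rate=0.2, seed=0):
    """Direct selection: a genotype (here just an arm index) is ranked by
    the reward it obtains. No value estimate is ever learned."""
    rng = random.Random(seed)
    population = [rng.randrange(len(REWARDS)) for _ in range(pop_size)]
    for _ in range(generations):
        # Fitness is the reward itself; the fitter half survives unchanged.
        population.sort(key=lambda arm: REWARDS[arm], reverse=True)
        survivors = population[: pop_size // 2]
        # Offspring are copies of survivors, occasionally mutated.
        children = [
            rng.randrange(len(REWARDS)) if rng.random() < mutation_rate else parent
            for parent in survivors
        ]
        population = survivors + children
    return max(population, key=lambda arm: REWARDS[arm])


def q_learning(steps=500, alpha=0.5, epsilon=0.2, seed=0):
    """Value-based learning: maintain an estimate q[arm] of each arm's
    reward and act (mostly) greedily with respect to those estimates."""
    rng = random.Random(seed)
    q = [0.0] * len(REWARDS)
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(len(REWARDS))  # explore
        else:
            arm = max(range(len(REWARDS)), key=lambda a: q[a])  # exploit
        # Move the estimate toward the observed reward.
        q[arm] += alpha * (REWARDS[arm] - q[arm])
    greedy_arm = max(range(len(REWARDS)), key=lambda a: q[a])
    return greedy_arm, q


if __name__ == "__main__":
    print("evolved arm:", evolve_policy())
    print("Q-learning arm and estimates:", q_learning())
```

Both procedures are model-free in the sense that neither learns the environment's dynamics, but only the second has an intermediate learned object (the table `q`) that stands between the reward signal and the behavior.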