I feel like this is a good point in general, but I think there is an important but subtle distinction between the two examples. In the GAN case, the distinction is between the inner optimization loop of the ML algorithm and the outer loop of humans performing an evolutionary search process to get papers/make pretty pictures.
In the wire-heading case this feels different in that you have essentially two separate value functions—a cortical LM based one which can extrapolate values in linguistic/concept space and a classic RL basal-ganglia value function which is based on your personal experience. The difference here is mostly in training data: the cortex is trained on a large sensory corpus, including linguistic text describing wire heading, while the subcortical value function is largely trained on personal rewarding experiences. It would be odd for the two to always be consistent, and forcing them to be would lead to strange failure modes exactly like wire heading, or, more generally, to being viscerally convinced of anything you read that sounds convincing.
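To make the two-value-function picture concrete, here is a toy sketch. Everything in it is a made-up illustration: the action names, the numbers, and the simple running-average update are assumptions, not a claim about how the cortex or basal ganglia actually compute anything. It only shows that two estimators trained on different data can disagree sharply about an action one of them has never experienced.

```python
# Toy sketch only: all names and numbers below are hypothetical.

def td_update(value, reward, lr=0.5):
    """Move an experience-based value estimate a step toward an observed reward."""
    return value + lr * (reward - value)

# "Subcortical" value function: learned only from personally experienced rewards.
experienced_value = {"eat_cake": 0.0, "wirehead": 0.0}
for _ in range(10):
    experienced_value["eat_cake"] = td_update(experienced_value["eat_cake"], reward=1.0)
# "wirehead" has never been tried, so its experienced value is still 0.0.

# "Cortical" estimate: extrapolated from descriptions (e.g. text), not experience.
described_value = {"eat_cake": 1.0, "wirehead": 100.0}

def disagreement(action):
    """Gap between the description-based and the experience-based estimate."""
    return described_value[action] - experienced_value[action]

for action in ("eat_cake", "wirehead"):
    print(action, disagreement(action))
```

In this toy setup the two estimates nearly agree on the familiar action and diverge on the one known only from description, which is the situation described above.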
In the wire-heading case this feels different in that you have essentially two separate value functions—a cortical LM based one which can extrapolate values in linguistic/concept space and a classic RL basal-ganglia value function which is based on your personal experience.
I guess I want to call the second one “the actual value function defined in the agent’s source code” and the first one “the agent’s learned concept of ‘value function’” (or relatedly, “the agent’s learned concept of ‘pleasure’” / “the agent’s learned concept of ‘satisfaction’” / whatever).
Other than that, I don’t think we’re in disagreement about anything, AFAICT.