In the wire-heading case this feels different in that you have essentially two separate value functions—a cortical LM based one which can extrapolate values in linguistic/concept space and a classic RL basal-ganglia value function which is based on your personal experience.
I guess I want to call the second one “the actual value function defined in the agent’s source code” and the first one “the agent’s learned concept of ‘value function’” (or relatedly, “the agent’s learned concept of ‘pleasure’” / “the agent’s learned concept of ‘satisfaction’” / whatever).
Other than that, I don’t think we’re in disagreement about anything, AFAICT.
I guess I want to call the second one “the actual value function defined in the agent’s source code” and the first one “the agent’s learned concept of ‘value function’” (or relatedly, “the agent’s learned concept of ‘pleasure’” / “the agent’s learned concept of ‘satisfaction’” / whatever).
Other than that, I don’t think we’re in disagreement about anything, AFAICT.