I agree with the general point here, but I think there’s an important consideration that makes the application to RL algorithms less clear: wireheading is an artifact of embeddedness, and most RL work is in the non-embedded setting. Thus, it seems plausible that the development of better RL algorithms does in fact lead to algorithms that would wirehead if they were deployed in an embedded setting.
Here’s a question:
In a non-embedded (Cartesian) training environment where wireheading is impossible, is it the case that:
IF an intervention makes the value function strictly more accurate as an approximation of expected future reward,
THEN this intervention is guaranteed to lead to an RL agent that does more of the cool things the programmers want?
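To pin down what I mean by “strictly more accurate” (this is just one possible formalization; the choice of a fixed policy π, the sup norm, and the hat notation are my own, not anything canonical): write the true value function for a policy π as

$$V^\pi(s) \;=\; \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t R_{t+1} \,\middle|\, S_0 = s\right],$$

and say that an intervention changing the learned approximation from $\hat{V}_{\text{old}}$ to $\hat{V}_{\text{new}}$ makes it strictly more accurate if

$$\left\|\hat{V}_{\text{new}} - V^\pi\right\|_\infty \;<\; \left\|\hat{V}_{\text{old}} - V^\pi\right\|_\infty.$$

The claim is then that, in a Cartesian environment, any intervention satisfying this inequality yields an agent that does more of what the programmers want.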
I can’t immediately think of any counterexamples to that claim, but I would still guess that counterexamples exist.
(For the record, I do not claim that wireheading is nothing to worry about. I think that wireheading is a plausible but not inevitable failure mode. I don’t currently know of any plan in which there is a strong reason to believe that wireheading definitely won’t happen, except plans that severely cripple capabilities, such that the AGI can’t invent new technology, etc. And I agree with you that if AI people continue to do all their work in wirehead-proof Cartesian training environments, and don’t even try to think about wireheading, then we shouldn’t expect them to make any progress on the wireheading problem!)