Here’s a question:
In a non-embedded (cartesian) training environment where wireheading is impossible, is it the case that:
IF an intervention makes the value function strictly more accurate as an approximation of expected future reward,
THEN this intervention is guaranteed to lead to an RL agent that does more cool things that the programmers want?
I can’t immediately think of any counterexamples to that claim, but I would still guess that counterexamples exist.
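To make the claim concrete, here is a toy sketch of the setup it quantifies over. Everything in it is a hypothetical for illustration: a 5-state chain MDP standing in for the cartesian training environment, the exact value function V* standing in for "expected future reward," and a noise-corrupted copy standing in for a less accurate value function. Acting greedily on the accurate V* can't do worse than acting greedily on the noisy one, which is the direction the claim asserts (the open question is whether *every* accuracy improvement helps, which this toy doesn't settle).

```python
import numpy as np

# Toy cartesian environment: states 0..4 on a line, actions move left/right.
# Reward +1 on reaching state 4, which ends the episode.
GAMMA = 0.9
N = 5

def step(s, a):
    """Deterministic transition: a=+1 is right, a=-1 is left."""
    s2 = min(max(s + a, 0), N - 1)
    r = 1.0 if s2 == N - 1 else 0.0
    return s2, r, s2 == N - 1

def true_values():
    """Exact V* via value iteration -- the true 'expected future reward'."""
    V = np.zeros(N)
    for _ in range(100):
        for s in range(N - 1):
            V[s] = max(r + GAMMA * V[s2] * (not done)
                       for s2, r, done in (step(s, a) for a in (-1, +1)))
    return V

def rollout(V, s=0, max_steps=20):
    """Act greedily w.r.t. a value estimate V; return the discounted return."""
    total, disc = 0.0, 1.0
    for _ in range(max_steps):
        # one-step backup through the (known, cartesian) model
        a = max((-1, +1), key=lambda a: step(s, a)[1] + GAMMA * V[step(s, a)[0]])
        s, r, done = step(s, a)
        total += disc * r
        disc *= GAMMA
        if done:
            break
    return total

V_star = true_values()
rng = np.random.default_rng(0)
V_noisy = V_star + rng.normal(0, 2.0, size=N)  # a much less accurate value fn

print(rollout(V_star), rollout(V_noisy))
```

In this toy, the greedy-on-V* agent collects the optimal discounted return, and the agent using the noisy estimate can only tie or underperform it. Of course, "return in a chain MDP" is standing in for "cool things the programmers want," and collapsing that gap is exactly where counterexamples might hide.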
(For the record, I do not claim that wireheading is nothing to worry about. I think that wireheading is a plausible but not inevitable failure mode. I don’t currently know of any plan in which there is a strong reason to believe that wireheading definitely won’t happen, except plans that severely cripple capabilities, such that the AGI can’t invent new technology etc. And I agree with you that if AI people continue to do all their work in wirehead-proof cartesian training environments, and don’t even try to think about wireheading, then we shouldn’t expect them to make any progress on the wireheading problem!)