Here’s a question:
In a non-embedded (cartesian) training environment where wireheading is impossible, is it the case that:
IF an intervention makes the value function strictly more accurate as an approximation of expected future reward,
THEN this intervention is guaranteed to lead to an RL agent that does more cool things that the programmers want?
I can’t immediately think of any counterexamples to that claim, but I would still guess that counterexamples exist.
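To make the claim concrete, here is a toy sketch of the setup it quantifies over. Everything in it is a hypothetical for illustration: a 5-state chain MDP standing in for the cartesian training environment, the exact value function V* standing in for "expected future reward," and a noise-corrupted copy standing in for a less accurate value function. Acting greedily on the accurate V* can't do worse than acting greedily on the noisy one, which is the direction the claim asserts (the open question is whether *every* accuracy improvement helps, which this toy doesn't settle).

```python
import numpy as np

# Toy cartesian environment: states 0..4 on a line, actions move left/right.
# Reward +1 on reaching state 4, which ends the episode.
GAMMA = 0.9
N = 5

def step(s, a):
    """Deterministic transition: a=+1 is right, a=-1 is left."""
    s2 = min(max(s + a, 0), N - 1)
    r = 1.0 if s2 == N - 1 else 0.0
    return s2, r, s2 == N - 1

def true_values():
    """Exact V* via value iteration -- the true 'expected future reward'."""
    V = np.zeros(N)
    for _ in range(100):
        for s in range(N - 1):
            V[s] = max(r + GAMMA * V[s2] * (not done)
                       for s2, r, done in (step(s, a) for a in (-1, +1)))
    return V

def rollout(V, s=0, max_steps=20):
    """Act greedily w.r.t. a value estimate V; return the discounted return."""
    total, disc = 0.0, 1.0
    for _ in range(max_steps):
        # one-step backup through the (known, cartesian) model
        a = max((-1, +1), key=lambda a: step(s, a)[1] + GAMMA * V[step(s, a)[0]])
        s, r, done = step(s, a)
        total += disc * r
        disc *= GAMMA
        if done:
            break
    return total

V_star = true_values()
rng = np.random.default_rng(0)
V_noisy = V_star + rng.normal(0, 2.0, size=N)  # a much less accurate value fn

print(rollout(V_star), rollout(V_noisy))
```

In this toy, the greedy-on-V* agent collects the optimal discounted return, and the agent using the noisy estimate can only tie or underperform it. Of course, "return in a chain MDP" is standing in for "cool things the programmers want," and collapsing that gap is exactly where counterexamples might hide.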
(For the record, I do not claim that wireheading is nothing to worry about. I think that wireheading is a plausible but not inevitable failure mode. I don’t currently know of any plan in which there is a strong reason to believe that wireheading definitely won’t happen, except plans that severely cripple capabilities, such that the AGI can’t invent new technology etc. And I agree with you that if AI people continue to do all their work in wirehead-proof cartesian training environments, and don’t even try to think about wireheading, then we shouldn’t expect them to make any progress on the wireheading problem!)