I'm not sure where the disagreement is here, if there is one. I agree that many sophisticated RL setups are possible, including ones that bias the agent toward caring about certain kinds of things (like addictive sweets or wireheading). I also agree that even a basic RL algorithm like TD learning may be enough to steer the cognition of complex, highly capable agents in some cases. What, specifically, are you saying depends on the RL algorithm being weak?
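(For concreteness, here is a minimal sketch of what I mean by "a basic RL algorithm like TD learning": tabular TD(0) value learning on a toy problem. The random-walk environment and all parameter values are illustrative assumptions on my part, not anything from this thread.)

```python
# Minimal tabular TD(0) sketch on a 5-state random walk.
# States 0 and 4 are terminal; reward 1 for reaching state 4, else 0.
# All constants here are illustrative, not from the discussion above.
import random

N_STATES = 5
ALPHA = 0.1   # learning rate
GAMMA = 1.0   # no discounting in this episodic task

V = [0.0] * N_STATES  # value estimates, initialized to zero

for episode in range(1000):
    s = 2  # start in the middle state
    while s not in (0, N_STATES - 1):
        s_next = s + random.choice((-1, 1))  # random-walk policy
        r = 1.0 if s_next == N_STATES - 1 else 0.0
        # TD(0) update: nudge V[s] toward the bootstrapped target r + gamma * V[s'].
        V[s] += ALPHA * (r + GAMMA * V[s_next] - V[s])
        s = s_next

# Interior states converge toward the true values 0.25, 0.5, 0.75.
print([round(v, 2) for v in V])
```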
> if the simplest policy is one that cares about memory registers, then you will probably just end up with that.
I'm generally very skeptical of simplicity arguments. In my experience, that sort of reasoning does not support specific predictions about how neural networks will behave, particularly in the context of RL.