Maybe the argument is “but if it’s never tried the action of optimizing harder for reward, then the RL algorithm won’t be able to reinforce that internal action”?
That’s my reading, yeah, and I agree it’s strained. But yes, the ‘internal action’ of even ‘thinking about how to’ optimise for reward may not be trivial to discover.
Separately, to be reinforced, the actions downstream of that ‘thinking’ have to yield better results than whatever the ‘rest of’ cognition would have produced anyway (it stands to reason that they might, but plausibly heuristics amounting to ‘shaped’ values and reward proxies are easier to get right, hence inner misalignment).
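To make both points concrete, here’s a toy sketch (my own illustration, not anything from the original argument): a REINFORCE-style update over three hypothetical ‘internal strategies’, where strategy 2 stands in for ‘explicitly optimise for reward’. The reward numbers, learning rate, and starting logits are all made up. The two properties it illustrates: an action that is (almost) never sampled gets (almost) no gradient pushing it up, and a sampled action is only strengthened in proportion to its advantage over the baseline set by the rest of the policy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy softmax "policy" over three internal strategies.
# Index 2 ("explicitly optimise for reward") starts with a very low logit,
# so it is essentially never sampled during training.
logits = np.array([1.0, 1.0, -8.0])
rewards = np.array([0.5, 0.6, 1.0])  # hypothetical: direct reward-seeking pays best *if tried*
lr = 0.1

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for step in range(5000):
    probs = softmax(logits)
    a = rng.choice(3, p=probs)          # only sampled actions get credit
    baseline = probs @ rewards          # expected reward under the current policy
    advantage = rewards[a] - baseline   # reinforcement is relative to the rest of cognition
    # REINFORCE: grad of log pi(a) w.r.t. logits is one_hot(a) - probs
    grad_logp = -probs
    grad_logp[a] += 1.0
    logits += lr * advantage * grad_logp

# Strategy 2's probability stays negligible: it never gets sampled, so it is
# never reinforced, even though it would score highest if it were tried.
print(softmax(logits))
```

It’s only a caricature of whatever is going on inside a large model, but it makes the exploration barrier and the advantage-relative reinforcement explicit.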
I agree that once you find ways to directly seek reward you’re liable to get hooked to some extent.
I think this sort of thing is worth trying to get nuance on, but I certainly don’t personally derive much hope from it directly (I think this sort of reasoning may lead to usable insights though).