I agree that this need not produce a policy that is trying to get reward, just one that in fact gets a lot of reward on distribution.
I think this tells us relatively little about the internal cognition, and so is a relatively non-actionable fact (which you probably agree with?). But I want to sort out my thoughts here before laying out more of my intuitions on that.
Related clarifying question:
> To illustrate how this can go wrong, imagine using RL to implement a decentralized autonomous organization (DAO) which maximizes its profit[...]
>
> The shareholders of such a DAO may be able to capture the value it creates as long as they are able to retain effective control over its computing hardware / reward signal.
Do you remember why you thought it important that the shareholders retain control over the reward signal? I can think of several interpretations, but am curious what your take was.