I’ve actually been thinking about the exact same thing recently! I have a post coming up soon on some of the concrete experiments I’d be excited to see regarding inner alignment, and it includes an entry on what happens when you give an RL agent access to its reward as part of its observation.
(Edit: I figured I would just publish the post now so you can take a look at it. You can find it here.)
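For concreteness, here’s a minimal sketch (my own, not something from the post) of one way that setup could look: a wrapper that appends the previous step’s reward to the agent’s observation. It assumes the Gymnasium API and an environment with a 1-D Box observation space; the environment name is just a placeholder.

```python
import numpy as np
import gymnasium as gym
from gymnasium.spaces import Box


class RewardInObservation(gym.Wrapper):
    """Concatenate the last reward received onto each observation."""

    def __init__(self, env):
        super().__init__(env)
        # Extend the observation space by one dimension for the reward.
        low = np.append(env.observation_space.low, -np.inf)
        high = np.append(env.observation_space.high, np.inf)
        self.observation_space = Box(low=low, high=high, dtype=np.float32)

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        # No reward has been received yet at the start of an episode, so use 0.
        return np.append(obs, 0.0).astype(np.float32), info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        # The agent now sees the reward it just received as part of its input.
        obs = np.append(obs, reward).astype(np.float32)
        return obs, reward, terminated, truncated, info


# Example usage (any Box-observation environment would do):
env = RewardInObservation(gym.make("CartPole-v1"))
```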