Thanks for the reply! I feel like a loss term that uses the ground truth reward is “cheating.” Maybe one could get information from how a feature impacts behavior—but in this case it’s difficult to disentangle what actually happens from what the agent “thought” would happen. Although maybe it’s inevitable that to model what a system wants, you also have to model what it believes.