Jemal Young comments on Models Don’t “Get Reward”

Jemal Young 24 Jan 2023 8:54 UTC
2 points
I’m struggling to understand how to think about reward. It sounds like if a hypothetical ML model engages in reward hacking or reward tampering, that would be because the training process selected for that behavior, not because the model is out to “get reward”; it wouldn’t be out to get anything at all. Is that correct?
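
To make the framing I’m asking about concrete, here is a minimal sketch, assuming a toy two-armed bandit trained with a REINFORCE-style update. The setup, the name `train_bandit`, and every parameter value are my own illustrative assumptions, not anything from the post. The thing to notice is that `reward` is never an input to the policy; it only scales the parameter update.

```python
import math
import random


def train_bandit(steps=2000, lr=0.1, seed=0):
    """Toy REINFORCE-style loop on a hypothetical two-armed bandit.

    Arm 1 pays 1.0 and arm 0 pays 0.0 (assumed values for illustration).
    Returns the final probability of choosing arm 1.
    """
    rng = random.Random(seed)
    logit = 0.0  # the policy's only parameter: its preference for arm 1

    for _ in range(steps):
        # The policy maps its parameter to an action probability.
        # It takes no reward as input and has no notion of "wanting" anything.
        p1 = 1.0 / (1.0 + math.exp(-logit))
        action = 1 if rng.random() < p1 else 0

        # The reward is computed outside the policy, by the environment.
        reward = 1.0 if action == 1 else 0.0

        # Gradient of log pi(action) with respect to the logit:
        # (1 - p1) if arm 1 was chosen, -p1 if arm 0 was chosen.
        grad_log_prob = (1.0 - p1) if action == 1 else -p1

        # Reward appears only here, as a weight on the update.
        logit += lr * reward * grad_log_prob

    return 1.0 / (1.0 + math.exp(-logit))


if __name__ == "__main__":
    print(f"final P(arm 1) = {train_bandit():.3f}")
```

If arm 1 stood in for an unintended exploit, the same update would reinforce it in exactly the same way: the behavior becomes more probable because the update selected for it, and nothing in the loop represents the policy seeking reward.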