From Leike's post: "However, if we understand our systems' incentives (i.e. reward/loss functions) we can still make meaningful statements about what they'll try to do."
I think this frame breaks down for AGIs. It works for dogs (and I doubt doing this to dogs is a good thing), but not for sapient people.
Just as reward is not a goal, it's also not an incentive. Reward changes the model in ways the model doesn't choose. If there were no other way to learn, there would be some incentive to learn gradient hacking, but even then, accepting reward would be a cost of learning, not something the model values. Learning through a better-designed channel, avoiding reward entirely, would be preferable.
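To make the distinction concrete, here is a minimal sketch (my own illustration, not anything from Leike's post or OpenAI's plan) of a standard policy-gradient-style update. The reward only ever enters as a coefficient in a loss that the outer training loop uses to overwrite the model's parameters; nothing in the model's forward pass represents, evaluates, or chooses that update.

```python
import torch

# Toy policy: maps a 4-dim observation to logits over 2 actions.
policy = torch.nn.Linear(4, 2)
optimizer = torch.optim.SGD(policy.parameters(), lr=1e-2)

obs = torch.randn(4)                       # stand-in observation
logits = policy(obs)
dist = torch.distributions.Categorical(logits=logits)
action = dist.sample()                     # the model's "choice" is only the action
reward = 1.0                               # supplied by the environment, not the model

# REINFORCE-style loss: reward scales the gradient of the log-probability.
loss = -reward * dist.log_prob(action)

optimizer.zero_grad()
loss.backward()
optimizer.step()                           # parameters are overwritten by the outer loop;
                                           # the model never sees or consents to this update
```

In this sense the reward is a training signal acting on the model from outside, not something the model is offered and pursues.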
Given the size of the project and the commitment behind it, and given that OpenAI acknowledges it might hit walls and will try different approaches, one can hope that investigating better behavioral systems for an AGI will be among them.