We don’t need a function of sensory input which is safe to maximize; that’s not the function of the reward signal. Reward chisels cognition. Reward is not necessarily, nor do we want it to be, a ground-truth signal about alignment.
I’m confused about this statement. How can reward be unnecessary as a ground-truth signal about alignment? Especially if “reward chisels cognition”?
How can reward be unnecessary as a ground-truth signal about alignment? Especially if “reward chisels cognition”?
Reward’s purpose isn’t to demarcate “this was good by my values.” That’s one use, and it often works, but it isn’t intrinsic to reward’s mechanistic function. Reward develops certain kinds of cognition (policy network circuits). Consider, for example, reward-shaping a dog to stand on its hind legs. I don’t reward the dog because I intrinsically value its front paws being slightly off the ground for a moment. I reward the dog at that moment because doing so helps develop the stand-up cognition in the dog’s mind.
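A toy sketch of the dog example, under loudly stated assumptions (none of this is from the thread): a hypothetical agent whose only behavior is choosing how high to lift its front paws (0 = all fours, 10 = upright), with a softmax policy trained by the exact expected policy gradient rather than sampled episodes, so the run is deterministic. The trainer rewards successive approximations: any height at or above a criterion, which is raised once it is met most of the time. The names `HEIGHTS`, `shaping_reward`, and `train`, and all numeric settings, are illustrative.

```python
import math

# Hypothetical action space: how high the "dog" lifts its front paws.
HEIGHTS = list(range(11))  # 0 = all fours, 10 = standing upright


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mass_at_or_above(probs, threshold):
    # Probability that the policy lifts its paws at least this high.
    return sum(p for h, p in zip(HEIGHTS, probs) if h >= threshold)


def shaping_reward(height, threshold):
    # Reward any height meeting the current criterion. A height of 3 is
    # rewarded only because reinforcing it moves the policy toward standing,
    # not because the trainer values height 3 for its own sake.
    return 1.0 if height >= threshold else 0.0


def train(steps=2000, lr=1.0, advance_at=0.8):
    logits = [0.0 for _ in HEIGHTS]  # start from a uniform policy
    threshold = 1
    for _ in range(steps):
        probs = softmax(logits)
        # Successive approximation: raise the criterion once it is met
        # with probability >= advance_at.
        while threshold < HEIGHTS[-1] and mass_at_or_above(probs, threshold) >= advance_at:
            threshold += 1
        rewards = [shaping_reward(h, threshold) for h in HEIGHTS]
        expected = sum(p * r for p, r in zip(probs, rewards))
        # Exact gradient of E[reward] for a softmax policy:
        # dE[r]/d logit_b = p_b * (r_b - E[r]).
        logits = [z + lr * p * (r - expected) for z, p, r in zip(logits, probs, rewards)]
    return softmax(logits), threshold
```

Calling `train()` returns the final policy and criterion. Each intermediate height earns reward only while it satisfies the current criterion, so the reward it once received is a training tool for building the standing behavior, not a verdict that the trainer values that height.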