Thanks for the post. Unfortunately, I mostly feel confused by these framings. I think that confusion would largely be fixed by detailed, concrete examples of each framing being applied. The descriptions felt very abstract, and I mostly don’t think I understood what you were trying to communicate.
During deployment, policies’ goals are very robust to additional reward feedback. It requires penalizing many examples of misbehavior to change their goals.
I think that if/when we get to certain scary points (e.g., the AI proposes a plan for building a new factory, and we’re worried that the overt proposal hides a method by which the AI will exfiltrate itself to run on insecure servers), reward feedback will not matter much, because the AI will be fully robust against it. If the AI didn’t want to be updated by us anymore, it would probably just modify its own training process to secretly nullify the cognitive updates triggered by reward.
Also see the example of humans: even though humans are very sample-efficient, we have some very strongly-ingrained goals (like the desire to survive, or drives learned during childhood), which we often retain even after a significant amount of negative reinforcement.
I’m guessing that humans don’t get negative reinforcement events which would cause their credit assignment to pinpoint and downweight their desire to survive.
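To gesture at what I mean by credit assignment not pinpointing a drive, here’s a minimal toy sketch (my own illustration, not anything from the post): a softmax policy over two invented “drives,” trained with a REINFORCE-style update, where only one drive ever produces penalized behavior. The point is just mechanical: a drive that is never implicated by credit assignment never gets downweighted by the negative reward, no matter how much of it there is.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical internal drives, represented as action logits:
# index 0 = "explore" (the drive that actually produces penalized behavior),
# index 1 = "survive" (a drive that never produces penalized behavior).
logits = np.array([1.0, 1.0])
lr = 0.5

for _ in range(500):
    probs = np.exp(logits) / np.exp(logits).sum()
    action = rng.choice(2, p=probs)        # sample an action from the policy
    reward = -1.0 if action == 0 else 0.0  # only "explore" behavior is penalized

    # REINFORCE update: grad of log pi(action) w.r.t. the logits is
    # one_hot(action) - probs. A negative reward therefore pushes down only
    # the logit of the action that was actually taken (and slightly raises
    # the alternatives).
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0
    logits += lr * reward * grad_log_pi

probs = np.exp(logits) / np.exp(logits).sum()
print(probs)  # the "explore" drive collapses; "survive" is never the thing
              # that credit assignment pins the penalty on
```

Obviously this is not a claim about how human learning actually works; it’s only meant to illustrate why “retained after lots of negative reinforcement” doesn’t imply “robust to reward feedback that actually targets the goal.”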