“While the overseer might very well try to determine how effective its own actions will be at achieving long-term goals, it never evaluates how effective the model’s actions will be.”
Evan, do you agree that for the model to imitate the actions of the supervisor, it would be useful to mimic some of the thought processes the supervisor uses when generating those actions?
In other words, if HCH is pursuing goal X, what feature of myopic training selects for a model that is internally thinking “I’m going to try to be as close to HCH as possible in this timestep, which involves reasoning about how HCH would pursue X”, versus a model that’s thinking “I’m going to pursue goal X”? (To the extent these are different, which I’m still confused about).
I’m also confused.