I expect some form of gradient hacking to be convergently learned much earlier than the details of the labeling process. Online SSL incentivizes the agent to model its own shard activations (so it can better predict future data), and the concept of human value drift (“addiction”) is likely accessible from pretraining in the same way “diamond” is.
The labeling process, on the other hand, is something the agent has little information about; I expect it to be more complicated, and it lacks the convergent benefit of predicting future behavior that reflectivity has.
(You could even argue that human error is good here, if it correlates more strongly with the human “diamond” abstraction the agent has from pretraining. This probably doesn’t extend to the “human values” case we care about, but I thought I’d mention it as an interesting thought.)