Yep, that’s the big one. It suggests that NNs aren’t doing anything very weird when we subject them to OOD inputs, which is evidence against the production of inner optimizers, or at least against misaligned ones.
I actually now think the OOD paper is pretty weak evidence about inner optimizers.
I think the OOD paper tells you what happens as the low-level and mid-level features stop reliably/coherently firing because you’re adding in so much noise. Like, if my mental state got increasingly noised across all modalities, I think I’d probably adopt some constant policy too, because none of my coherent circuits/shards would be properly interfacing with the other parts of my brain. But I don’t think that tells you much about alignment-relevant OOD.
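To make the "constant policy under noise" picture concrete, here's a minimal probe in the spirit of that reading (my own sketch, not the paper's actual setup; the model, the random stand-in batch, and the noise scales are all placeholder assumptions): feed a pretrained classifier increasingly noised inputs and check how concentrated its predictions become.

```python
# Toy probe (sketch only, not the OOD paper's method): add growing Gaussian
# noise to a pretrained classifier's inputs and measure how concentrated its
# predictions are. The reading above predicts outputs drift toward a
# near-constant choice as coherent features stop firing.
import torch
import torchvision.models as models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()

torch.manual_seed(0)
# Stand-in batch of random "images"; real in-distribution data would be better.
images = torch.rand(32, 3, 224, 224)

with torch.no_grad():
    for sigma in [0.0, 0.5, 1.0, 2.0, 4.0, 8.0]:
        noisy = images + sigma * torch.randn_like(images)
        preds = model(noisy).argmax(dim=1)
        # Fraction of the batch assigned to the single most common class;
        # 1.0 would mean a fully constant policy over this batch.
        top_frac = preds.bincount(minlength=1000).max().item() / len(preds)
        print(f"sigma={sigma:>4}: most common class covers {top_frac:.0%} of the batch")
```

The point of the probe is just that "collapses to a constant output under heavy input noise" is cheap to check and doesn't require anything optimizer-like inside the network, which is why I read it as weak evidence either way.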