I actually now think the OOD paper is pretty weak evidence about inner optimizers.
I think the OOD paper tells you what happens as the low-level and mid-level features stop reliably/coherently firing because you’re adding in so much noise. Like, if my mental state got increasingly noised across all modalities, I think I’d probably adopt some constant policy too, because none of my coherent circuits/shards would be properly interfacing with the other parts of my brain. But I don’t think that tells you much about alignment-relevant OOD behavior.