(EE)CS undergraduate at UC Berkeley
Current intern at CHAI
Previously: high-level interpretability with @Jozdien, SLT with @Lucius Bushnaq, robustness with Kellin Pelrine
I often change my mind and don’t necessarily endorse things I’ve written in the past
It's hard for me to imagine a world where we really have internals-based methods that are “extremely well supported theoretically and empirically,” so I notice that I should take a second to try to imagine such a world before accepting the claim that internals-based evidence wouldn't convince the relevant people...
Today, the relevant people probably wouldn’t do much in response to the interp team saying something like: “our deception SAE is firing when we ask the model bio risk questions, so we suspect sandbagging.”
But I wonder how much of this response is a product of a background assumption that modern-day interp tools are finicky and you can’t always trust them. So in a world where we really have internals-based methods that are “extremely well supported theoretically and empirically,” I wonder if it’d be treated differently?
(I.e. a culture that could respond more like: “this interp tool is a good indicator of whether or not the model is deceptive, and just because you can get the model to say something bad doesn't mean it's actually bad” or something? Kinda like the reactions to the o1 Apollo result)
Edit: Though maybe this culture change would take too long to be relevant.