It seems to me like the sort of interpretability work you’re pointing at is mostly bottlenecked by not having good MVPs of anything that could plausibly be directly scaled up into a useful product, as opposed to being bottlenecked on not having enough scale.
So, insofar as this automation will help people iterate faster, fair enough, but otherwise, I don’t really see this as the bottleneck.
Yeah, I’m unsure if I can tell any ‘pivotal story’ very easily (e.g. I’d still be pretty skeptical of enumerative interp even with GPT-5-MAIA). But I do think, intuitively, GPT-5-MAIA might e.g. make ‘catching AIs red-handed’ using methods like in this comment significantly easier/cheaper/more scalable.
But I do think, intuitively, GPT-5-MAIA might e.g. make ‘catching AIs red-handed’ using methods like in this comment significantly easier/cheaper/more scalable.
Notably, the mainline approach for catching doesn’t involve any internals usage at all, let alone labeling a bunch of internals.
I agree that this model might help in performing various input/output experiments to determine what made a model do a given suspicious action.
Notably, the mainline approach for catching doesn’t involve any internals usage at all, let alone labeling a bunch of things.
This was indeed my impression (except for potentially using steering vectors, which I think are mentioned in one of the sections in ‘Catching AIs red-handed’), but I think not using any internals might be overconservative / might increase the monitoring / safety tax too much (I think this is probably true more broadly of the current control agenda framing).