Yeah, I’m unsure if I can tell any ‘pivotal story’ very easily (e.g. I’d still be pretty skeptical of enumerative interp even with GPT-5-MAIA). But I do think, intuitively, GPT-5-MAIA might e.g. make ‘catching AIs red-handed’ using methods like in this comment significantly easier/cheaper/more scalable.
But I do think, intuitively, GPT-5-MAIA might e.g. make ‘catching AIs red-handed’ using methods like in this comment significantly easier/cheaper/more scalable.
Noteably, the mainline approach for catching doesn’t involve any internals usage at all, let alone labeling a bunch of internals.
I agree that this model might help in performing various input/output experiments to determine what made a model do a given suspicious action.
Noteably, the mainline approach for catching doesn’t involve any internals usage at all, let alone labeling a bunch of things.
This was indeed my impression (except for potentially using steering vectors, which I think are mentioned in one of the sections in ‘Catching AIs red-handed’), but I think not using any internals might be overconservative / might increase the monitoring / safety tax too much (I think this is probably true more broadly of the current control agenda framing).
Yeah, I’m unsure if I can tell any ‘pivotal story’ very easily (e.g. I’d still be pretty skeptical of enumerative interp even with GPT-5-MAIA). But I do think, intuitively, GPT-5-MAIA might e.g. make ‘catching AIs red-handed’ using methods like in this comment significantly easier/cheaper/more scalable.
Noteably, the mainline approach for catching doesn’t involve any internals usage at all, let alone labeling a bunch of internals.
I agree that this model might help in performing various input/output experiments to determine what made a model do a given suspicious action.
This was indeed my impression (except for potentially using steering vectors, which I think are mentioned in one of the sections in ‘Catching AIs red-handed’), but I think not using any internals might be overconservative / might increase the monitoring / safety tax too much (I think this is probably true more broadly of the current control agenda framing).