Bogdan Ionut Cirstea comments on Bogdan Ionut Cirstea’s Shortform

Bogdan Ionut Cirstea 26 Apr 2024 10:24 UTC
3 points
0
Yeah, I’m unsure if I can tell any ‘pivotal story’ very easily (e.g. I’d still be pretty skeptical of enumerative interp even with GPT-5-MAIA). But I do think, intuitively, GPT-5-MAIA might e.g. make ‘catching AIs red-handed’ using methods like in this comment significantly easier/cheaper/more scalable.
What links here?
- Bogdan Ionut Cirstea's comment on Bogdan Ionut Cirstea’s Shortform by Bogdan Ionut Cirstea (29 Apr 2024 13:59 UTC; 4 points)
- ryan_greenblatt 26 Apr 2024 16:36 UTC
  4 points
  0
  Parent
  
  But I do think, intuitively, GPT-5-MAIA might e.g. make ‘catching AIs red-handed’ using methods like in this comment significantly easier/cheaper/more scalable.
  
  Notably, the mainline approach for catching doesn’t involve any internals usage at all, let alone labeling a bunch of internals.
  
  I agree that this model might help in performing various input/output experiments to determine what made a model do a given suspicious action.
  - Bogdan Ionut Cirstea 26 Apr 2024 18:02 UTC
    3 points
    0
    Parent
    Notably, the mainline approach for catching doesn’t involve any internals usage at all, let alone labeling a bunch of things.
    This was indeed my impression (except for potentially using steering vectors, which I think are mentioned in one of the sections of ‘Catching AIs red-handed’). But I think not using any internals might be overconservative and might increase the monitoring/safety tax too much (I think this is probably true more broadly of the current control agenda framing).
    What links here?
    - Bogdan Ionut Cirstea's comment on Schelling game evaluations for AI control by Olli Järviniemi (10 Oct 2024 14:42 UTC; 2 points)