ryan_greenblatt comments on Bogdan Ionut Cirstea’s Shortform

ryan_greenblatt 25 Apr 2024 17:13 UTC
4 points
0
It seems to me that the sort of interpretability work you’re pointing at is mostly bottlenecked by the lack of good MVPs of anything that could plausibly be scaled up directly into a useful product, rather than by a lack of scale.

So, insofar as this automation helps people iterate faster, fair enough, but otherwise I don’t really see this as the bottleneck.
- Bogdan Ionut Cirstea 26 Apr 2024 10:24 UTC
  3 points
  0
  Yeah, I’m unsure if I can tell any ‘pivotal story’ very easily (e.g. I’d still be pretty skeptical of enumerative interp even with GPT-5-MAIA). But I do think, intuitively, GPT-5-MAIA might e.g. make ‘catching AIs red-handed’ using methods like in this comment significantly easier/cheaper/more scalable.
  What links here?
  - Bogdan Ionut Cirstea's comment on Bogdan Ionut Cirstea’s Shortform by Bogdan Ionut Cirstea (29 Apr 2024 13:59 UTC; 4 points)
  - ryan_greenblatt 26 Apr 2024 16:36 UTC
    4 points
    0
    
    But I do think, intuitively, GPT-5-MAIA might e.g. make ‘catching AIs red-handed’ using methods like in this comment significantly easier/cheaper/more scalable.
    
    Notably, the mainline approach for catching doesn’t involve any internals usage at all, let alone labeling a bunch of internals.
    
    I agree that this model might help in performing various input/output experiments to determine what made a model do a given suspicious action.
    - Bogdan Ionut Cirstea 26 Apr 2024 18:02 UTC
      3 points
      0
      Notably, the mainline approach for catching doesn’t involve any internals usage at all, let alone labeling a bunch of things.
      This was indeed my impression (except for potentially using steering vectors, which I think are mentioned in one of the sections in ‘Catching AIs red-handed’), but I think not using any internals might be overconservative / might increase the monitoring / safety tax too much (I think this is probably true more broadly of the current control agenda framing).
      What links here?
      - Bogdan Ionut Cirstea's comment on Schelling game evaluations for AI control by Olli Järviniemi (10 Oct 2024 14:42 UTC; 2 points)