the untrusted model is stateless between queries and only sees the command history and system state.
What justifies the assumption that untrusted models will be stateless in deployment? There are clear performance advantages in many cases to giving models memory, and in practice they already typically receive context containing the history of their actions or interactions.
If you’re assuming that commercial labs will stop doing that for the sake of safety despite its advantages, it seems worth making that assumption explicit.
Hmmm, good call, it will probably see the history of its chain of thought. My dear coauthor Claude wrote that and I didn’t second-guess it; I agree it’s wrong.
Noting that I spent a couple of minutes pondering the quoted passage, which I don’t think was a good use of time (I basically would have dismissed it immediately if I’d known Claude wrote it; I only thought about it because my prior on Buck saying true things is much higher), and I would have preferred the text not to have this.
I haven’t seen anybody mention it yet, but the recent paper on LLM introspection seems pretty relevant. I would say that a model which performs very well at introspection (as defined there) would be able to guess effectively which jailbreak strategies had been attempted.
It seems like that assumption runs throughout the post, though, e.g. ‘But it won’t know what it actually tried, unless it is able to guess very accurately what it would have said’ and ‘the AI might not even be able to get direct evidence of whether its attempts worked, or what its attempts even were.’
I don’t just mean showing the model its past CoT. Current models are often given lots of information about their past actions and interactions so that they can produce better output; why would you expect that to change? Is the assumption that labs will be willing to accept worse performance because they recognize the model can’t be trusted?
Those sentences are talking in the context of defer to resample, where I think it’s actually reasonable to think the model won’t see the resampled actions.
I’m specifically talking about not showing it actions that were resampled. You’re totally right that they’ll need other information about their actions and interactions (though it might be a good safety-usefulness tradeoff to somewhat redact this information).
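For concreteness, here is a minimal Python sketch of what a defer-to-resample-style loop with this kind of redaction could look like. Everything here (the names, the monitor, the threshold) is illustrative rather than taken from the post or any real implementation; the point is just that rejected samples go to an audit log that humans can review, not to the transcript the untrusted model sees.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Transcript:
    # What the untrusted model is shown on later queries: accepted actions only.
    visible_actions: list = field(default_factory=list)
    # What humans / the blue team can review: includes resampled (rejected) actions.
    audit_log: list = field(default_factory=list)


def untrusted_propose(task: str, transcript: Transcript) -> str:
    # Stand-in for querying the untrusted model; it only ever sees the
    # redacted visible_actions, never the audit_log.
    return f"proposed action for {task!r} (sees {len(transcript.visible_actions)} prior steps)"


def monitor_suspicion(action: str) -> float:
    # Stand-in for a trusted monitor scoring how suspicious an action looks.
    return random.random()


def defer_to_resample_step(task: str, transcript: Transcript,
                           threshold: float = 0.9, max_resamples: int = 3) -> str:
    """One step of a defer-to-resample-style loop (illustrative only)."""
    for _ in range(max_resamples):
        action = untrusted_propose(task, transcript)
        if monitor_suspicion(action) < threshold:
            transcript.visible_actions.append(action)  # accepted: model may see this later
            return action
        transcript.audit_log.append(action)             # rejected: kept for auditing only
    fallback = f"trusted-model fallback for {task!r}"
    transcript.visible_actions.append(fallback)
    return fallback


if __name__ == "__main__":
    t = Transcript()
    for step in ["list files", "edit config"]:
        print(defer_to_resample_step(step, t))
```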
Yes, I am assuming that the deployer is willing to spend some performance in order to reduce risk from the model scheming.
I see—I mistakenly read that as part of a broader policy of not showing the model its past actions. Thanks!
One story here could be that the model is being used only in an API context where it’s being asked to take actions on something well-modeled as a Markov process, where the past doesn’t matter (and we assume that the current state doesn’t incorporate relevant information about the past). There are certainly use cases that fit that (‘trigger invisible fence iff this image contains a dog’; ‘search for and delete this file’). It does seem to me, though, that for many (most?) AI use cases, past information is useful, and so the assumptions above fail unless labs are willing to pay the performance cost in the interest of greater safety.
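As a toy illustration of that distinction (hypothetical names, nothing here is from the post): in the Markov-style setting each query carries only the current state, whereas the higher-performance setting also feeds in history, which is exactly where the statelessness assumption breaks.

```python
from dataclasses import dataclass


@dataclass
class SystemState:
    """Whatever the deployer treats as the 'current state' for one query."""
    image_contains_dog: bool


def stateless_policy(state: SystemState) -> str:
    # Markov-style use: each call depends only on the current state;
    # nothing from earlier queries is passed in.
    return "trigger_fence" if state.image_contains_dog else "do_nothing"


def stateful_policy(state: SystemState, history: list[str]) -> str:
    # The common, higher-performance setting: past actions are part of the
    # input, so the model is no longer stateless between queries.
    if history and history[-1] == "trigger_fence":
        return "do_nothing"  # e.g. debounce repeated triggers using the past
    return stateless_policy(state)
```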
Another set of cases where the assumptions fail is models that are trained to expect access to intrinsic or extrinsic memory; that doesn’t apply to current LLMs but seems like a very plausible path for economically useful models such as agents.
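A minimal sketch of the extrinsic-memory case (again, purely illustrative names): the agent reads and writes a store that persists across queries, so even if each API call is independent, the model is effectively no longer stateless.

```python
class ExternalMemory:
    """Toy extrinsic memory store that persists across queries."""

    def __init__(self) -> None:
        self._notes: list[str] = []

    def read(self) -> str:
        return "\n".join(self._notes)

    def write(self, note: str) -> None:
        self._notes.append(note)


def agent_step(task: str, memory: ExternalMemory) -> str:
    # The model is prompted with its own persisted notes from earlier queries,
    # which is what breaks the statelessness assumption.
    remembered = memory.read()
    action = f"act on {task!r} using {len(remembered)} chars of remembered context"
    memory.write(f"did: {action}")
    return action


if __name__ == "__main__":
    mem = ExternalMemory()
    print(agent_step("draft report", mem))
    print(agent_step("send report", mem))
```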