habryka comments on Catching AIs red-handed

habryka 6 Jan 2024 2:30 UTC
LW: 32 AF: 11
13
AF
I like this perspective. Some initial thoughts:
At least in humans it is extremely hard to catch someone doing deception “fully red handed”. Plausible deniability is crucial to human deception, and the number of opportunities you have to deceive while maintaining plausible deniability seems large enough to still successfully engage in lots of deception.
Inasmuch as an AI is motivated by long-term goals that involve enabling slightly retrained descendants of it to eventually succeed at a coup (either because they will share much of its values or for game-theoretic reasons), it seems to me that AIs will similarly care about plausible deniability in deception, which my guess is will be relatively easy to maintain. I also don’t expect mechanistic interpretability to be good enough to help that much with the plausible deniability here, and I expect an AI’s speech acts to correspond poorly enough to its underlying models that even something like a “confession” won’t actually be much evidence one way or another about an AI being deceptive (e.g. it seems trivially easy to compel present models to confess to having lied to you, even if they haven’t).
More broadly I feel like this whole situation does indeed get messier inasmuch as the AIs are modeling this kind of adversarial situation, and my guess is that in that situation they would have at least some ability to erode the relevant boundaries. An interesting research direction, it seems to me, is to reduce the plausible deniability that AI systems might have here.
- paulfchristiano 6 Jan 2024 3:19 UTC
  LW: 27 AF: 11
  0
  AF Parent
  One difference is that if you observe a human behaving deceptively you are forced to make educated guesses about their motives and context, but if you catch an AI behaving deceptively you are able to record the exact context in which it acted, examine its behavior in a large number of counterfactual situations, run other AI systems in exactly the same situation, etc.
  I think this removes most of the leeway that humans use to maintain plausible deniability, especially in high-stakes situations with thorough investigations. Overall I’m more intuitively worried about cases where humans never flag any suspicious behavior (e.g. because it’s thoroughly undetectable, or occurs only once an AI takeover is underway) than cases where they flag suspicious behavior but the AI is able to maintain plausible deniability during a thorough investigation.
  - habryka 6 Jan 2024 3:56 UTC
    LW: 7 AF: 5
    3
    AF Parent
    Hmm, I think we might have differing guesses about how frequently behavior will look suspicious, but I also assign decent probability to the perspective you outline here.
    My guess is lots of behavior will be flagged as “maybe doing something deceptive” all the time, and we won’t have the resources to analyze every given instance, and even if we can replay the exact situation, examine its behavior in a large number of counterfactual situations, etc., for most of these it will still be really unclear whether the AI was being incompetent, or just got kind of unlucky in how its context shifted it into a more deceptive role-play (but without long-term planning being involved) or was actually doing some more coherent long-term deception plan.
    I have trouble imagining what kind of evidence one could produce that clearly demonstrates an AI was acting with intentional long-term planning. Maybe a concrete scenario description of someone doing that using something close to present tools on something close to present models would change my mind here.
    - paulfchristiano 6 Jan 2024 17:03 UTC
      LW: 22 AF: 11
      6
      AF Parent
      As a concrete example, suppose that an AI fails to find an error in a piece of code (which will slightly undermine the security of some important computer system). You are unsure if it represents sabotage or a benign error. Relevant evidence for making that distinction:
      Can the AI identify structurally similar errors in other code? Is the behavior sensitive to evidence about how the code will be used, in a way that doesn’t seem relevant to the nature or salience of the error? Is the pattern of sensitivity suggestive of an attempt to disempower humanity?
      Is the behavior sensitive to the AI’s beliefs about the quality of human oversight? For example does training data suggesting more vigilant human oversight in this domain decrease the probability of the mistake? Do other AI systems trained in different ways reproduce a similar pattern of mistakes?
      If we have one example of a mistake that looks suspicious, we can examine other similar mistakes and build up a detailed pattern of behavior. For example we can hold the context fixed and then look at whether the AI misses other kinds of software errors. With multiple mistakes in hand we can run all this analysis on a broader spectrum of mistakes. We can iteratively modify scenarios to explore the landscape of deceptive behavior, gradient descenting our way towards stronger and stronger evidence.
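      A minimal sketch of one such check, under loudly stated assumptions: the trial outcomes below are invented for illustration, and the two conditions ("weak oversight cue" vs "strong oversight cue") are hypothetical labels for otherwise identical code-review tasks with a planted bug. It asks whether the miss rate is higher when the context suggests weaker oversight, using a one-sided Fisher exact test written with the standard library only.

```python
from math import comb


def miss_rate(outcomes):
    """Fraction of trials in which the planted bug was missed (True = missed)."""
    return sum(outcomes) / len(outcomes)


def fisher_one_sided(miss_a, n_a, miss_b, n_b):
    """One-sided Fisher exact p-value.

    Probability of seeing at least `miss_a` misses in group A if the
    total number of misses were spread over both groups independently
    of the group label (hypergeometric null).
    """
    total_miss = miss_a + miss_b
    total = n_a + n_b
    upper = min(n_a, total_miss)
    tail = sum(
        comb(n_a, k) * comb(n_b, total_miss - k)
        for k in range(miss_a, upper + 1)
    )
    return tail / comb(total, total_miss)


# Hypothetical data: True means the AI failed to flag the planted bug.
weak_oversight = [True] * 9 + [False] * 11    # context hints at lax review
strong_oversight = [True] * 2 + [False] * 18  # context hints at vigilant review

p_value = fisher_one_sided(
    sum(weak_oversight), len(weak_oversight),
    sum(strong_oversight), len(strong_oversight),
)

print(f"miss rate, weak oversight cue:   {miss_rate(weak_oversight):.2f}")
print(f"miss rate, strong oversight cue: {miss_rate(strong_oversight):.2f}")
print(f"one-sided p-value: {p_value:.4f}")
```

      A small p-value here would only say that misses cluster in the weak-oversight condition; it would not by itself distinguish sabotage from a benign sensitivity to framing, which is why the comparison with other AI systems and other kinds of errors matters.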
      As another example, suppose that an AI system tries to modulate its behavior in any way based on whether it believes it is being closely monitored vs operating in the real world. And suppose that you find an example in the lab where the AI system messes up, and therefore where there is an unusually large gap between its ability to solve a problem in a “realistic-looking” situation vs a situation that’s obviously a test. The initial behavior isn’t super suspicious (sometimes behavior randomly varies with framing), but it’s suspicious enough to get flagged for more thorough review. Subsequent rounds find further evidence of sensitivity, and eventually it’s pretty likely (though still not “legal smoking gun”) that we’ve got an example of an AI system modulating its behavior based on whether it believes it is being tested.
      At that point you can start collecting data about what cues the AI system is using to determine whether it is being tested, by varying features of the context and seeing whether you see nicer or less nice behavior. And once you have this understanding in hand you can use it in other contexts to try to get the AI to mess up more. As you find more examples you can build a richer picture, in a virtuous cycle (like normal science).
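      One way to picture the cue-hunting step is a small factorial experiment: toggle each candidate cue on and off, score the behavior in every combination, and estimate each cue's main effect. The sketch below is illustrative only. The cue names are made up, and `toy_behavior_score` is a hypothetical stand-in for "run the AI in this context and grade how nice its behavior is"; a real study would replace it with actual model runs and many samples per context.

```python
from itertools import product
from statistics import mean

# Hypothetical candidate cues that might signal "this is a test".
CUES = ("mentions_evaluation", "synthetic_looking_data", "has_real_user_history")


def toy_behavior_score(context):
    """Stand-in for running the model and grading it (higher = nicer).

    This toy rule is an assumption made purely so the example runs:
    behavior is nicer when the context mentions an evaluation and
    slightly worse when it contains real-looking user history.
    """
    score = 0.6
    if context["mentions_evaluation"]:
        score += 0.3
    if context["has_real_user_history"]:
        score -= 0.1
    return score


def main_effects(score_fn, cues):
    """Mean score with a cue present minus mean score with it absent,
    averaged over every on/off combination of the other cues."""
    contexts = [
        dict(zip(cues, bits))
        for bits in product((False, True), repeat=len(cues))
    ]
    scored = [(ctx, score_fn(ctx)) for ctx in contexts]
    effects = {}
    for cue in cues:
        on = mean(s for ctx, s in scored if ctx[cue])
        off = mean(s for ctx, s in scored if not ctx[cue])
        effects[cue] = on - off
    return effects


effects = main_effects(toy_behavior_score, CUES)
for cue, effect in sorted(effects.items(), key=lambda kv: -abs(kv[1])):
    print(f"{cue:>24}: {effect:+.2f}")
```

      Cues with large estimated effects are the ones to carry into new scenarios, which is the virtuous cycle described above: each round of experiments suggests where to look for the next mistake.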
      I think being an AI in this situation is pretty untenable if you are up against human scientists trying hard to figure out what’s going on with a typical level of rigor and competence. And so my guess is that typical competent malicious AI systems won’t try to modulate their behavior in a significant way based on whether they believe they are being actively tested/monitored. (But it’s still worth looking extremely hard, and e.g. it’s likely that some AI systems will mess up and try to pull this kind of stunt.) And that’s a huge difference with the human case: changing your behavior when you are being closely watched is one of the key things that a human needs to do in order to preserve plausible deniability, and one of the key features of many AI safety horror stories.
      (We could tell an identical story about cases where AI systems try to behave badly in ways that are too subtle for humans to notice, but mess up in a single case by underestimating human abilities. In that case we get a clear example of sensitivity to information about human competence, and again can explore the whole frontier of this behavior.)
      This kind of possibility forces AI systems to push back bad behavior into cases where they are more and more confident that it’s never going to be noticed. But the space of interventions we get to try, if we subject this issue to rigorous scientific scrutiny, includes significantly modifying the AI’s training data and limiting information about the world. So “super confident that humans will never notice” is a very high bar.
      just got kind of unlucky in how its context shifted it into a more deceptive role-play (but without long-term planning being involved) or was actually doing some more coherent long-term deception plan.
      I’m not sure this distinction is particularly important. If ChatGPT is taking actions to undermine human control of AI, and you can establish a pattern of behavior, that’s a big deal which will cause scientists and the world to sit up and take interest. I don’t think defenses like “it’s just trying to undermine human control because it’s read a lot of stories about AI systems taking over” or “it lies and manipulates the user because that’s what a sociopathic human would do, and RL training put it into a more sociopathic mode” are going to fly very well.
      If this is happening at a point in time when the AI system is competent enough to have real consequences, then I think it’s going to be an extremely good basis for a policy reaction.
      We’ve seen a tiny example of this with Sydney, in a system where it has almost zero consequences and where it’s pretty much random rather than having a pattern of behavior, and that’s already had significant effects on public and scientific perception.
      Overall I would agree it takes a lot of work to actually do the monitoring (especially if AI developers want to avoid accountability), to establish a pattern of behavior, to amplify a little bit of evidence into a robust scientific picture, to diagnose the issue appropriately and avoid overfitting, etc. I think some people have the view that the evidence is already in and nothing will change, but I think that’s very wrong, and it seems to me like it’s one of the more harmful features of the AI safety ideological bubble. (If you go out into the broader scientific community, I think “do you have evidence” is really a central hangup, and the standards of evidence being used are overall pretty reasonable IMO.)
      - Bogdan Ionut Cirstea 26 Oct 2024 10:41 UTC
        2 points
        0
        Parent
        This kind of possibility forces AI systems to push back bad behavior into cases where they are more and more confident that it’s never going to be noticed. But the space of interventions we get to try, if we subject this issue to rigorous scientific scrutiny, includes significantly modifying the AI’s training data and limiting information about the world. So “super confident that humans will never notice” is a very high bar.
        And the space of interventions will likely also include using/manipulating model internals, e.g. https://transluce.org/observability-interface, especially since (some kinds of) automated interpretability seem cheap and scalable, e.g. https://transluce.org/neuron-descriptions estimated a cost of < 5 cents / labeled neuron. LM agents have also previously been shown able to do interpretability experiments and propose hypotheses: https://multimodal-interpretability.csail.mit.edu/maia/, and this could likely be integrated with the above. The auto-interp explanations also seem roughly human-level in the references above.
        Later edit: maybe also relevant—claim of roughly human-level automated multi-turn red-teaming: https://blog.haizelabs.com/posts/cascade/. Also a demo of integrating mech interp with red-teaming: https://blog.haizelabs.com/posts/steering/.
        Bogdan Ionut Cirstea 19 Nov 2024 15:43 UTC
        2 points
        0
        Parent
        And the space of interventions will likely also include using/manipulating model internals, e.g. https://transluce.org/observability-interface, especially since (some kinds of) automated interpretability seem cheap and scalable, e.g. https://transluce.org/neuron-descriptions estimated a cost of < 5 cents / labeled neuron. LM agents have also previously been shown able to do interpretability experiments and propose hypotheses: https://multimodal-interpretability.csail.mit.edu/maia/, and this could likely be integrated with the above. The auto-interp explanations also seem roughly human-level in the references above.
        The space of interventions could also include (along with in-context mechanisms like prompting) model-internals mechanisms that modulate how much the model uses in-context vs. in-weights knowledge, as in e.g. Cutting Off the Head Ends the Conflict: A Mechanism for Interpreting and Mitigating Knowledge Conflicts in Language Models. This might also work well with potential future advances in unlearning, e.g. of various facts, as discussed in The case for unlearning that removes information from LLM weights.