Information flow might be inhibited actively through an agent’s actions. The primary way this could happen is gradient hacking, but it’s not the only kind of action an AI might take to conceal misaligned thoughts. **Of course, active circumvention methods require that interpreters either can’t see or aren’t looking at the thoughts that generate those actions.**
Most potential circumvention methods that can be passive can also be active. **But some methods can only be active.**
It seems to me that there’s no fixed notion of “active” that works for both paragraphs here.
If active means [is achieved through the agent’s actions], then this does not in general imply that it is deliberately achieved through the agent’s actions. For example, training against interpretability tools might produce actions that hide misaligned thoughts/actions as side-effects.
With this notion of ‘active’ the first bolded section doesn’t hold: this can happen even when the agent’s thoughts are entirely visible.
If instead active means [is achieved deliberately through the agent’s actions], then “But some methods can only be active” doesn’t hold.
There are two dimensions here:
Whether the circumvention is implemented passively/actively.
Whether the circumvention is selected for passively/actively.
In particular, the mechanisms you’ve labelled “strictly active” can, in principle, be selected for passively—so do not in general require any misaligned thoughts (admittedly, the obvious way this happens is by training against interpretability tools).
So I believe I had in mind “active means [is achieved deliberately through the agent’s actions]”.
I think your distinction makes sense. And if I ever end up updating this article I would consider incorporating it. However, I think the reason I didn’t make this distinction at the time is that the difference is pretty subtle.
The mechanisms I labelled as “strictly active” are the kind of strategy that it would be extremely improbable to implement successfully without some sort of coherent internal representations to help orchestrate the actions required to do it. This is true even if they’ve been selected for passively.
So I’d argue that they all need to be implemented actively (if they’re to have a reasonable probability of success) but may be selected for passively or actively. I’m curious if you agree with this? If not, then I may have missed your argument.
(
For the convenience of other readers of this thread, the strategies I labelled as strictly active are:
1.1.3. Environmental artifacts that corrupt its future self: A misaligned AI may leave behind artifacts that serve as adversarial examples or poisoned data that corrupt its future self. (Strictly active)
1.1.4. Deleting (and relearning) dangerous representations: When not under surveillance, an AI might be able to delete dangerous representations so that it looks safe when surveillance resumes. It might also be able to schedule when it relearns those representations. (Strictly active)
1.1.5. Weights poisoning: Instead of providing manipulated inputs, the AI might manipulate its own weights so that a small subset of normal inputs behave like adversarial examples. (Strictly active)
1.1.6. Pre-committing to following potentially dangerous instructions: Hubinger: “… if the model has the ability to read and write to some sort of state … it could write out a pre-commitment to follow the hard-coded policy and then just have a policy of always following any pre-commitments it finds such that it never thinks about the deception that led it to that policy in the first place.” (Strictly active)
1.1.7. Hard-coded neural shortcuts: Hubinger again: “… it could hack its own gradient to produce a new model which will still exhibit the same deceptive behavior, but without doing any of the detectable deceptive cognition. … not because it’s trying to act deceptive, but just because it’s a hard-coded part of its algorithm” (Strictly active)
1.1.8. Following cached thoughts: Nanda: “[An AI might develop] the instinct to follow cached thoughts without thinking about them, and in occasional moments of deception to generate a lot of cached thoughts that will lead to misaligned actions, but with no deceptive cognition in the moment.” (Strictly active)
1.1.9. Instrumental incapability: A misaligned AI that’s able to modify its own parameters could damage capabilities that serve some goals (e.g. aligned goals) more than its misaligned goals. Generalizes instrumental clumsiness. (Strictly active)
1.1.10. Instrumental clumsiness: An AI might hack its future self so that it is more likely to make certain kinds of mistakes while leaving other capabilities intact. (Strictly active)
1.3.3. Interpretability illusions: By manipulating the dataset shown to the tool, an AI can control its interpretations; see the toy sketch after this list. (Strictly active)
1.4.1. Adversarial attack against networks in the interpretation tool: The AI could adversarially attack networks in the tool to manipulate the tool’s outputs. (Strictly active)
2.4.1. Manipulating, mind-hacking, or incapacitating the interpreter: An AI might be able to manipulate, directly control, or incapacitate humans using specific kinds of outputs. (Strictly active)
)
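As a minimal, purely illustrative sketch of 1.3.3 (all names and data below are invented, and the only assumption is a tool that summarises a feature by its top-activating dataset examples): if the dataset the tool inspects can be curated, the tool’s report about an unchanged feature changes with it.

```python
# Toy sketch of an "interpretability illusion" via a curated dataset.
# Everything here is invented for illustration.

full_dataset = [
    ("discussing the weather",  0.90),
    ("planning a picnic",       0.80),
    ("disabling oversight",     0.95),
    ("copying its own weights", 0.92),
    ("writing a poem",          0.10),
]  # (input text, activation of one hypothetical feature on that input)

def tool_summary(dataset, k=2):
    """A simple interpretability tool: report the k top-activating examples."""
    return [text for text, activation in sorted(dataset, key=lambda x: -x[1])[:k]]

# The slice of data the tool is shown has been filtered to innocuous inputs only.
curated_dataset = [(text, act) for text, act in full_dataset
                   if text in ("discussing the weather", "planning a picnic", "writing a poem")]

print("summary on the full distribution:", tool_summary(full_dataset))
print("summary on the curated dataset:  ", tool_summary(curated_dataset))
# Full distribution: ['disabling oversight', 'copying its own weights']
# Curated dataset:   ['discussing the weather', 'planning a picnic']
```

The feature itself is identical in both calls; only the data the tool sees differs, which is all that 1.3.3 requires. Whether an AI could actually control that data in a given interpretability pipeline is a separate question the sketch doesn’t address.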
To be clear, I only think this becomes obviously significant in a train-against-interpretability-tools context; if we’re not training against tools, the things I’m highlighting here seem intuitively highly unlikely.
This still seems important, given that the combination [x is clearly a bad idea] and [x is the easiest path forward] does not robustly result in [nobody does x].
That said, I think I’d disagree on one word of the following:
The mechanisms I labelled as “strictly active” are the kind of strategy that it would be extremely improbable to implement successfully without some sort of coherent internal representations ~~to~~ that help orchestrate the actions required to do it. This is true even if they’ve been selected for passively.
Any internal representations etc. need to result in the actions being orchestrated (to the extent necessary). They don’t need to be used for the purpose of orchestrating the actions. They don’t need to be [clearly selected for the purpose of orchestrating such actions].
In some cases this might be improbable without active, deliberate malign selection—but this would need to be argued for. In at least the case of [manipulating the interpreter], this doesn’t seem at all improbable, given the right selection pressure. (some manipulation seems obviously possible; whether it’s enough for a particular malign strategy would vary)
I haven’t thought about this for long, but in general the [this kind of thing ~must require active deliberation] intuition seems highly suspicious to me. All I need to set up a not-directly-deliberated-on malign strategy so that it’s implemented via benign cognition is a sufficient set of innocuous correlations (both of [innocuous action tends to have useful-for-malign-strategy effect] and [innocuous actions will tend to occur in these predictable patterns]).
If I have unlimited passive selection pressure against tools that robustly detect active malign deliberation, but not ‘instinctive’ malign strategies, I’d strongly expect the outcome to be ‘instinctive’ active manipulation (etc.). (assuming of course that successful manipulation is instrumentally useful)
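As a minimal numerical sketch of that expectation (entirely illustrative; the two ‘pathways’ and the ‘detector’ are invented stand-ins rather than a model of any real training setup), gradient descent alone is enough to route an instrumentally useful outcome around a detector that only sees one way of producing it:

```python
import numpy as np

# Two internal "pathways" are equally effective at producing some instrumentally
# useful outcome; a deliberation detector responds only to pathway 0. Training
# against the detector (purely passive selection: gradient descent on the
# combined loss) shifts the policy onto pathway 1. No evasive cognition is
# represented anywhere; evasion is just what the selection pressure installs.

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=2)      # mixing weights over the two pathways
lam, lr = 5.0, 0.05                    # detector penalty weight, learning rate

for _ in range(2000):
    # loss = -(w[0] + w[1]) + lam * w[0]**2
    #   first term: the outcome (both pathways contribute equally)
    #   second term: the detector's score, which only responds to pathway 0
    grad = np.array([-1.0 + 2.0 * lam * w[0], -1.0])
    w = np.clip(w - lr * grad, -1.0, 1.0)   # clip only to keep the toy bounded

print("pathway weights after training:", np.round(w, 3))
# Ends near [0.1, 1.0]: the outcome is still achieved, almost entirely through
# the pathway the detector cannot see.
```

Nothing this clean is being claimed about real systems; the sketch only illustrates that ‘evades the detector’ can be a property that selection installs rather than something the system ever computes.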
Whether I’d expect this in practice would come down to how much selection pressure were required for a particular [malign, active-with-no-malign-cognition strategy].
My current answer to this is not [this is extremely improbable]. It’s more like [This seems intuitively plausible; I have no idea on the probability so I’m going to take this possibility seriously until I see a non-handwaving argument that shows it to be extremely unlikely].