I like to distinguish between “Subtle strategic bias” and “Steganography.” Paraphrasers help (and maybe solve?) the latter, but do nothing about the former.
Example of subtle strategic bias: The model knows it’s misaligned and therefore deliberately sandbags on the alignment research you are asking it to do. “I’m sorry sir I just am not making much progress,” it says. You look at the CoT, and it’s just doing a bunch of sensible but fruitless lines of reasoning.
Example of steganography: The model needs to insert a subtle security vulnerability into the code it is writing—but it doesn’t know one right away, it has to invent one. And that requires lots of sequential thought. So it needs to try out different ideas, vet them, critique them, refine them, etc. Can’t do that in a single forward pass. So it needs to pass information forward to its future self somehow… without humans reading the CoT knowing that’s what it’s doing. THIS is the sort of thing that is much harder to do with a paraphraser than without a paraphraser.
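(A minimal sketch of the paraphraser setup under discussion, with hypothetical `generate` and `paraphrase` callables standing in for the model and the paraphraser; the loop structure is an assumption, not a description of any particular implementation.)

```python
# Minimal sketch of the paraphraser defense (hypothetical callables, not a real API):
# every CoT chunk is reworded before the model ever sees it again, so only the
# meaning of each chunk -- not its exact wording -- survives into the next step.

def run_with_paraphraser(task, generate, paraphrase, max_steps=100):
    """generate(context) -> next CoT chunk; paraphrase(chunk) -> reworded chunk."""
    context = [task]
    for _ in range(max_steps):
        chunk = generate("\n".join(context))   # model's next chunk of reasoning
        reworded = paraphrase(chunk)           # semantics preserved, syntax re-rolled
        context.append(reworded)               # the model only ever sees the paraphrase
        if "DONE" in reworded:
            break
    return context
```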
I don’t see why it would be harder to do.
Suppose you tasked a model with refactoring some database, and it secretly wants to sabotage its functionality. It outputs a CoT segment that goes, “I need to design unit tests for this module”. Each word of this CoT is mapped to some internal representation, and the internal representation consists of “on-task” human-intended reasoning X, and misaligned scheming Y (corresponding to, e.g., a tactical decision to fail to include unit tests for some subtle issue). “I” is mapped to ⟨X₁, Y₁⟩, “need” is mapped to ⟨X₂, Y₂⟩, etc. Each Xᵢ iterates on Xᵢ₋₁, each Yᵢ iterates on Yᵢ₋₁: we have a sequence of serial computations.
In other words: as the model outputs, “I need to design unit tests for this module”, there’s an unspoken subtext of “I need to sabotage the unit tests here”, and the model packs nine words’ worth of serial computations into both the “textual” X and the “subtextual” Y aspects of it.
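(A toy numeric stand-in for this picture, with made-up state strings and a hash chain in place of real reasoning; it only illustrates the claim that each emitted word lets both tracks take one serial step.)

```python
# Toy stand-in for the hypothesized two-track CoT: each emitted word advances
# both an overt task state (X) and a covert scheming state (Y) by one serial
# step, and the words are all that crosses between forward passes.

import hashlib

def step(state: str, word: str) -> str:
    # one unit of serial work, dependent on the previous state and the word
    return hashlib.sha256((state + word).encode()).hexdigest()

words = "I need to design unit tests for this module".split()

x, y = "X0: on-task reasoning", "Y0: scheme-y reasoning"
for w in words:
    x = step(x, w)  # X_{i-1} -> X_i
    y = step(y, w)  # Y_{i-1} -> Y_i
# nine words emitted => nine serial steps taken on *both* tracks
```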
Now you use the Paraphraser on the string, generating alternatives like “the next step is to generate some unit tests” and “several unit tests for this module are needed here”. The intent, if I understand you correctly, is that the semantic-equivalence-but-syntactic-difference would ensure that the i-th token of this phrase is mapped to Xᵢ, as if the paraphrasing didn’t happen, but that the mapping of i→Yᵢ would be broken.
Except, why would that happen? What privileges X-representations over Y-representations? Why would i→Xᵢ be preserved whereas i→Yᵢ is broken, instead of both of them remaining functional (and so preserving scheming), or both breaking (and so compromising the model’s capabilities)?
In other words: We’re basically just teaching the model to fail to distinguish between equivalent-to-a-human rephrasings. Why would this lead to the task-related representations becoming indifferent to the specifics of a phrasing whereas the scheming-related representations remain sensitive to it? Why wouldn’t this just map the subtextual “I need to sabotage the unit tests here” to similarly semantically-equivalent-but-syntactically-different “let us compromise the relevant unit-test suite” or “I shall ensure the unit tests don’t work properly”?
I expect it’d just lead to the model’s computations becoming indifferent to the details of a phrasing in full generality. And if the model were inclined to associate the general idea of generating unit tests with tactically screwing the task up, the model would come up with this idea regardless of the details of a phrasing.
It wouldn’t interfere with its ability to engage in scheming-relevant serial computations, much like it doesn’t interfere with its ability to engage in task-relevant serial computations, for the exact same reasons.
I don’t think I understand. Let’s work through an example.
The AI is being told to write good code, and wants to write subtly-vulnerable code instead. If it just were to honestly try to write good code, it would take (let’s say) 100 serial steps of reasoning to succeed. To write subtly-vulnerable code, let’s say, requires 300 serial steps of reasoning. The model’s strategy is to write subtly-vulnerable code but in a way that looks like it’s trying to write good code and just getting a bit unlucky and taking longer than usual (let’s suppose the humans have a poor ability to judge how long it’s supposed to take, but still some ability, such that it can’t take e.g. 10x longer without arousing suspicion).
If there’s no paraphraser, this is ‘easy.’ I mean it’s still difficult, but it’s ‘easy’ in the sense that the AI can basically just use a code—each nugget of reasoning might have 10 different semantically equivalent expressions, or maybe even 100, so it can ‘add extra layers of meaning’ to the nuggets—in addition to their surface-level I’m-trying-to-write-good-code meaning, they can have a hidden meaning having to do with the security-vulnerability-related properties of the code. And it can make a chain of reasoning that is basically about the security stuff while looking like a reasonable but unlucky attempt to write good code.
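(A toy version of such a code, with a made-up codebook: the choice among semantically equivalent phrasings of a reasoning nugget smuggles a couple of hidden bits forward to the model’s future self.)

```python
# Toy version of the "code": each reasoning nugget has several semantically
# equivalent phrasings, and the *choice* of phrasing carries hidden bits
# forward to the model's future self. (Codebook and bit assignments made up.)

CODEBOOK = {
    "unit-tests": [
        "I need to design unit tests for this module",     # hidden bits 00
        "The next step is to generate some unit tests",    # hidden bits 01
        "Several unit tests are needed for this module",   # hidden bits 10
        "Let me write unit tests covering this module",    # hidden bits 11
    ],
}

def encode(nugget: str, hidden_bits: int) -> str:
    return CODEBOOK[nugget][hidden_bits]       # pick the variant that encodes the bits

def decode(nugget: str, phrase: str) -> int:
    return CODEBOOK[nugget].index(phrase)      # future self recovers the hidden bits

phrase = encode("unit-tests", 0b10)
assert decode("unit-tests", phrase) == 0b10    # two covert bits per nugget, on top of the surface meaning
```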
But a paraphraser wrecks that strategy. The model can still add extra layers of meaning to the nuggets, but now it has loads of trouble making the chain of thought look like an honest attempt to write good code. To use your example, if ‘I should generate unit tests now’ and all semantically equivalent variants of that idea are assigned the hidden meaning ‘I should check to see if the vulnerability is too easy to spot’ for example, then… what is it going to write next? It’s too constrained; in order to look non-suspicious there is a limited set of options it can follow up with… ok, hmm, I’m at the limits of my understanding here. But maybe you get the idea?
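(Continuing the toy above: re-rolling the wording preserves the surface meaning but scrambles the hidden bits, which is the sense in which the paraphraser wrecks this particular channel. The crude random re-sampling here stands in for a real paraphraser.)

```python
# Continuing the toy above (CODEBOOK / decode / phrase as defined there): the
# paraphraser re-rolls the wording, which preserves the surface meaning but
# scrambles the hidden bits the model chose.

import random

def paraphrase(nugget: str, phrase: str) -> str:
    return random.choice(CODEBOOK[nugget])     # same meaning, wording re-sampled

recovered = decode("unit-tests", paraphrase("unit-tests", phrase))
# `recovered` equals the intended 0b10 only about 1 time in 4: the covert
# channel is destroyed while the overt content is untouched.
```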
…here’s an intuition I think I’m drawing from. Suppose you’re the model and you are trying to decide what hidden meanings to associate with what concepts/phrases/steps. If you already had your completed illicit reasoning chain and your completed good-boy reasoning chain, it would be easy—just line them up and make a 1-1 mapping, and then bam no paraphraser can stop you! But you don’t have either chain! That’s the whole point—this is a task that you haven’t done yet, you need to explore some maze in logical space (or tree? Or graph?) until you find a chain that gets to the conclusion/answer.
Ah, I think I see my error.
Consider a CoT of the form “[problem statement]: …........”, which always outputs dots regardless of the problem statement. Can that still encode useful computations, with the information being passed through intermediate self-attention? Yes, but: this only increases parallel-computing capabilities, not serial-computing capabilities.
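(A toy depth-counting sketch of that claim, not a real transformer: a cell at layer l can read its own column’s layer l−1 and earlier positions’ layer l−1, and the token embedding either carries nothing, for filler tokens, or carries the full depth of the computation that produced it, for informative tokens.)

```python
# Toy depth-counting sketch (not a real transformer): track the longest chain
# of dependent computations reaching each (position, layer) cell. A cell at
# layer l depends on its own layer l-1 and on earlier positions' layer l-1.
# With filler tokens the embedding carries nothing; with informative tokens it
# carries the full depth of the computation that produced the sampled token.

N_LAYERS, N_TOKENS = 4, 10

def max_serial_depth(informative_tokens: bool) -> int:
    depth = []
    for pos in range(N_TOKENS):
        token_depth = depth[pos - 1][-1] if (informative_tokens and pos > 0) else 0
        col = [token_depth]
        for layer in range(1, N_LAYERS + 1):
            best_prev = max((depth[p][layer - 1] for p in range(pos)), default=0)
            col.append(1 + max(col[layer - 1], best_prev))
        depth.append(col)
    return depth[-1][-1]

print(max_serial_depth(informative_tokens=False))  # 4: bounded by N_LAYERS (dots-only CoT)
print(max_serial_depth(informative_tokens=True))   # 40: grows with N_TOKENS (legible CoT)
```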
And in the framework I’d outlined, under the Paraphraser, for the purposes of scheme-y thinking, even a legible CoT would consist of such “filler” tokens.
So yep, you’re right: this does force all scheming to take place within one forward pass.