For fixed-duration reasoning chains, or in situations where the AI is trying to facilitate cooperation between different passes (like in the proposed scenario where it gets information but then loses access to it), this seems important. For example, in a situation where you show the model a passage, let it make notes, and then have the same model look only at the notes and answer a reading comprehension question, I expect it will likely want to embed information in the null space of the notes. [Or in an HCH-style situation, it might want to embed the original question into subquestions, and might do so steganographically, in a way that breaks some of the things one might be hoping for there.]
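For concreteness, here is a minimal sketch of that passage-to-notes-to-answer setup. The `generate` function is a placeholder standing in for whatever language model is used; its name and signature are illustrative assumptions, not an existing API.

```python
# Minimal sketch of the two-pass "notes" setup described above.
# `generate` is a placeholder for a call to whichever language model is used;
# its name and signature are assumptions for illustration, not a real API.

def generate(prompt: str) -> str:
    """Stand-in for a language model call."""
    raise NotImplementedError("wire this up to a model of your choice")


def answer_via_notes(passage: str, question: str) -> str:
    # Pass 1: the model sees the passage and writes notes.
    notes = generate(
        "Read the passage and take notes useful for answering later questions.\n\n"
        + passage
    )
    # Pass 2: the same model sees ONLY the notes (not the passage) plus the
    # question. Any extra information hidden in the notes (word choice,
    # formatting, etc.) still reaches this pass, which is where the incentive
    # to embed information in the "null space" of the notes comes from.
    return generate(
        "Using only these notes, answer the question.\n\n"
        "Notes:\n" + notes + "\n\nQuestion: " + question
    )
```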
But I’m somewhat confused about whether you should expect it to crop up in recursive prompting cases (probably you don’t?). If you secretly know the answer in the first pass, don’t you want to just say the answer and terminate? If you secretly know a relevant fact in the first pass, don’t you want to openly express it and recur? I don’t see how steganography is rewarded over openness. [I’m setting aside the case where it invents a new language to express non-linguistic concepts more cheaply, and assuming we’re in the natural language setting where standard language is the ‘natural’ way to transmit the information.]
I think this is worth considering even in cases like recursive prompting, for a couple of reasons that I haven't fully organized:
In general, additional compute time lets you improve your answer, so by default I'd expect answers to any significant question to take multiple steps
I agree that for trivial questions it might not do any of this. My expectation is that the impact of AI deployments will mostly be on non-trivial questions and tasks
So far, research hasn't done much to regularize or penalize models for using a lot of thinking
Mostly they've optimized for "maximize the chance of a correct answer"
This could change in the future, especially in cases where the model is expected to be deployed or is otherwise sensitive to runtime costs
Steganography is rewarded over openness when openly expressing a given piece of information would land far from the prior
Caveat: this point is more speculative than the document above
One of my points is that due to null space + optimization pressure, it’s very easy and convenient for gradient descent to put information in these hidden dimensions
I think one of the things this can handle well is concepts that don't yet exist, so reasoning moves like "invent a new term" or "give an illustrative example" or "invent a thought experiment" are lower likelihood under the prior
In these cases, steganography can become the more likely way to transmit certain information, if it is less expensive in terms of KL penalties (see the sketch after this list)
I do agree that for straightforward reasoning on straightforward evidence, where the reasoning is human-like and close to the prior, we can expect it to mostly reason out loud in a way that is “open”
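To make the KL point above concrete, here is a toy calculation with entirely made-up numbers. It compares an "open" statement that is useful but unlikely under the base model's prior against a steganographic message that reads as ordinary prose and stays close to the prior, under a reward of the form task reward minus a KL penalty against the prior. This is only a sketch of the trade-off, not a measurement of any real system.

```python
# Toy illustration (all numbers invented) of why a KL-regularized objective
# can favor a steganographic message over an "open" but low-prior statement.
# Per-sequence objective: reward = task_reward - beta * (logp_policy - logp_prior),
# using the standard single-sample estimate of KL(policy || prior).

beta = 0.1  # assumed strength of the KL penalty

candidates = {
    # Openly stating a novel concept: useful, but very unlikely under the
    # base model's prior, so it pays a large KL cost.
    "open":  {"task_reward": 1.0, "logp_policy": -20.0, "logp_prior": -60.0},
    # A steganographic encoding that reads as ordinary prose: slightly less
    # useful for the task, but close to the prior, so the KL cost is small.
    "stego": {"task_reward": 0.9, "logp_policy": -22.0, "logp_prior": -25.0},
}

for name, c in candidates.items():
    kl_estimate = c["logp_policy"] - c["logp_prior"]
    total = c["task_reward"] - beta * kl_estimate
    print(f"{name}: task_reward={c['task_reward']}, KL~{kl_estimate}, total={total:.2f}")

# With these invented numbers:
#   open : 1.0 - 0.1 * 40 = -3.0
#   stego: 0.9 - 0.1 *  3 =  0.6
# so the penalized objective prefers the steganographic message even though
# the open one is slightly more useful for the task.
```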
Myopia, though less central than it was in earlier RFLO research, seemed like a good alignment criterion to aim for, and to some extent still does today
The starting point for this direction of thinking was Myopia / Non-Myopia in chain of thought reasoning systems
The sorts of guarantees I was hoping to get for chain of thought systems included myopia guarantees
Steganography came up as an example of how myopia could break (other, more complicated examples include side effects and situational awareness)
The concepts in this document came up by asking “how might steganography arise naturally given current research techniques and directions”