- I think this is worth considering even in cases like recursive prompting, for a couple of reasons that I haven't fully organized:
  - In general, additional compute time lets you improve your answer, so by default I'd expect answers to any significant question to take multiple steps.
  - I agree that for trivial questions it might not do any of this. My expectation is that the impacts of AI deployment will mostly come from non-trivial questions and tasks.
  - So far, current research hasn't done much to regularize or penalize models for using a lot of thinking.
  - Mostly, the optimization has been for "maximize the chance of a correct answer" (a minimal sketch of this kind of reward follows this list).
  - This could change in the future, especially in cases where the model is expected to be deployed or is otherwise sensitive to runtime costs.
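To make the "no penalty on thinking" point concrete, here is a minimal sketch of that kind of reward signal, with an optional length penalty that current setups mostly leave at zero. The function and parameter names are illustrative, not taken from any particular codebase:

```python
def reward(is_correct: bool, num_reasoning_tokens: int,
           length_penalty: float = 0.0) -> float:
    """Reward used to train or select reasoning traces.

    Current practice roughly corresponds to length_penalty = 0: the model is
    optimized only for getting the right answer. A deployment-sensitive setup
    might set length_penalty > 0 to charge for extra thinking tokens.
    """
    correctness = 1.0 if is_correct else 0.0
    return correctness - length_penalty * num_reasoning_tokens
```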
- Steganography is rewarded over openness if, for a given piece of information the model wants to transmit, the open statement of it is somehow far from the prior.
  - Caveat: this point is more speculative than the document above.
  - One of my points is that, due to null space + optimization pressure, it's very easy and convenient for gradient descent to put information into these hidden dimensions (see the toy sketch after this list).
  - I think one of the things this can do well is handle concepts that don't yet exist, and so reasoning moves like "invent a new term", "give an illustrative example", or "invent a thought experiment" have lower likelihood under the prior.
  - In these cases, steganography can be the more likely channel for transmitting certain information, if it is less expensive in terms of KL penalties (a rough formalization also follows this list).
  - I do agree that for straightforward reasoning on straightforward evidence, where the reasoning is human-like and close to the prior, we can expect the model to mostly reason out loud in a way that is "open".
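One way to picture the null-space point, as a toy sketch under my own assumptions rather than a claim about how any real model encodes things: many token choices are semantically interchangeable, so a policy can carry extra bits in which synonym it picks without changing the overt meaning or paying much of a likelihood cost.

```python
# Toy illustration of "hidden dimensions": interchangeable word choices form a
# null space with respect to the overt meaning, and a policy can use that
# freedom to carry one hidden bit per choice. Purely illustrative.
SYNONYMS = [("big", "large"), ("show", "demonstrate"), ("start", "begin")]

def encode(bits: list[int]) -> list[str]:
    """Pick one synonym per slot; the choice itself carries one hidden bit."""
    return [pair[b] for pair, b in zip(SYNONYMS, bits)]

def decode(words: list[str]) -> list[int]:
    """Recover the hidden bits from which synonym was chosen in each slot."""
    return [pair.index(w) for pair, w in zip(SYNONYMS, words)]

assert decode(encode([1, 0, 1])) == [1, 0, 1]
```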
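And a rough formalization of the KL point, in my own notation: with a KL-regularized objective of the kind used in RLHF, where $R$ is the task reward, $\pi$ the trained policy, $\pi_0$ the pretrained prior, and $\beta$ the KL coefficient,

$$\max_\pi \; \mathbb{E}_{x \sim \pi}\,[R(x)] \;-\; \beta\, D_{\mathrm{KL}}\!\left(\pi \,\|\, \pi_0\right),$$

stating a piece of information "openly" in text that has low probability under $\pi_0$ pays a large KL cost, while hiding the same information inside high-prior text pays almost none. So whenever both routes earn similar reward, the optimizer is pushed toward the hidden one.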
- Myopia, though less important than it was earlier in RFLO research, seemed like a good alignment criterion to aim for, and still does somewhat today.
  - The original starting point for this direction of thinking was myopia / non-myopia in chain-of-thought reasoning systems.
  - The sorts of guarantees I was hoping to get for chain-of-thought systems included myopia guarantees.
  - Steganography came up as an example of how myopia could be broken (other, more complicated examples include side effects and situational awareness).
  - The concepts in this document came out of asking "how might steganography arise naturally, given current research techniques and directions?"