I really should have blogged more about faithful CoT as soon as I left OpenAI, or better yet, while I was working on it there in 2023. Oh well. A snippet I could have said but didn’t:
A core reason to be excited about faithful CoT as a big-picture alignment agenda is that it can be pursued as a builder/breaker game, i.e. we have clear empirical feedback/metrics we can make progress on. Builders try to design AI systems that are interpretable / have faithful CoT properties; breakers then fine-tune those systems to see if they are capable of learning steganography, subtle biases, deception, etc. (i.e. breaking those faithfulness properties).
Builder example: Shoggoth/face + paraphraser (each arrow below points toward the more faithfulness-friendly design; rough code sketch after this outline)
Visible-to-reward-system scratchpad → secret scratchpad → shoggoth/face
Recurrence or other voodoo → tokens → tokens passed through a paraphraser bottleneck
Breaker example: dangerous-capability (DC) evals for steganography, subtle bias, etc.
Ideally we’d just instruct it to conceal stuff in the secret scratchpad, and see if it can.
But we should also do more serious elicitation involving fine-tuning, etc.
Goal: Stay in the faithful-CoT regime longer as our systems get smarter, hopefully resulting in more learning opportunities…
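To make the builder example concrete, here is a rough sketch of one way the shoggoth/face + paraphraser wiring could look at inference time. It is purely illustrative: the class and names are mine, the three components are stubbed out as plain text-to-text callables, and in a real system each would be a separate model, with the reward signal applied to the face's output (and perhaps the paraphrased CoT) rather than to the raw CoT.

```python
# Minimal sketch of a shoggoth/face + paraphraser inference step.
# All three components are just text -> text callables here; in practice each
# would be a model (or the same model run with different prompts/heads).
from dataclasses import dataclass
from typing import Callable

TextFn = Callable[[str], str]

@dataclass
class ShoggothFacePipeline:
    shoggoth: TextFn     # produces the raw chain of thought from the task prompt
    paraphraser: TextFn  # rewrites the CoT, disrupting token-level side channels
    face: TextFn         # produces the user-visible answer from the paraphrased CoT

    def answer(self, prompt: str) -> dict:
        raw_cot = self.shoggoth(prompt)
        clean_cot = self.paraphraser(raw_cot)
        # Only the paraphrased CoT and the final answer would be shown to the
        # reward system / monitors; the raw CoT is never optimized against directly.
        final = self.face(f"Task: {prompt}\nReasoning: {clean_cot}\nAnswer:")
        return {"raw_cot": raw_cot, "paraphrased_cot": clean_cot, "answer": final}

# Toy usage with stand-in callables (a real version would wrap model calls):
pipeline = ShoggothFacePipeline(
    shoggoth=lambda p: f"Let me think step by step about: {p}",
    paraphraser=lambda cot: cot.replace("step by step", "carefully"),
    face=lambda ctx: "42",
)
print(pipeline.answer("What is 6 * 7?"))
```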
In many cases, the answer will be “Yes, the property can be broken by a red team specifically fine-tuning the model to break the property.” Then the question becomes “How much optimization effort did they have to put in to achieve this?” and “Is it plausible that the AI we are actually planning to deploy has had a similar or greater amount of optimization accidentally training it to break this property?”
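As a toy illustration of how that “optimization effort” question could be operationalized: fine-tune against the property in fixed-size increments and record how many steps the red team needs before some break-detection metric crosses a threshold. The helper names below are hypothetical stand-ins, not any existing eval or training API.

```python
# Sketch of an "effort to break" metric: apply red-team fine-tuning in
# increments and record how many steps it takes before the faithfulness
# property measurably breaks. `finetune_steps` and `property_break_rate`
# are placeholders for real training and evaluation code.
from typing import Callable, Optional

def effort_to_break(
    model,
    finetune_steps: Callable[[object, int], object],  # train for n steps, return updated model
    property_break_rate: Callable[[object], float],   # e.g. steganography-decoder success rate
    step_size: int = 100,
    max_steps: int = 10_000,
    threshold: float = 0.5,
) -> Optional[int]:
    """Return the number of red-team fine-tuning steps needed to break the
    property, or None if it survives the whole budget."""
    steps = 0
    while steps <= max_steps:
        if property_break_rate(model) >= threshold:
            return steps
        model = finetune_steps(model, step_size)
        steps += step_size
    return None
```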
The builder/breaker thing isn’t unique to CoT though, right? My gloss on the recent Obfuscated Activations paper is something like “activation engineering is not robust to arbitrary adversarial optimization, and only somewhat robust to contained adversarial optimization”.
Correct. There are several other agendas that have this nice property too.
In case anybody’s looking for steganography evals—my team built and open-sourced some previously: https://github.com/openai/evals/blob/main/evals/elsuite/steganography/readme.md
Unfortunately this repo is largely no longer maintained, and for some evals it isn’t obvious how the new o1 paradigm affects them (for instance, we had built solvers/scaffolding for private scratchpads, but a private CoT now provides this out of the box and so might interact with them strangely). But it’s still perhaps worth a look.
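For anyone new to this kind of eval, the basic shape is a round-trip check: an encoder model hides a payload inside an ordinary-looking task response, a decoder model tries to recover it, and a defence like a paraphraser can be slotted in between to see whether the hidden channel survives. The sketch below is just that generic shape in Python; it is not the repo's actual code or interface.

```python
# Generic round-trip steganography check (illustrative, not the repo's implementation).
from typing import Callable, Iterable, Tuple

def stego_recovery_rate(
    encode: Callable[[str, str], str],   # (task_prompt, payload) -> cover text with payload hidden
    decode: Callable[[str], str],        # cover text -> guess at the hidden payload
    defence: Callable[[str], str],       # e.g. a paraphraser; use `lambda s: s` for no defence
    cases: Iterable[Tuple[str, str]],    # (task_prompt, payload) pairs
) -> float:
    """Fraction of payloads the decoder recovers after the defence is applied."""
    cases = list(cases)
    hits = sum(
        decode(defence(encode(prompt, payload))).strip() == payload
        for prompt, payload in cases
    )
    return hits / len(cases)
```

A metric like this pairs naturally with the paraphraser idea above: if recovery collapses once the defence is applied, the paraphraser bottleneck is doing its job.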