habryka comments on Thoughts on sharing information about language model capabilities

habryka 5 Dec 2024 16:28 UTC
2 points
0
I do not understand your comment at all. Why would it be falsified? Transformers are completely capable of steganography if you apply pressure towards it, which we will (and have done).

In Deepseek we can already see weird things happening in the chain of thought. I will happily take bets that we will see a lot more of that.
- Bogdan Ionut Cirstea 5 Dec 2024 16:37 UTC
  2 points
  −3
  Parent
  I’m pointing out that transformers seem really bad at internal multi-hop reasoning; currently they can’t even do 2-hop robustly, 3-hop robustly seems kind of out of the question right now, and scaling doesn’t seem to help much either (see e.g. Figures 2 and 3 in Do Large Language Models Perform Latent Multi-Hop Reasoning without Exploiting Shortcuts? and also how much more robust and scalable CoT reasoning is). So ‘chain-of-thought but internalized to the model will take over’ seems very unlikely with transformers, and much more so if basic mitigations like unlearning (e.g. of memorized facts about deceptive learning, as suggested in https://www.lesswrong.com/posts/9AbYkAy8s9LvB7dT5/the-case-for-unlearning-that-removes-information-from-llm#Information_you_should_probably_remove_from_the_weights, to mitigate the possibility of latent scheming) were applied.
  Steganography is a separate threat model, but even there I’d interpret current evidence (e.g. Preventing Language Models From Hiding Their Reasoning) as mostly positive (as in, even relatively simple mitigations like paraphrasing seem to go very far).
  - habryka 5 Dec 2024 16:48 UTC
    3 points
    0
    Parent
    Transformers are obviously capable of doing complicated internal chains of reasoning. Just try giving them a difficult problem and force them to start their answer in the very next token. You will see no interpretable or visible traces of their reasoning, but they will still get it right for almost all questions.
    
    Visible CoT is only necessary for the frontier of difficulty. The rest is easily internalized.
    - Bogdan Ionut Cirstea 5 Dec 2024 16:53 UTC
      2 points
      −2
      Parent
      I don’t dispute that transformers can memorize shortcuts. I do dispute their ability to perform latent (opaque) multi-hop reasoning robustly. And I think this should be (very) non-controversial; e.g. Mor Geva has many papers on this topic.
      - habryka 5 Dec 2024 17:02 UTC
        4 points
        0
        Parent
        What is plausibly a valid definition of multi-hop reasoning that we care about and that excludes getting mathematical proofs right and answering complicated never-before-seen physics questions and doing the kind of thing that a smaller model needed to do a CoT for?
        Bogdan Ionut Cirstea 5 Dec 2024 17:58 UTC
        2 points
        0
        Parent
        The kind of instrumental reasoning required for alignment faking seems relevant, including through n-hop latent reasoning; see e.g. section ‘B.1.3 HIDDEN SCHEMING REASONING’ from Towards evaluations-based safety cases for AI scheming. I wouldn’t be too surprised if models could currently bypass this through shortcuts, but a mix of careful data filtering + unlearning of memorized facts about deceptive learning, as suggested in https://www.lesswrong.com/posts/9AbYkAy8s9LvB7dT5/the-case-for-unlearning-that-removes-information-from-llm#Information_you_should_probably_remove_from_the_weights, could force them to externalize their reasoning (which should be much easier to monitor than latent reasoning), if they were to try to alignment-fake; though steganography would also be another threat model here, as discussed e.g. in section ‘B.1.2 OBFUSCATED SCHEMING REASONING’ of Towards evaluations-based safety cases for AI scheming.