I don’t dispute that transformers can memorize shortcuts. I do dispute their ability to perform latent (opaque) multi-hop reasoning robustly. And I think this should be (very) non-controversial; e.g. Mor Geva has many papers on this topic.
What would a plausibly valid definition of the multi-hop reasoning we care about look like, one that excludes getting mathematical proofs right, answering complicated never-before-seen physics questions, and doing the kinds of things a smaller model would have needed a CoT for?
The kind of instrumental reasoning required for alignment faking seems relevant, including via n-hop latent reasoning; see e.g. section ‘B.1.3 HIDDEN SCHEMING REASONING’ of Towards evaluations-based safety cases for AI scheming. I wouldn’t be too surprised if models could currently bypass this through shortcuts, but a mix of careful data filtering and unlearning of memorized facts about deceptive learning, as suggested in https://www.lesswrong.com/posts/9AbYkAy8s9LvB7dT5/the-case-for-unlearning-that-removes-information-from-llm#Information_you_should_probably_remove_from_the_weights, could force models to externalize their reasoning if they were to try to alignment-fake. Externalized reasoning should be much easier to monitor than latent reasoning. Steganography would be another threat model here, though, as discussed e.g. in section ‘B.1.2 OBFUSCATED SCHEMING REASONING’ of the same paper.