habryka comments on Thoughts on sharing information about language model capabilities

habryka 31 Jul 2023 16:43 UTC
LW: 49 AF: 20
13
AF
Language model agents are built out of LM parts that solve human-comprehensible tasks, composed along human-comprehensible interfaces.
This seems like a very narrow and specific definition of languade model agents that doesn’t even obviously apply to the most agentic language model systems we have right now. It is neither the case that human-comprehensible task decomposition actually improves performance on almost any task for current language models (Auto-GPT does not actually work), and it is not clear that current RLHF and RLAF trained-models are “solving a human-comprehensible task” when I chat with them. They seem to pursue a highly-complicated mixed-objective which includes optimizing for sycophancy, and various other messy things, and their behavior seems only weakly characterized as primarily solving a human-comprehensible task.
But even assuming that current systems are doing something that is accurately described as solving human-comprehensible tasks composed along human-comprehensible interfaces, this seems (to me) unlikely to continue much into the future. RLHF and RLAF already encourage the system to do more of its planning internally in a non-decomposed manner, due to the strong pressure towards giving short responses, and it seems likely we will train language model agents on various complicated game-like environments in order to train them explicitly to do long-term planning.
These systems would still meaningfully be “LM agents”, but I don’t see any reason to assume that those systems would continue to do things in the decomposed manner that you seem to be assuming here.
My guess is it might be best to clarify that you are not in-general in favor of advancing agents built on top of language models, which seem to me to be very hard to align in-general, but are only in favor of advancing a specific technology for making agents out of language models, which tries to leverage factored cognition and tries to actively avoid giving the agents complicated end-to-end tasks with reinforcement learning feedback.
And my guess is that we then have a disagreement in that you expect that by-default the ML field will develop in a direction that will leverage factored-cognition style approaches, which doesn’t currently seem that likely to me. I expect more end-to-end reinforcement learning on complicated environments, and more RLHF-style optimization to internalize agentic computation. Might be worth trying to make a bet on the degree to which LM capabilities will meaningfully be characterized as doing transparent task-decomposition and acting along simple human-comprehensible interfaces.

My current guess is that sharing info on language model capabilities is overall still good, but I disagree with the “Improvements in LM agents seem good for safety” section. My best guess is that the fastest path to LM agents is to just do more end-to-end training on complicated environments. This will not produce anything that is particularly easy to audit or align. Overall, this approach to building agents seems among the worst ways AI could develop in terms of making it easy to align, and my guess is there are many other ways that would be substantially better to push ahead instead.
- paulfchristiano 31 Jul 2023 17:19 UTC
  LW: 22 AF: 10
  11
  AF Parent
  I do think that right now LMs are by far closest to doing useful work by exploiting human-legible interfaces and decompositions. Chain of thought, simple decompositions, and imitations of human tool use are already important for LM performance. While more complex LM agents add only a small amount of additional value, it seems like extrapolating trends would make them pretty important soon.
  Overall I think the world is shaping up extremely far in the direction of “AI systems learn to imitate human cognitive steps and then compose them into impressive performance.” I’m happy to bet about whether that trend will continue to the extent we can operationalize it. E.g. I’d bet on increasingly wide gaps between each of LM snap judgments, chain of thought, and tool-use. I don’t have a strong view about more complex decompositions unless context length is a serious limitation. I would guess that end-to-end optimization will make at most marginal differences in efficacy (probably smaller than RLHF).
  To the extent models trained with RLHF are doing anything smart in the real world I think it’s basically ~100% by solving a human-comprehensible task. Namely humans give the system a task, and it tries to do some rough combination of what a particular kind of human demonstrator would do and what a particular kind of human evaluator would rate highly. There is no further optimization to take intelligent actions in the world.
  - habryka 31 Jul 2023 17:47 UTC
    LW: 23 AF: 8
    8
    AF Parent
    Chain of thought, simple decompositions, and imitations of human tool use (along comprehensible interfaces) are already important for LM performance.
    I want to separate prompt-engineering from factored cognition. There are various nudges you can use to get LLMs to think in ways that are more productive or well-suited for the task at hand, but this seems quite different to me from truly factored cognition, where you spin up a sub-process that solves a sub-problem, and then propagate that back up to a higher-level process (like Auto-GPT). I don’t currently know of any not-extremely-gerry-mandered task where doing this actually improves task performance compared to just good prompt engineering. I’ve been looking for examples of this for a while, so if you do have any, I would greatly appreciate it.
    Overall I think the world is shaping up extremely far in the direction of “AI systems learn to imitate human cognitive steps and then compose them into impressive performance.”
    I’ve seen little evidence of this so far, and don’t think current LLM performance is even that well-characterized by this. This would be great, but I don’t currently think its true.
    For example, I don’t really understand whether this model is surprised or unsurprised by the extreme breadth of knowledge that modern LLMs have. I don’t see any “imitation of human cognitive steps” when an LLM is capable of remembering things from a much wider range of topics. It seems just that its way of acquiring knowledge is very different from humans, giving rise to a very different capability-landscape. This capability does not seem to be built out of “composition of imitations of human cognitive steps”.
    Similarly, when I use Codex for programming, I do not see any evidence that suggests that Codex is solving programming problems by composing imitations of human cognitive steps. Indeed, it mostly seems to just solve the problems in one-shot, vastly faster than I would be able to even type, and in a way that seems completely alien to me as a programmer.
    E.g. I’d bet on increasingly wide gaps between each of LM snap judgments, chain of thought, and tool-use.
    I do indeed predict that we will see chain-of-thought become less faithful as model capabilities increase, and that other ways of doing the same thing as chain-of-thought but internalized to the model will take over.
    I have no strong opinions on tool-use. Seems like the LLMs will use APIs the same way as humans would. I do think if you train more on end-to-end tasks, the code they write to solve sub-problems will become less readable. I have thought less about this and wouldn’t currently take a bet.
    I would guess that end-to-end optimization will make at most marginal differences in efficacy (probably smaller than RLHF).
    I am not super confident of this, but my best guess is that we will see more end-to-end optimization, and that those will make a big difference in task performance. It also seems like a natural endpoint of something like RLAF, where you have the AI guide a lot of the training process itself when given a highly-complicated objective, and then you do various forms of RL on self-evaluations.
    - L Rudolf L 1 Aug 2023 5:10 UTC
      26 points
      12
      Parent
      I don’t currently know of any not-extremely-gerry-mandered task where [scaffolding] actually improves task performance compared to just good prompt engineering. I’ve been looking for examples of this for a while, so if you do have any, I would greatly appreciate it.
      Voyager is a scaffolded LLM agent that plays Minecraft decently well (by pulling in a textual description of the game state, and writing code interfacing with an API). It is based on some very detailed prompting (see the appendix), but obviously could not function without the higher-level control flow and several distinct components that the scaffolding implements.
      It does much better than AutoGPT, and also the paper does ablations to show that the different parts of the scaffolding in Voyager do matter. This suggests that better scaffolding does make a difference, and I doubt Voyager is the limit.
      I agree that an end-to-end trained agent could be trained to be better. But such training is expensive, and it seems like for many tasks, before we see an end-to-end trained model doing well at it, someone will hack together some scaffold monstrosity that does it passably well. In general, the training/inference compute asymmetry means that using even relatively large amounts of inference to replicate the performance of a larger / more-trained system on a task may be surprisingly competitive. I think it’s plausible this gap will eventually mostly close at some capability threshold, especially for many of the most potentially-transformative capabilities (e.g. having insights that draw on a large basis of information not memorised in a base model’s weights, since this seems hard to decompose into smaller tasks), but it seems quite plausible the gap will be non-trivial for a while.
      - habryka 1 Aug 2023 5:20 UTC
        6 points
        0
        Parent
        Voyager is a scaffolded LLM agent that plays Minecraft decently well (by pulling in a textual description of the game state, and writing code interfacing with an API). It is based on some very detailed prompting (see the appendix), but obviously could not function without the higher-level control flow and several distinct components that the scaffolding implements.
        That’s a good example, thank you! I actually now remembered looking at this a few weeks ago and thinking about it as an interesting example of scaffolding. Thanks for reminding me.
        I agree that an end-to-end trained agent could be trained to be better. But such training is expensive, and it seems like for many tasks, before we see an end-to-end trained model doing well at it, someone will hack together some scaffold monstrosity that does it passably well. In general, the training/inference compute asymmetry means that using even relatively large amounts of inference to replicate the performance of a larger / more-trained system on a task may be surprisingly competitive.
        I do wonder how much of this is just the result of an access gap. Getting one of these scaffolded systems to work seems also a lot of hassle and very fiddly, and my best guess is that if OpenAI wanted to solve this problem, they would probably just reinforcement learn a bunch, and then maybe they would do a bit of scaffolding, but the scaffolding would be a lot less detailed and not really be that important to the overall performance of the system.
    - paulfchristiano 31 Jul 2023 18:50 UTC
      LW: 12 AF: 6
      4
      AF Parent
      Although this is an important discussion I want to emphasize up front that I don’t think it’s closely related to the argument in the OP. I tried to revise the OP to emphasize that the first section of the article is about LM agent improvements that are relevant to engineering better scaffolding rather than improving our ability to optimize such agents end to end.
      I’ve seen little evidence of this so far, and don’t think current LLM performance is even that well-characterized by this. This would be great, but I don’t currently think its true.
      If you allow models to think for a while they do much better than if you just ask them to answer the question. By “think for a while” we mean they generate one sentence after another in the same way a human would. Their ability to use chain of thought seems to come essentially entirely from copying human chains of thought rather than e.g. using filler tokens to parallelize cognition or RL fine-tuning teaching them novel cognitive strategies.
      I agree that models also memorize a lot of facts. Almost all the facts they actually use are facts that humans know, which they memorized by observing humans using them or stating them. So I don’t really consider this evidence one way or the other.
      If you want to state any concrete prediction about the future I’m happy to say whether I agree with it. For example:
      I think that the gap between “spit out an answer” and chain of thought / tool use / decomposition will continue to grow. (Even as chain of thought becomes increasingly unfaithful for questions of any fixed difficulty, since models become increasingly able to answer such questions in a single shot.)
      I think there is a significant chance decomposition is a big part of that cluster, say a 50% chance that context-hiding decomposition obviously improves performance by an amount comparable to chain of thought.
      I think that end-to-end RL on task performance will continue to result in models that use superficially human-comprehensible reasoning steps, break tasks into human-comprehensible pieces, and use human interfaces for tools.
      My sense right now is that this feels a bit semantic.
    - Bogdan Ionut Cirstea 5 Dec 2024 10:08 UTC
      0 points
      −4
      Parent
      I do indeed predict that we will see chain-of-thought become less faithful as model capabilities increase, and that other ways of doing the same thing as chain-of-thought but internalized to the model will take over.
      This prediction seems largely falsified as long as transformers remain the dominant architecture, and especially if we deliberately add optimization pressures towards externalized reasoning and against internal, latent reasoning; see e.g. Do Large Language Models Perform Latent Multi-Hop Reasoning without Exploiting Shortcuts? and LLMs Do Not Think Step-by-step In Implicit Reasoning.
      - habryka 5 Dec 2024 16:28 UTC
        2 points
        0
        Parent
        I do not understand your comment at all. Why would it be falsified? Transformers are completely capable of steganography if you apply pressure towards it, which we will (and have done).
        
        In Deepseek we can already see weird things happening in the chain of thought. I will happily take bets that we will see a lot more of that.
        Bogdan Ionut Cirstea 5 Dec 2024 16:37 UTC
        2 points
        −3
        Parent
        I’m pointing out that transformers seem really bad at internal multi-hop reasoning; currently they can’t even do 2-hop robustly, 3-hop robustly seems kind of out of the question right now, and scaling doesn’t seem to help much either (see e.g. Figures 2 and 3 in Do Large Language Models Perform Latent Multi-Hop Reasoning without Exploiting Shortcuts? and also how much more robust and scalable CoT reasoning is). So ‘chain-of-thought but internalized to the model will take over’ seems very unlikely with transformers, and much more so if basic mitigations like unlearning (e.g. of memorized facts about deceptive learning, as suggested in https://www.lesswrong.com/posts/9AbYkAy8s9LvB7dT5/the-case-for-unlearning-that-removes-information-from-llm#Information_you_should_probably_remove_from_the_weights, to mitigate the possibility of latent scheming) were applied.
        Steganography is a separate threat model, but even there I’d interpret current evidence (e.g. Preventing Language Models From Hiding Their Reasoning) as mostly positive (as in, even relatively simple mitigations like paraphrasing seem to go very far).
        habryka 5 Dec 2024 16:48 UTC
        3 points
        0
        Parent
        Transformers are obviously capable of doing complicated internal chains of reasoning. Just try giving them a difficult problem and force them to start their answer in the very next token. You will see no interpretable or visible traces of their reasoning, but they will still get it right for almost all questions.
        
        Visible CoT is only necessary for the frontier of difficulty. The rest is easily internalized.
        Bogdan Ionut Cirstea 5 Dec 2024 16:53 UTC
        2 points
        −2
        Parent
        I don’t dispute that transformers can memorize shortcuts. I do dispute their ability to perform latent (opaque) multi-hop reasoning robustly. And I think this should be (very) non-controversial; e.g. Mor Geva has many papers on this topic.
        habryka 5 Dec 2024 17:02 UTC
        4 points
        0
        Parent
        What is plausibly a valid definition of multi-hop reasoning that we care about and that excludes getting mathematical proofs right and answering complicated never-before-seen physics questions and doing the kind of thing that a smaller model needed to do a CoT for?
        Bogdan Ionut Cirstea 5 Dec 2024 17:58 UTC
        2 points
        0
        Parent
        The kind of instrumental reasoning required for alignment faking seems relevant, including through n-hop latent reasoning; see e.g. section ‘B.1.3 HIDDEN SCHEMING REASONING’ from Towards evaluations-based safety cases for AI scheming. I wouldn’t be too surprised if models could currently bypass this through shortcuts, but a mix of careful data filtering + unlearning of memorized facts about deceptive learning, as suggested in https://www.lesswrong.com/posts/9AbYkAy8s9LvB7dT5/the-case-for-unlearning-that-removes-information-from-llm#Information_you_should_probably_remove_from_the_weights, could force them to externalize their reasoning (which should be much easier to monitor than latent reasoning), if they were to try to alignment-fake; though steganography would also be another threat model here, as discussed e.g. in section ‘B.1.2 OBFUSCATED SCHEMING REASONING’ of Towards evaluations-based safety cases for AI scheming.
  - Bogdan Ionut Cirstea 18 Jul 2024 23:41 UTC
    1 point
    0
    Parent
    E.g. I’d bet on increasingly wide gaps between each of LM snap judgments, chain of thought, and tool-use
    Some evidence in favor: https://x.com/YangjunR/status/1793681241398788319 (for increasingly wide gap between LM single forward pass (‘snap judgment’) and CoT), https://xwang.dev/mint-bench/ (for tool use being increasingly useful, with both model scale, and number of tool use turns).
- paulfchristiano 31 Jul 2023 17:18 UTC
  LW: 11 AF: 7
  5
  AF Parent
  I changed the section to try to make it a bit more clear that I mean “understanding of LM agents.” For the purpose of this post, I am trying to mostly talk about things like understanding the capabilities and limitations of LM agents, and maybe even incidental information about decomposition and prompting that help overcome these limitations. This is controversial because it may allow people to build better agents, but I think this kind of understanding is helpful if people continue to build such agents primarily out of chain of thought and decomposition, while not having much impact on our ability to optimize end-to-end.