paulfchristiano comments on Thoughts on sharing information about language model capabilities

paulfchristiano 31 Jul 2023 18:50 UTC
LW: 12 AF: 6
Although this is an important discussion, I want to emphasize up front that I don’t think it’s closely related to the argument in the OP. I tried to revise the OP to emphasize that the first section of the article is about LM agent improvements that are relevant to engineering better scaffolding rather than to improving our ability to optimize such agents end to end.
I’ve seen little evidence of this so far, and don’t think current LLM performance is even that well-characterized by this. This would be great, but I don’t currently think it’s true.
If you allow models to think for a while they do much better than if you just ask them to answer the question. By “think for a while” we mean they generate one sentence after another in the same way a human would. Their ability to use chain of thought seems to come essentially entirely from copying human chains of thought rather than e.g. using filler tokens to parallelize cognition or RL fine-tuning teaching them novel cognitive strategies.
I agree that models also memorize a lot of facts. Almost all the facts they actually use are facts that humans know, which they memorized by observing humans using them or stating them. So I don’t really consider this evidence one way or the other.
If you want to state any concrete prediction about the future I’m happy to say whether I agree with it. For example:
- I think that the gap between “spit out an answer” and chain of thought / tool use / decomposition will continue to grow. (Even as chain of thought becomes increasingly unfaithful for questions of any fixed difficulty, since models become increasingly able to answer such questions in a single shot.)
- I think there is a significant chance decomposition is a big part of that cluster, say a 50% chance that context-hiding decomposition obviously improves performance by an amount comparable to chain of thought.
- I think that end-to-end RL on task performance will continue to result in models that use superficially human-comprehensible reasoning steps, break tasks into human-comprehensible pieces, and use human interfaces for tools.
My sense right now is that this feels a bit semantic.