Added to the ‘Evidence for generality’ section after discovering this paper:
One other thing worth noting: we know from ‘The Expressive Power of Transformers with Chain of Thought’ that the transformer architecture is capable of general (though bounded) reasoning under autoregressive conditions. That doesn’t mean LLMs trained on next-token prediction actually learn general reasoning, but it does mean we can’t rule it out as impossible.