I have a kinda symmetric feeling about “practical” research. “Okay, you have found that a one-layer transformer without MLPs approximates skip-trigram statistics; how does that generalize to the question ‘does GPT-6 want to kill us all?’” (I understand this feeling is not rational; it just shows my general inclination towards “theoretical” work.)
I understand this is more an illustration than a question, but I’ll try answering it anyway because I think there’s something informative about different perspectives on the problem :-)
Skip-trigrams are a foundational piece of induction heads, which are themselves a key mechanism for in-context learning. A Mathematical Framework for Transformer Circuits was published less than a year ago; IMO subsequent progress is promising, and mechanistic interpretability has been picked up by independent researchers and other labs (e.g. Redwood’s project on GPT-2-small).
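To unpack the jargon a little: a skip-trigram is a pattern [A] … [B] → [C], where attention from the current token B back to some earlier token A raises the predicted probability of C as the next token. The sketch below is a rough illustration only: it counts the corresponding corpus statistic directly, it is not the circuits analysis itself, and `skip_trigram_counts` is a hypothetical helper written just for this example.

```python
from collections import Counter, defaultdict

def skip_trigram_counts(tokens, max_skip=8):
    """Tally skip-trigram patterns [A] ... [B] -> [C]: for each earlier token A
    within `max_skip` positions of the current token B, count which token C
    actually follows B."""
    counts = defaultdict(Counter)
    for i in range(1, len(tokens) - 1):
        b, c = tokens[i], tokens[i + 1]
        for j in range(max(0, i - max_skip), i):
            counts[(tokens[j], b)][c] += 1
    return counts

# Induction-flavoured example: having seen "Harry Potter" earlier in the
# context, the skip-trigram [Harry] ... [Harry] -> [Potter] supports
# predicting "Potter" the next time "Harry" shows up.
tokens = "Harry Potter went home . Harry Potter said hi".split()
stats = skip_trigram_counts(tokens)
print(stats[("Harry", "Harry")])  # Counter({'Potter': 1})
```

The claim about attention-only one-layer transformers is (roughly) that they can only implement soft, learned versions of lookups like this on top of bigram statistics, which is why the skip-trigram framing fits them so well.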
Of course the skip-trigram result isn’t itself an answer to the question of whether some very capable ML system is planning to deceive the operator or seize power, but I claim it’s analogous to a lemma in a paper that establishes a field, and that said field is one of our most important tools for x-risk mitigation. This was even our hope at the time, though I expected both the research and the field-building to go more slowly; actual events are something like a 90th-percentile outcome relative to my expectations in October 2021.[1]
Finally, while I deeply appreciate theoretical/conceptual research as a complement to empirical and applied research and want both, how on earth is either meant to help alone? If we get a conceptual breakthrough but don’t know how to build (and verify that we’ve correctly built) the thing, we’re still screwed; conversely, if we get really good at building stuff and verifying our expectations but don’t anticipate some edge case like FDT-based cooperation, then we’re still screwed. Efforts which integrate both at least have a chance, if nobody else does something stupid first.
I still think it’s pretty unlikely (credible interval 0–40%) that we’ll have good enough interpretability tools by the time we really, really need them, but I don’t see any mutually exclusive options which are better.
Nitpick:
This link was probably meant to go to the induction heads and in-context learning paper?
Fixed, thanks; it links to the Transformer Circuits thread, which includes the induction heads paper as well as SoLU and Toy Models of Superposition.