I have a kinda symmetric feeling about “practical” research. “Okay, you have found that a one-layer transformer without MLPs approximates skip-trigram statistics; how does that generalize to the question ‘does GPT-6 want to kill us all?’” (I understand this feeling is not rational; it just shows my general inclination towards “theoretical” work.)
I understand this is more an illustration than a question, but I’ll try answering it anyway because I think there’s something informative about different perspectives on the problem :-)
Skip-trigrams are a foundational piece of induction heads, which are themselves a key mechanism for in-context learning. A Mathematical Framework for Transformer Circuits was published less than a year ago; IMO subsequent progress is promising, and mechanistic interpretability has been picked up by independent researchers and other labs (e.g. Redwood’s project on GPT-2-small).
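To unpack the jargon a little: a skip-trigram is a pattern [A] … [B] → [C], where attention from the current token B back to some earlier token A raises the predicted probability of C as the next token. The sketch below is a rough illustration only: it counts the corresponding corpus statistic directly, it is not the circuits analysis itself, and `skip_trigram_counts` is a hypothetical helper written just for this example.

```python
from collections import Counter, defaultdict

def skip_trigram_counts(tokens, max_skip=8):
    """Tally skip-trigram patterns [A] ... [B] -> [C]: for each earlier token A
    within `max_skip` positions of the current token B, count which token C
    actually follows B."""
    counts = defaultdict(Counter)
    for i in range(1, len(tokens) - 1):
        b, c = tokens[i], tokens[i + 1]
        for j in range(max(0, i - max_skip), i):
            counts[(tokens[j], b)][c] += 1
    return counts

# Induction-flavoured example: having seen "Harry Potter" earlier in the
# context, the skip-trigram [Harry] ... [Harry] -> [Potter] supports
# predicting "Potter" the next time "Harry" shows up.
tokens = "Harry Potter went home . Harry Potter said hi".split()
stats = skip_trigram_counts(tokens)
print(stats[("Harry", "Harry")])  # Counter({'Potter': 1})
```

The claim about attention-only one-layer transformers is (roughly) that they can only implement soft, learned versions of lookups like this on top of bigram statistics, which is why the skip-trigram framing fits them so well.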
Of course the skip-trigram result isn’t itself an answer to the question of whether some very capable ML system is planning to deceive the operator or seize power, but I claim it’s analogous to a lemma in a paper that establishes a field, and that said field is one of our most important tools for x-risk mitigation. This was even our hope at the time, though I expected both the research and the field-building to go more slowly; actual events are something like a 90th-percentile outcome relative to my expectations in October 2021.[1]
Finally, while I deeply appreciate theoretical/conceptual research as a complement to empirical and applied research and want both, how on earth is either meant to help alone? If we get a conceptual breakthrough but don’t know how to build (and verify that we’ve correctly built) the thing, we’re still screwed; conversely, if we get really good at building stuff and verifying our expectations but don’t anticipate some edge case like FDT-based cooperation, then we’re still screwed. Efforts which integrate both at least have a chance, if nobody else does something stupid first.
I still think it’s pretty unlikely (credible interval 0–40%) that we’ll have good enough interpretability tools by the time we really, really need them, but I don’t see any mutually exclusive options which are better.
Nitpick:
This link was probably meant to go to the induction heads and in-context learning paper?
Fixed, thanks; it links to the Transformer Circuits thread, which includes the induction heads paper as well as SoLU and Toy Models of Superposition.