We selected this behaviour because a 4-layer attention-only toy model could do the task while a 3-layer one could not.
I’m a bit confused why this happens, if the circuit only “needs” three layers of composition. Relatedly, do you have thoughts on why head 1.4 implements both the induction behavior and the fuzzy previous token behavior?
Yep, it seems to be a coincidence that only the 4-layer model learned this and the 3-layer one did not. As Neel said, I would expect the 3-layer model to learn it if you give it more width / more heads.
We also later checked networks with MLPs, and it turns out the 3-layer gelu models (same properties except for the MLPs) can do the task just fine.
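(A minimal sketch of such a check with TransformerLens: the model names attn-only-3l / attn-only-4l / gelu-3l are the toy models discussed here, while the prompt is an invented stand-in for the docstring task rather than a prompt from the post.)

```python
# Minimal sketch of the comparison, not the original evaluation: the real
# check used many prompts; this prompt and its argument names are invented.
from transformer_lens import HookedTransformer

prompt = (
    "def update(self, name, value, items, flag):\n"
    '    """example docstring\n'
    "\n"
    "    :param name: first description\n"
    "    :param value: second description\n"
    "    :param"
)
answer = " items"  # correct continuation: the third argument name

for model_name in ["attn-only-3l", "attn-only-4l", "gelu-3l"]:
    model = HookedTransformer.from_pretrained(model_name)
    logits = model(prompt, return_type="logits")[0, -1]  # next-token logits
    top_id = logits.argmax().item()
    # Compare against the first token of the answer string.
    answer_id = model.to_tokens(answer, prepend_bos=False)[0, 0].item()
    print(
        f"{model_name}: top prediction {model.tokenizer.decode([top_id])!r}, "
        f"predicts correct argument: {top_id == answer_id}"
    )
```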
I’m a bit confused why this happens, if the circuit only “needs” three layers of composition
I trained these models on only 22B tokens, of which only about 4B was Python code, and their residual stream has width 512. It totally wouldn't surprise me if it just didn't have enough data or capacity in 3L, even though it was technically capable.
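(A minimal sketch for checking the architecture side of this, assuming the TransformerLens names attn-only-3l and attn-only-4l; the training-token counts quoted above are not stored in the model config.)

```python
# Minimal sketch: read the architecture of the toy models off their
# TransformerLens configs (training-data statistics are not in the config).
from transformer_lens import HookedTransformer

for model_name in ["attn-only-3l", "attn-only-4l"]:
    cfg = HookedTransformer.from_pretrained(model_name).cfg
    print(f"{model_name}: n_layers={cfg.n_layers}, d_model={cfg.d_model}, n_heads={cfg.n_heads}")
```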
Cool work, thanks for writing it up and posting!
Ah, that makes sense!