Neel Nanda comments on A circuit for Python docstrings in a 4-layer attention-only transformer

Neel Nanda 20 Feb 2023 21:18 UTC
2 points
1

I’m a bit confused why this happens, if the circuit only “needs” three layers of composition

I trained these models on only 22B tokens, of which only about 4B was Python code, and their residual stream has width 512. It totally wouldn’t surprise me if the 3L model just didn’t have enough data or capacity, even though it was technically capable.
- LawrenceC 20 Feb 2023 21:31 UTC
  2 points
  0
  Ah, that makes sense!