We selected this behaviour because a 4-layer attention-only toy model could do the task while a 3-layer one could not.
I’m a bit confused why this happens, if the circuit only “needs” three layers of composition. Relatedly, do you have thoughts on why head 1.4 implements both the induction behavior and the fuzzy previous token behavior?
Yep, it seems to be a coincidence that only the 4-layer model learned this and the 3-layer one did not. As Neel said, I would expect the 3-layer model to learn it if you give it more width / more heads.
We also later checked networks with MLPs, and it turns out the 3-layer gelu models (same properties except for the MLPs) can do the task just fine.
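(A minimal sketch of such a check with TransformerLens: the model names attn-only-3l / attn-only-4l / gelu-3l are the toy models discussed here, while the prompt is an invented stand-in for the docstring task rather than a prompt from the post.)

```python
# Minimal sketch of the comparison, not the original evaluation: the real
# check used many prompts; this prompt and its argument names are invented.
from transformer_lens import HookedTransformer

prompt = (
    "def update(self, name, value, items, flag):\n"
    '    """example docstring\n'
    "\n"
    "    :param name: first description\n"
    "    :param value: second description\n"
    "    :param"
)
answer = " items"  # correct continuation: the third argument name

for model_name in ["attn-only-3l", "attn-only-4l", "gelu-3l"]:
    model = HookedTransformer.from_pretrained(model_name)
    logits = model(prompt, return_type="logits")[0, -1]  # next-token logits
    top_id = logits.argmax().item()
    # Compare against the first token of the answer string.
    answer_id = model.to_tokens(answer, prepend_bos=False)[0, 0].item()
    print(
        f"{model_name}: top prediction {model.tokenizer.decode([top_id])!r}, "
        f"predicts correct argument: {top_id == answer_id}"
    )
```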
I’m a bit confused why this happens, if the circuit only “needs” three layers of composition
I trained these models on only 22B tokens, of which only about 4B was Python code, and their residual stream has width 512. It totally wouldn't surprise me if it just didn't have enough data or capacity in 3L, even though it was technically capable.
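(A minimal sketch for checking the architecture side of this, assuming the TransformerLens names attn-only-3l and attn-only-4l; the training-token counts quoted above are not stored in the model config.)

```python
# Minimal sketch: read the architecture of the toy models off their
# TransformerLens configs (training-data statistics are not in the config).
from transformer_lens import HookedTransformer

for model_name in ["attn-only-3l", "attn-only-4l"]:
    cfg = HookedTransformer.from_pretrained(model_name).cfg
    print(f"{model_name}: n_layers={cfg.n_layers}, d_model={cfg.d_model}, n_heads={cfg.n_heads}")
```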
Cool work, thanks for writing it up and posting!
Ah, that makes sense!