StefanHex comments on A circuit for Python docstrings in a 4-layer attention-only transformer

StefanHex 20 Feb 2023 21:38 UTC
3 points
Yep, it seems to be a coincidence that only the 4-layer model learned this and the 3-layer one did not. As Neel said, I would expect the 3-layer model to learn it if you give it more width or more heads.

We also later checked networks with MLPs, and it turns out the 3-layer GELU models (same properties, except that they have MLPs) can do the task just fine.