Yep, it seems to be a coincidence that only the 4-layer model learned this and the 3-layer one did not. As Neel said, I would expect the 3-layer model to learn it if you gave it more width or more heads.
We also later checked networks with MLPs, and it turns out the 3-layer GELU models (same properties apart from having MLPs) can do the task just fine.