Model B has 8 times the aspect ratio [...] which falls under the reported range in Kaplan et al
Nice, this is explained under Figure 5, in particular:
The loss varies only a few percent over a wide range of shapes. [...] an (nlayer, dmodel) = (6, 4288) reaches a loss within 3% of the (48, 1600) model
(I previously missed this point and assumed that shape had to be chosen optimally for parameter count to fit the scaling laws.)
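For a rough sense of how different the two quoted shapes are, here is a quick sketch. It assumes the standard approximation from Kaplan et al. for non-embedding parameters, N ≈ 12 · nlayer · dmodel² (which itself assumes d_attn = dmodel and d_ff = 4 · dmodel), and takes "aspect ratio" to mean dmodel / nlayer. The function names are my own, purely for illustration.

```python
def non_embedding_params(n_layer: int, d_model: int) -> int:
    """Approximate non-embedding parameter count, N ~ 12 * n_layer * d_model^2.

    Assumes d_attn = d_model and d_ff = 4 * d_model.
    """
    return 12 * n_layer * d_model ** 2


def aspect_ratio(n_layer: int, d_model: int) -> float:
    """Width-to-depth ratio, d_model / n_layer."""
    return d_model / n_layer


# The two (nlayer, dmodel) shapes from the quoted passage.
shapes = {"wide": (6, 4288), "deep": (48, 1600)}

for name, (n_layer, d_model) in shapes.items():
    n = non_embedding_params(n_layer, d_model)
    ar = aspect_ratio(n_layer, d_model)
    print(f"{name}: N ~ {n / 1e9:.2f}B params, aspect ratio ~ {ar:.1f}")

# How many times larger the wide model's aspect ratio is.
ratio = aspect_ratio(*shapes["wide"]) / aspect_ratio(*shapes["deep"])
print(f"aspect ratio differs by a factor of ~{ratio:.1f}")
```

Under this approximation both shapes land at roughly 1.3B to 1.5B non-embedding parameters, while their aspect ratios differ by a factor of about 21, so an 8x difference in aspect ratio sits well inside the range that comparison spans.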