About 4T parameters, which is 8 TB in BF16. With about 100x more compute than Llama 3 405B, Chinchilla scaling gives a roughly 10x larger model; the correction from a higher tokens/parameter ratio is relatively small (and in this case roughly cancels the extra 1.5x, since the compute is actually about 150x rather than 100x).
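A minimal sketch of this arithmetic, using rough figures for Llama 3 405B (~15.6T tokens, so ~3.8e25 FLOPs by the 6ND rule of thumb); the exact numbers are illustrative assumptions, not precise training configs:

```python
# Rough sketch of the parameter-count arithmetic above. Chinchilla-style
# scaling puts the compute-optimal parameter count roughly ~ sqrt(compute).

llama3_params = 405e9                         # Llama 3 405B
llama3_flops = 6 * llama3_params * 15.6e12    # ~3.8e25 FLOPs (6*N*D rule of thumb)

target_flops = 6e27                           # ~150x Llama 3 405B's compute
compute_ratio = target_flops / llama3_flops

# Chinchilla-optimal: params scale ~ sqrt(compute), so ~12x more params at ~150x.
chinchilla_params = llama3_params * compute_ratio ** 0.5

# A higher tokens/parameter ratio shaves this back toward ~10x, i.e. ~4T params.
approx_params = 4e12
bf16_bytes = approx_params * 2                # 2 bytes per parameter in BF16 -> ~8 TB

print(f"compute ratio: {compute_ratio:.0f}x")
print(f"Chinchilla-optimal params: {chinchilla_params/1e12:.1f}T")
print(f"assumed ~4T params -> {bf16_bytes/1e12:.0f} TB in BF16")
```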
Not completely sure if BF16 remains sufficient at 6e27-5e28 FLOPs, as these models will have more layers and larger sums in matrix multiplications. If BF16 doesn't work, the same clusters will offer less compute (at a higher precision). Seems unlikely though, as 3 OOMs of compute only increase model size about 30x, which means roughly 3x more layers and 3x larger matrices (in linear size), which is not that much. There are block number formats like microscaling that might help if this somehow becomes a problem, but their practical usability remains unclear, as everyone is still training in BF16 in practice.
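For intuition, here is a toy sketch of the block-scaling idea behind microscaling-style formats: each small block of values shares a single power-of-two scale, and the per-element values are stored in a narrow format (int8 below as a stand-in; real MX formats use FP8/FP6/FP4 elements and hardware support, and training recipes differ in detail):

```python
import numpy as np

BLOCK = 32  # microscaling-style formats share one scale per small block

def block_quantize(x: np.ndarray):
    """Quantize a 1-D array in blocks of 32 with shared power-of-two scales."""
    x = x.reshape(-1, BLOCK)
    # Shared scale per block: smallest power of two covering the largest magnitude.
    max_abs = np.abs(x).max(axis=1, keepdims=True)
    scale = 2.0 ** np.ceil(np.log2(max_abs / 127.0 + 1e-30))
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def block_dequantize(q, scale):
    return (q.astype(np.float32) * scale).reshape(-1)

x = np.random.randn(1024).astype(np.float32)
q, s = block_quantize(x)
err = np.abs(block_dequantize(q, s) - x).max()
print(f"max abs reconstruction error: {err:.4f}")
```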
In the other direction, there is a Nov 2024 paper suggesting that 7-8 bit precision might be compute optimal at any scale, and that the proper way to adapt to scale is to increase the number of parameters rather than the precision (Section 4.3.2). If this can be made practical at a given scale, there will be 2x more compute, and even more in effective compute, which is essentially the paper's claim. (I don't know how this interacts with scarce data; possibly either higher or lower precision could improve the situation.)
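A back-of-the-envelope version of the "2x more compute" point, under the simplifying assumption that matmul throughput scales inversely with bit width (real speedups depend on hardware and overheads, and the paper's effective-compute claim goes beyond this naive count):

```python
# If the same cluster runs lower-precision matmuls proportionally faster,
# the raw FLOP budget grows as 16 / bits relative to BF16.

def compute_at_precision(base_flops_bf16: float, bits: int) -> float:
    """FLOPs available on the same cluster, assuming throughput ~ 16 / bits."""
    return base_flops_bf16 * 16 / bits

base = 6e27  # BF16 budget from the estimate above
for bits in (16, 8, 7):
    flops = compute_at_precision(base, bits)
    # Chinchilla-style split: params and tokens each grow ~ sqrt(compute).
    growth = (flops / base) ** 0.5
    print(f"{bits}-bit: {flops:.1e} FLOPs, ~{growth:.2f}x params and tokens")
```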