I might be missing something but are they not just giving the number of parameters (in millions of parameters) on a log10 scale? Scaling laws are usually by log-parameters, and I suppose they felt that it was cleaner to subtract the constant log(10^6) from all the results (e.g. taking log(1300) instead of log(1.3B)).
I might be missing something but are they not just giving the number of parameters (in millions of parameters) on a log10 scale? Scaling laws are usually by log-parameters, and I suppose they felt that it was cleaner to subtract the constant log(10^6) from all the results (e.g. taking log(1300) instead of log(1.3B)).
The B they put at the end is a bit weird though.