The data wall discussion in the podcast applies Chinchilla's 20-tokens-per-parameter rule too broadly and doesn't account for the repetition of data across training epochs. These issues partially cancel out, but new information about either ingredient would shift the amended argument in different ways. I wrote up the argument as a new post; a rough sketch of the cancellation is below.
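To make the cancellation concrete, here is a toy back-of-the-envelope sketch in Python. Everything in it is an illustrative assumption of mine, not a number from the podcast or the post: the unique-token stock and the overtraining ratio are placeholders, the ~4-useful-epochs figure loosely follows Muennighoff et al. (2023), and I'm reading "applies the ratio too broadly" as "actual frontier training uses far more tokens per parameter than 20". The point is only the direction of each correction, not an estimate of where the wall is.

```python
# Toy arithmetic, not the post's actual model: why the two corrections
# to the data-wall estimate partially cancel. All numbers are assumptions.

CHINCHILLA_TOKENS_PER_PARAM = 20    # compute-optimal ratio (Hoffmann et al. 2022)
OVERTRAINED_TOKENS_PER_PARAM = 150  # assumption: frontier runs train well past Chinchilla
UNIQUE_TOKENS = 30e12               # assumed stock of usable unique text tokens
USEFUL_EPOCHS = 4                   # assumption: repeating data up to ~4 epochs
                                    # costs little (cf. Muennighoff et al. 2023)

def largest_trainable_model(token_supply: float, tokens_per_param: float) -> float:
    """Model size (in parameters) at which the token supply is exhausted."""
    return token_supply / tokens_per_param

# Podcast-style estimate: Chinchilla ratio, unique data only.
naive = largest_trainable_model(UNIQUE_TOKENS, CHINCHILLA_TOKENS_PER_PARAM)

# Correction 1: overtraining raises data demand, pulling the wall closer...
overtrained = largest_trainable_model(UNIQUE_TOKENS, OVERTRAINED_TOKENS_PER_PARAM)

# Correction 2: ...while repetition multiplies effective supply, pushing it back.
amended = largest_trainable_model(UNIQUE_TOKENS * USEFUL_EPOCHS,
                                  OVERTRAINED_TOKENS_PER_PARAM)

for label, params in [("naive", naive), ("overtraining only", overtrained),
                      ("amended", amended)]:
    print(f"{label:>17}: ~{params:.1e} parameters")
```

The sketch also shows why new information moves the amended estimate differently: it scales linearly with the number of useful epochs but inversely with the tokens-per-parameter ratio, so updates to the two ingredients don't trade off one-for-one.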