Great post.
I have a question. Suppose we want to create a decent language model that is as small as possible (small enough to run on a cell phone, say). We could try to compensate for the small size by scaling data to infinity. Now, we may run out of data, but if we do, we can generate more data artificially using a much larger LM. For example, consider training something BERT-sized on artificial data generated by PaLM (assume we have a very high compute budget in the training phase).
How well should we expect this to perform? If we plug into the formulas above, it seems like 100M parameters (the size of BERT base, I think?) is hopelessly small and will never get anywhere, whereas at 1B we might approach "almost GPT-3" given infinite data, and at 10B we have a realistic shot. Did I do this right? And what value should I plug in for the data-limited loss term, given that the data is not actually limited (PaLM can generate as much as we want) but is lower quality (it's generated by PaLM rather than being "real")?
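Here's the back-of-the-envelope calculation I'm doing, in case I'm misreading the post. I'm assuming the Chinchilla-style parametric loss L(N, D) = E + A/N^α + B/D^β with the constants fitted by Hoffmann et al. (2022); in the infinite-data limit the B/D^β term vanishes:

```python
# Back-of-the-envelope check, assuming the Chinchilla parametric fit
#   L(N, D) = E + A / N**alpha + B / D**beta
# with the constants reported by Hoffmann et al. (2022).
# Treat the outputs as rough estimates, not predictions.
E, A, B = 1.69, 406.4, 410.7
alpha, beta = 0.34, 0.28

def loss(N, D=float("inf")):
    data_term = 0.0 if D == float("inf") else B / D**beta
    return E + A / N**alpha + data_term

for N in (100e6, 1e9, 10e9):
    print(f"N = {N:.0e}, D -> inf: loss ~ {loss(N):.2f}")

# Rough GPT-3-scale point for comparison (~175B params, ~300B tokens):
print(f"GPT-3-ish: loss ~ {loss(175e9, 300e9):.2f}")
```

Under this fit I get roughly 2.5, 2.0, and 1.85 at 100M, 1B, and 10B with unlimited data, versus about 2.0 for a GPT-3-sized run, which is where my "almost GPT-3 at 1B" impression comes from.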
Also, is 1B parameters equal to around 4 GB of storage? What's the conversion? Could we imagine a 1B-parameter model running on high-end cell phones a few years from now, or would a forward pass be too slow without fancy TPUs?
You’re describing a data-augmentation variant of teacher-student knowledge distillation. It can work well.
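A minimal sketch of what that looks like in practice, using publicly available stand-ins (PaLM isn't public and BERT is an encoder, so the model names below are just placeholders for "a much larger teacher" and "a ~100M-parameter student"):

```python
# Sequence-level distillation via synthetic data: a large "teacher" LM
# generates text and a small "student" LM is trained on it with the
# ordinary next-token prediction loss. Model names, prompt, and
# hyperparameters are illustrative placeholders, not a recipe.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

teacher_name = "gpt2-large"  # stand-in for "a much larger LM" (PaLM is not public)
student_name = "gpt2"        # stand-in for a ~100M-parameter student

tok = AutoTokenizer.from_pretrained(teacher_name)
teacher = AutoModelForCausalLM.from_pretrained(teacher_name).to(device).eval()
student = AutoModelForCausalLM.from_pretrained(student_name).to(device).train()
opt = torch.optim.AdamW(student.parameters(), lr=1e-4)

prompt = tok("The", return_tensors="pt").input_ids.to(device)

for step in range(1000):
    # 1) Teacher samples a batch of synthetic "training documents".
    with torch.no_grad():
        synthetic = teacher.generate(
            prompt.repeat(8, 1), do_sample=True, top_p=0.95,
            max_length=128, pad_token_id=tok.eos_token_id,
        )
    # 2) Student is trained to model the synthetic text; passing
    #    labels=input_ids gives the standard shifted cross-entropy loss.
    #    (Padding/masking is omitted for brevity.)
    out = student(input_ids=synthetic, labels=synthetic)
    out.loss.backward()
    opt.step()
    opt.zero_grad()
```

The variant where the student also matches the teacher's full output distribution (soft targets) usually works even better, but generating text and training on it as above is the simplest version of what you describe.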
The most commonly supported format is 16 bits per parameter, but 8-bit quantization can also be used.
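So the conversion is just parameter count times bytes per parameter; rough numbers for the raw weights only, ignoring activations and runtime overhead:

```python
# Storage for the raw weights of a 1B-parameter model at common precisions.
params = 1e9
for fmt, bytes_per_param in [("fp32", 4), ("fp16/bf16", 2), ("int8", 1)]:
    print(f"{fmt}: {params * bytes_per_param / 1e9:.0f} GB")
# fp32: 4 GB, fp16/bf16: 2 GB, int8: 1 GB
```

The ~4 GB figure corresponds to 32-bit weights; at 16 or 8 bits it drops to about 2 GB or 1 GB.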
Performance depends not only on the number of parameters but also on the architecture.
High-end smartphones commonly have special-purpose processors for neural-network inference, so their performance is not bad.