A couple more takeaways I jotted down:
PaLM 2 closely follows Chinchilla-optimal scaling. The technical report does not state parameter counts, and details of the training data are withheld. It claims performance generally equivalent to GPT-4. Chain-of-thought reasoning is called out explicitly quite a bit.
The report claims a longer context length but gives no specific size. From the API page: “75+ tokens per second and a context window of 8,000 tokens,”
“The largest model in the PaLM 2 family, PaLM 2-L, is significantly smaller than the largest PaLM model but uses more training compute”
“The pre-training corpus is significantly larger than the corpus used to train PaLM [which was 780B tokens]”
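To make the "smaller model, bigger corpus" point concrete, here is a rough back-of-the-envelope sketch. It is not from the PaLM 2 report. It assumes two common rules of thumb associated with the Chinchilla paper: training compute of about 6 × parameters × tokens FLOPs, and a compute-optimal ratio of about 20 tokens per parameter. The 540B parameter count for the largest original PaLM is public; the 780B token figure is the one quoted above. The function names are my own, and nothing here estimates PaLM 2's actual size.

```python
import math

# Rule-of-thumb assumptions (NOT from the PaLM 2 report):
#   training FLOPs ~= 6 * params * tokens
#   compute-optimal ratio ~= 20 tokens per parameter
TOKENS_PER_PARAM = 20.0


def training_flops(params: float, tokens: float) -> float:
    """Approximate training compute in FLOPs using C ~= 6 * N * D."""
    return 6.0 * params * tokens


def chinchilla_optimal(flops: float, tokens_per_param: float = TOKENS_PER_PARAM):
    """Split a compute budget into (params, tokens) at a fixed tokens/param ratio.

    Solves 6 * N * (r * N) = C for N, then sets D = r * N.
    """
    params = math.sqrt(flops / (6.0 * tokens_per_param))
    tokens = tokens_per_param * params
    return params, tokens


# Original PaLM: 540B parameters, 780B training tokens.
palm_params = 540e9
palm_tokens = 780e9
palm_flops = training_flops(palm_params, palm_tokens)

# What the rule of thumb would suggest for the SAME compute budget.
opt_params, opt_tokens = chinchilla_optimal(palm_flops)

print(f"PaLM tokens per parameter: {palm_tokens / palm_params:.2f}")
print(f"PaLM approx. training FLOPs: {palm_flops:.3e}")
print(f"Rule-of-thumb params at same compute: {opt_params / 1e9:.0f}B")
print(f"Rule-of-thumb tokens at same compute: {opt_tokens / 1e12:.2f}T")
```

Under these assumptions, the original PaLM was trained on only about 1.4 tokens per parameter, and the same compute budget would favor a model of roughly 145B parameters trained on roughly 2.9T tokens. That is the same direction as the two quotes above: a much smaller model than PaLM on a much larger corpus. Since PaLM 2-L reportedly uses more compute than PaLM, its real numbers would differ from this illustration.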