Something worth reemphasizing for folks not in the field is that these benchmarks are not like usual benchmarks, where you train the model on the task and then see how well it does on a held-out set. Chinchilla was not explicitly trained on any of these problems. It's typically given some context like:
“Q: What is the southernmost continent?
A: Antarctica
Q: What is the continent north of Africa?
A:”
and then simply completes the prompt until a stop token is emitted, like a newline character.
And it performs above the average human on these benchmarks.
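For concreteness, here is a minimal sketch of what that evaluation loop looks like in Python. The `generate_token` function is a hypothetical stand-in for the model's next-token sampler (Chinchilla's actual interface isn't public); everything else is just the decoding loop described above.

```python
# A minimal sketch of few-shot evaluation: no task-specific training,
# just a prompt completed token by token until a stop token appears.

FEW_SHOT_PROMPT = (
    "Q: What is the southernmost continent?\n"
    "A: Antarctica\n"
    "Q: What is the continent north of Africa?\n"
    "A:"
)

def complete(prompt, generate_token, stop="\n", max_tokens=20):
    """Extend the prompt one token at a time until the stop string
    (here, a newline) is emitted or a length cap is hit."""
    answer = ""
    for _ in range(max_tokens):
        # generate_token is hypothetical: it takes the text so far
        # and returns the model's next token as a string.
        token = generate_token(prompt + answer)
        if stop in token:
            break
        answer += token
    return answer.strip()

# Usage: complete(FEW_SHOT_PROMPT, generate_token=my_model_fn)
# For the prompt above, a well-trained model should return "Europe".
```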