When we compare results from PaLM 540B to our own identically trained 62B and 8B model variants, improvements are typically log-linear. This alone suggests that we have not yet reached the plateau of the scaling curve. However, on a number of benchmarks the improvements are discontinuous: the gains from 8B to 62B are very modest, but then jump dramatically when scaling to 540B. This suggests that certain capabilities of language models only emerge when they are trained at sufficient scale, and that additional capabilities could emerge from future generations of models.
Examples of tasks that showed discontinuous improvement include english_proverbs (guess which proverb from a list best describes a text passage, which requires a very high level of abstract thinking) and logical_sequence (order a set of "things", such as months, actions, numbers, or letters, into their logical ordering).
An example of a logical_sequence task:
Input: Which of the following lists is correctly ordered chronologically? (a) drink water, feel thirsty, seal water bottle, open water bottle (b) feel thirsty, open water bottle, drink water, seal water bottle (c) seal water bottle, open water bottle, drink water, feel thirsty
Over all 150 tasks [in BIG-bench], 25% of tasks had a discontinuity greater than +10%, and 15% had a discontinuity greater than +20%.
Discontinuity = (actual accuracy of the 540B model) - (log-linear projection from 8B → 62B)
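To make this metric concrete, here is a minimal sketch of how the discontinuity could be computed, assuming the log-linear projection is a straight-line extrapolation of accuracy against log(parameter count) through the 8B and 62B points. The function names and the example accuracies are hypothetical illustrations, not PaLM's actual code or benchmark numbers.

```python
import math

# Approximate parameter counts for the three PaLM variants.
PARAMS = {"8B": 8e9, "62B": 62e9, "540B": 540e9}

def loglinear_projection(acc_8b: float, acc_62b: float) -> float:
    """Extrapolate 540B accuracy by fitting a line to accuracy vs. log(params)
    through the 8B and 62B points."""
    slope = (acc_62b - acc_8b) / (math.log(PARAMS["62B"]) - math.log(PARAMS["8B"]))
    return acc_62b + slope * (math.log(PARAMS["540B"]) - math.log(PARAMS["62B"]))

def discontinuity(acc_8b: float, acc_62b: float, acc_540b: float) -> float:
    """Discontinuity = actual 540B accuracy minus the log-linear projection."""
    return acc_540b - loglinear_projection(acc_8b, acc_62b)

# Hypothetical accuracies for an "emergent" task (not real benchmark numbers):
print(f"{discontinuity(acc_8b=0.25, acc_62b=0.30, acc_540b=0.65):+.2f}")
# Prints roughly +0.30: the 540B model far exceeds what the 8B -> 62B trend predicts.
```

A large positive value under this definition is exactly the discontinuous, "emergent" behavior described above, while a value near zero means the task followed the ordinary log-linear scaling trend.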