“Trends are meaningful, individual data points are not”[1]
Claims like “This model gets x% on this benchmark” or “With this prompt this model does X with probability p” are often meaningless in isolation. Is 60% on a benchmark a lot or not? Hard to tell. Is doing X 20% of the time a lot or not? Go figure.
On the other hand, if you have results like “previous models got 10% and 20% on this benchmark, but our model gets 60%”, then that sure sounds like something. “With this prompt the model does X with 20% probability, but with this modification the probability drops to <1%” also sounds like something, as does “models do more/less of Y as model size increases”.
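When you have the raw counts behind such a contrast, a quick significance check makes it concrete. Here is a minimal sketch (plain Python; the trial counts of 200 per condition are hypothetical, chosen just for illustration) of a two-proportion z-test:

```python
import math

def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """Two-proportion z-test: is the difference between two observed
    rates larger than sampling noise alone would explain?"""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

# Hypothetical counts: 20% of 200 trials vs. <1% (1/200) of 200 trials.
z = two_proportion_z(40, 200, 1, 200)
print(f"z = {z:.1f}")  # z ≈ 6.4, far beyond noise: the drop is real
```

Note that a lone “20%” figure has no analogue of this test: without a comparison condition there is nothing to compare against.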
There are some exceptions: maybe your results really are good enough to stand on their own (xkcd), or maybe it’s interesting that something happens even some of the time (see also this). It’s still a good rule of thumb.
[1] Shoutout to Evan Hubinger for stressing this point to me.