If you compute the logit over a range that is not [0.0, 1.0] but [low performance, high performance], you get a bit more predictive power, but it is still confusingly low.
A possible intuition here is that scaling produces a transition from non-zero performance to non-perfect performance. This seems right, since the random baseline is not 0.0 and reaching perfect accuracy is impossible.
I tried this only with PaLM on the NLU tasks, and I used the same adjusted range for all tasks:

[0.9 × overall min. acc., 1.0 − 0.9 × (1.0 − overall max. acc.)] ≈ [0.13, 0.95]
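As a rough illustration, here is a minimal Python sketch of this adjusted-range logit: accuracies are rescaled into (low, high) before the logit transform, and a line is fit against log model size. The data points and the overall min/max accuracies below are made-up placeholders, not the actual PaLM NLU numbers.

```python
import numpy as np

# Hypothetical accuracies for one task across three model sizes (placeholders).
params = np.array([8e9, 62e9, 540e9])
acc = np.array([0.35, 0.55, 0.78])

# Adjusted range from the formula above (overall min/max acc. are placeholders).
overall_min, overall_max = 0.14, 0.94
low = 0.9 * overall_min                  # ~0.13
high = 1.0 - 0.9 * (1.0 - overall_max)   # ~0.95

def adjusted_logit(a, low, high):
    """Logit of accuracy rescaled to (low, high) instead of (0, 1)."""
    p = (a - low) / (high - low)
    return np.log(p / (1.0 - p))

# If the adjusted-range logit model held, these points would fall on a line
# in log(model size); the residuals measure how predictive the fit is.
slope, intercept = np.polyfit(np.log10(params), adjusted_logit(acc, low, high), deg=1)
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```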
Even if this model were true, there may be other explanations. For example, the improvement on one task may not be modeled by a single logit function but by several of them: a task would be composed of sub-tasks, each modelable by its own logit function. If this makes sense, one could try to model the improvements across all tasks using only a small number of logit curves, one per sub-task (decomposing each task into a set of sub-tasks, each with a simple trend), as in the sketch below.
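To make the sub-task decomposition concrete, here is a toy Python sketch where a task's accuracy curve is a weighted mix of per-sub-task logistic curves in log model size. The sub-task midpoints, slopes, and weights are all invented for illustration.

```python
import numpy as np

def logistic(log_n, midpoint, slope):
    """A single sub-task's logistic trend in log10(model size)."""
    return 1.0 / (1.0 + np.exp(-slope * (log_n - midpoint)))

def task_accuracy(log_n, subtasks, weights):
    """Model a task as a weighted mix of per-sub-task logistic curves."""
    curves = np.array([logistic(log_n, m, s) for m, s in subtasks])
    return np.average(curves, axis=0, weights=weights)

log_n = np.linspace(8, 13, 6)                      # log10(model size)
subtasks = [(9.5, 2.0), (11.0, 3.0), (12.5, 1.5)]  # (midpoint, slope) per sub-task
weights = [0.5, 0.3, 0.2]                          # share of the task's examples

print(task_accuracy(log_n, subtasks, weights))
```

A task built this way can look jumpy or "emergent" at the task level even though each sub-task follows a simple logistic trend.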
(Also, Gopher looks less predictable and its data is sparser: there are no data points in the tens-of-billions parameter range.)