Interesting stuff. The nonlinearity of requiring long sequences of tokens doesn’t seem to be a fatal objection to measuring long sequences, because often we’re interested in capabilities that really do require getting long sequences all correct. But from the perspective of predicting capabilities, this is definitely a point for team straight lines on graphs.
Interesting stuff. The nonlinearity of requiring long sequences of tokens doesn’t seem to be a fatal objection to measuring long sequences, because often we’re interested in capabilities that really do require getting long sequences all correct. But from the perspective of predicting capabilities, this is definitely a point for team straight lines on graphs.