Could it be that Chris’s diagram gets recovered if the vertical scale is “total interpretable capabilities”? Like maybe tiny transformers are more interpretable in that we can understand ~all of what they’re doing, but they’re not doing much, so maybe it’s still the case that the amount of capability we can understand has a valley and then a peak at higher capability.
As in, the ratio (interpretable capabilities / total capabilities) still asymptotes to zero, but the number of interpretable capabilities goes up (and then maybe back down) as models gain more capabilities?
Yeah. Or maybe the ratio doesn’t even asymptote to zero, but it isn’t increasing.
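A minimal numerical sketch of the shape being discussed, under an assumed toy parametrization (total capabilities growing polynomially with scale, interpretable fraction decaying exponentially); none of the specific numbers or functional forms come from the conversation itself:

```python
import numpy as np

# Toy illustration (assumed parametrization, not from the discussion):
# "scale" indexes model size; total capabilities grow polynomially while
# the interpretable *fraction* decays toward zero. The absolute number of
# interpretable capabilities can still rise and then fall.
scale = np.linspace(0.5, 10, 20)
total = scale**3                  # total capabilities (arbitrary units)
fraction = np.exp(-scale)         # interpretable fraction -> 0
interpretable = total * fraction  # absolute interpretable capabilities

for s, t, f, i in zip(scale, total, fraction, interpretable):
    print(f"scale={s:5.2f}  total={t:8.1f}  fraction={f:6.4f}  interpretable={i:6.3f}")

# In this toy setup, `interpretable` peaks around scale = 3 and then
# declines, even though total capabilities keep growing and the ratio
# keeps shrinking toward zero.
```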