Should we expect these decompositions to be even more interpretable if the model was trained to output a prediction as soon as possible? (After any block, instead of outputting the prediction after the full network)
Potentially? My suspicion would be that in this case we would expect the basis in the residual stream to be extremely output-basis aligned while at the moment there is no real pressure for it to be (but it seems to be pretty output-aligned regardless, which is convenient for us). This might be a fun thing to fine-tune on.
Should we expect these decompositions to be even more interpretable if the model was trained to output a prediction as soon as possible? (After any block, instead of outputting the prediction after the full network)
Potentially? My suspicion would be that in this case we would expect the basis in the residual stream to be extremely output-basis aligned while at the moment there is no real pressure for it to be (but it seems to be pretty output-aligned regardless, which is convenient for us). This might be a fun thing to fine-tune on.