Interesting, thanks! Like, this lets the model somewhat localise the scaling effect, so there’s not a ton of interference? This seems maybe linked to the results on Emergent Features in the residual stream.