Jett Janiak (Jett Mayzner)
Karma: 135
Characterizing stable regions in the residual stream of LLMs
Evaluating Synthetic Activations composed of SAE Latents in GPT-2
This is such a cool result! I tried to reproduce it in this notebook.
For the two sets of mess3 parameters I checked, the stationary distribution was uniform.
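As a sanity check, uniformity of a stationary distribution is easy to verify numerically. A minimal sketch with numpy, using a hypothetical doubly stochastic transition matrix as a stand-in (the actual mess3 parameters from the notebook are not reproduced here):

```python
import numpy as np

# Hypothetical 3-state transition matrix standing in for a mess3 process.
# It is doubly stochastic (rows and columns sum to 1), which guarantees
# a uniform stationary distribution.
T = np.array([
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
])

# The stationary distribution is the left eigenvector of T with eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(T.T)
idx = np.argmin(np.abs(eigvals - 1.0))
pi = np.real(eigvecs[:, idx])
pi = pi / pi.sum()  # normalize to a probability distribution

print(pi)  # approximately [1/3, 1/3, 1/3], i.e. uniform
```

Plugging in the actual mess3 transition matrices in place of `T` reproduces the check above.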
AISC project: TinyEvals
Polysemantic Attention Head in a 4-Layer Transformer
The terms activation patching, causal tracing, and resample ablation seem to be out of date compared to how you define them in your post on attribution patching.
I believe there are two phenomena happening during training:

1. Predictions corresponding to the same stable region become more similar, i.e. stable regions become more stable. We can observe this in the animations.
2. Existing regions split, resulting in more regions.
I hypothesize that the first could be some kind of error correction: models learn to rectify errors coming from superposition interference or another kind of noise. The second could be interpreted as more capable models picking up on subtler differences between the prompts and adjusting their predictions accordingly.