Jett Janiak (Jett Mayzner)
Karma: 135
Characterizing stable regions in the residual stream of LLMs
Evaluating Synthetic Activations composed of SAE Latents in GPT-2
This is such a cool result! I tried to reproduce it in this notebook.
For the two sets of mess3 parameters I checked, the stationary distribution was uniform.
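As a sanity check, uniformity of a stationary distribution is easy to verify numerically. A minimal sketch with numpy, using a hypothetical doubly stochastic transition matrix as a stand-in (the actual mess3 parameters from the notebook are not reproduced here):

```python
import numpy as np

# Hypothetical 3-state transition matrix standing in for a mess3 process.
# It is doubly stochastic (rows and columns sum to 1), which guarantees
# a uniform stationary distribution.
T = np.array([
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
])

# The stationary distribution is the left eigenvector of T with eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(T.T)
idx = np.argmin(np.abs(eigvals - 1.0))
pi = np.real(eigvecs[:, idx])
pi = pi / pi.sum()  # normalize to a probability distribution

print(pi)  # approximately [1/3, 1/3, 1/3], i.e. uniform
```

Plugging in the actual mess3 transition matrices in place of `T` reproduces the check above.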
AISC project: TinyEvals
Polysemantic Attention Head in a 4-Layer Transformer
The terms activation patching, causal tracing, and resample ablation seem to be out of date compared to how you define them in your post on attribution patching.
I believe there are two phenomena happening during training:

1. Predictions corresponding to the same stable region become more similar, i.e. stable regions become more stable. We can observe this in the animations.
2. Existing regions split, resulting in more regions.
I hypothesize that the first could be some kind of error correction: models learn to rectify errors coming from superposition interference or another kind of noise. The second could be interpreted as more capable models picking up on subtler differences between the prompts and adjusting their predictions accordingly.