Characterizing stable regions in the residual stream of LLMs

Link post

This research was completed for London AI Safety Research (LASR) Labs 2024. The team was supervised by @Stefan Heimersheim (Apollo Research). Find out more about the program and express interest in upcoming iterations here.

This video is a short overview of the project presented on the final day of LASR Labs. Note that the paper has been updated since then.

Visualization of stable regions in OLMo-7B during training. Colors represent the similarity of model outputs to those produced by three model-generated activations (red, green, blue circles). Each frame shows a 2D slice of the residual stream after the first layer at different stages of training. As training progresses, distinct regions of solid color emerge and the boundaries between them sharpen. See here for more animations in higher quality.

We study the effects of perturbing Transformer activations, building upon recent work by Gurnee, Lindsey, and Heimersheim & Mendel. Specifically, we interpolate between model-generated residual stream activations and measure the change in the model's output. Our initial results suggest that:

  1. The residual stream of a trained Transformer can be divided into stable regions. Within these regions, small changes in model activations lead to minimal changes in output. However, at region boundaries, small changes can lead to significant output differences.

  2. These regions emerge during training and evolve with model scale. Randomly initialized models do not exhibit these stable regions, but as training progresses or model size increases, the boundaries between regions become sharper.

  3. These stable regions appear to correspond to semantic distinctions. Dissimilar prompts occupy different regions, and activations from different regions produce different next token predictions.

  4. While further investigation is needed, these regions appear to be much larger than the polytopes studied by Hanin & Rolnick and Black et al.
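The core experiment can be sketched as follows. This is a minimal illustration, not the paper's implementation: a fixed random MLP stands in for the Transformer layers downstream of the perturbed activation, and two random vectors stand in for model-generated residual stream activations. The idea is the same: linearly interpolate between two activations and track the relative change in the output at each step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the network downstream of the perturbed layer:
# a fixed random 2-layer MLP mapping a residual-stream vector to logits.
# (The paper uses a real Transformer; this is purely illustrative.)
d_model, d_vocab = 64, 100
W1 = rng.normal(size=(d_model, 256)) / np.sqrt(d_model)
W2 = rng.normal(size=(256, d_vocab)) / np.sqrt(256)

def downstream(x):
    """Map a residual-stream vector to output logits."""
    return np.maximum(x @ W1, 0.0) @ W2

def relative_change(base_logits, logits):
    """Relative change in output compared to the unperturbed baseline."""
    return np.linalg.norm(logits - base_logits) / np.linalg.norm(base_logits)

# Two "model-generated" activations (random vectors as stand-ins).
a = rng.normal(size=d_model)
b = rng.normal(size=d_model)
base = downstream(a)

# Interpolate from a to b and record the relative output change.
alphas = np.linspace(0.0, 1.0, 21)
changes = [relative_change(base, downstream((1 - t) * a + t * b))
           for t in alphas]
```

In the stable-region picture, `changes` would stay near zero while the interpolated activation remains inside the starting region, then jump sharply as the path crosses a region boundary.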

Sharpness of boundaries (y-axis) between stable regions seems to increase with the number of parameters (colors) and the number of training tokens (x-axis). On the y-axis we plot the maximum slope of the relative change in the output as we interpolate between two prompts. Data is aggregated across 1,000 randomly sampled pairs of prompts; dots represent the median, and error bars represent the 25th and 75th percentiles.
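The sharpness statistic and its aggregation can be sketched like this. Again a hedged illustration: the interpolation curves here are random stand-ins, whereas in the paper each curve is the relative output change measured along an interpolation path between a real pair of prompts.

```python
import numpy as np

def max_slope(alphas, rel_changes):
    """Maximum finite-difference slope of the relative output change
    along the interpolation path (the 'sharpness' statistic)."""
    return (np.diff(rel_changes) / np.diff(alphas)).max()

rng = np.random.default_rng(0)
alphas = np.linspace(0.0, 1.0, 21)

# Aggregate across 1,000 sampled pairs (random increasing curves as
# stand-ins for measured relative-change profiles).
sharpness = []
for _ in range(1000):
    curve = np.sort(rng.uniform(0.0, 1.0, size=alphas.size))
    sharpness.append(max_slope(alphas, curve))

median = np.median(sharpness)
q25, q75 = np.percentile(sharpness, [25, 75])
```

Plotting the median with 25th/75th-percentile error bars against training tokens, one line per model size, reproduces the structure of the figure described above.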

We believe that studying stable regions can improve our understanding of how neural networks work. The extent to which this understanding is useful for safety is an active topic of discussion [1][2][3].

The updated paper, with additional plots in the appendix, is not yet visible on arXiv, but you can read it here.