graphpatch: a Python Library for Activation Patching

This post is an announcement for a software library. It is likely only relevant to those working, or looking to start working, in mechanistic interpretability.


What is graphpatch?

graphpatch is a Python library for activation patching on arbitrary PyTorch neural network models. It is designed to minimize the boilerplate needed to run experiments that make causal interventions on a model’s intermediate activations. It provides an intuitive API based on the structure of a torch.fx.Graph representation compiled automatically from the original model. For a somewhat silly example, I can make Llama play Taboo by zero-ablating the lm_head output for the token representing “Paris”:

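# Zero out the lm_head output (logits) for the "Paris" token (index 3681)
# at every batch and sequence position.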
with patchable_llama.patch(
  {"lm_head.output": ZeroPatch(slice=(slice(None), slice(None), 3681))}
):
  print(
    tokenizer.batch_decode(
      patchable_llama.generate(
        tokenizer(
          "The Eiffel Tower, located in",
          return_tensors="pt"
        ).input_ids,
        max_length=20,
        use_cache=False,
      )
    )
  )

["<s> The Eiffel Tower, located in the heart of the French capital, is the most visited"]

Why is graphpatch?

graphpatch is a tool I wish had existed when I started my descent into madness (that is, my entry into mechanistic interpretability) with an attempt to replicate ROME on Llama. I hope that by reducing inconveniences, trivial and otherwise, I can both ease entry into the field and lower the cognitive overhead for existing researchers. In particular, I want to make it easier to start running experiments on “off-the-shelf” models without having to handle annoying setup, such as rewriting the model’s Python code to expose intermediate activations, before even getting started. Thus, while graphpatch should work equally well on custom-built research models, I focused on integration with the Huggingface ecosystem.

How do I graphpatch?

graphpatch is available on PyPI and can be installed with pip:

pip install graphpatch

You can read an overview on the GitHub page for the project. Full documentation is available on Read the Docs.

I have also provided a Docker image that might be useful for starting your own mechanistic interpretability experiments on cloud hardware. See this directory for some scripts and notes on my development process, which may be adaptable to your own use case.