What is a circuit? [in interpretability]

The definition I’m aware of is that “a circuit is a subgraph of a neural network that implements a specific computation.”

In practice (to my understanding), you identify “circuits” by finding components of the neural network whose activity correlates strongly with a particular task, and then ablating those components to check whether they are “causally responsible” for performance on that task.
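As a minimal sketch of the ablation step, here is a hand-built toy network in NumPy. Everything in it is an assumption made for illustration: the weights are chosen by hand rather than trained, the task is XOR, the "components" are single hidden units, and ablation means zeroing a unit's activation (mean or resample ablation are common alternatives). The names `accuracy` and `effects` are hypothetical.

```python
import numpy as np

# Toy task: XOR on two binary inputs.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

# Hand-picked (not trained) weights: 2 inputs -> 3 ReLU hidden units -> 1 output.
# Units 0 and 1 together compute XOR; unit 2 has output weight 0, so it is irrelevant.
W1 = np.array([[1.0, 1.0, 1.0],
               [1.0, 1.0, -1.0]])
b1 = np.array([0.0, -1.0, 0.0])
w2 = np.array([1.0, -2.0, 0.0])


def accuracy(mask):
    """Task accuracy when hidden activations are multiplied by `mask` (0 = ablated)."""
    hidden = np.maximum(X @ W1 + b1, 0.0) * mask
    preds = (hidden @ w2 > 0.5).astype(int)
    return float((preds == y).mean())


baseline = accuracy(np.ones(3))

# Zero-ablate one hidden unit at a time and record the drop in accuracy.
effects = {}
for unit in range(3):
    mask = np.ones(3)
    mask[unit] = 0.0
    effects[unit] = baseline - accuracy(mask)

print(baseline)  # 1.0
print(effects)   # {0: 0.5, 1: 0.25, 2: 0.0}
```

In this toy case, units 0 and 1 each hurt accuracy when removed and unit 2 does not, which is the sense of "causally responsible" I have in mind. Real networks are far messier, since ablating one component can be compensated for, or amplified, by others.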

It feels like there is also a different way of understanding circuits: morally, a circuit is a sequence of operations applied to features, where the features are the primitives and the operations are mostly the linear maps and nonlinearities of the model architecture (although I can understand other perspectives).

A few questions (forgive my ignorance):

  • If I have a tiny network trained on an algorithmic task, is there an automated search method I can use to identify the subgraphs of the network that do meaningful computation, such that the circuits found are distinct from each other? Does the answer depend on how the network was trained? Is there a way to classify all circuits in a network (or >10% of them) exhaustively, even if doing so is computationally intractable?

  • What is a feature? Do different circuits appear in a network depending on how you define a relevant feature? How crisp are the circuits that appear, both in toy examples and in the wild?

  • What are the best examples of “circuits in the wild” that are actually robust?
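To make the first question concrete, here is a rough sketch of the brute-force search I am imagining, on a hand-built toy network. It is only an illustration under stated assumptions, not an established method: the weights are chosen by hand, the task is XOR, a candidate "circuit" is any subset of hidden units, and a subset counts as faithful if the network keeps 100% accuracy with every other unit zeroed. The names `accuracy`, `faithful`, and `minimal_circuits` are hypothetical. The loop visits all 2^n subsets, so it is hopeless beyond very small n.

```python
from itertools import combinations

import numpy as np

# Toy task: XOR on two binary inputs.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

# Hand-picked (not trained) weights: 2 inputs -> 3 ReLU hidden units -> 1 output.
W1 = np.array([[1.0, 1.0, 1.0],
               [1.0, 1.0, -1.0]])
b1 = np.array([0.0, -1.0, 0.0])
w2 = np.array([1.0, -2.0, 0.0])
n_hidden = 3


def accuracy(kept):
    """Task accuracy when only the hidden units in `kept` are left active."""
    mask = np.array([1.0 if i in kept else 0.0 for i in range(n_hidden)])
    hidden = np.maximum(X @ W1 + b1, 0.0) * mask
    preds = (hidden @ w2 > 0.5).astype(int)
    return float((preds == y).mean())


# Enumerate every subset of hidden units (2**n_hidden of them) and keep
# those that preserve full task accuracy on their own.
faithful = [
    subset
    for size in range(n_hidden + 1)
    for subset in combinations(range(n_hidden), size)
    if accuracy(subset) == 1.0
]

# Among the faithful subsets, report the smallest ones as candidate circuits.
smallest = min(len(s) for s in faithful)
minimal_circuits = [s for s in faithful if len(s) == smallest]

print(faithful)          # [(0, 1), (0, 1, 2)]
print(minimal_circuits)  # [(0, 1)]
```

Even here the answer depends on choices I made up front (units as the candidate components, zero ablation, exact accuracy as the criterion), which is part of why I am asking how crisp the notion is.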