I enjoyed reading this a lot.
I would be interested in a quantitative experiment showing what percentage of the model's performance is explained by this linearity assumption. For example, identify all output weight directions that correspond to "fire", project them out only along the direct path to the output (and not along the paths through later heads/MLPs), and see whether this tanks accuracy on sentences where the next token is "fire".
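Concretely, one simple version of what I have in mind would be to project each component's output onto the "fire" unembedding direction and subtract that contribution from the final logits only. A rough sketch below (just a sketch; `component_out`, `W_U`, and `fire_token_id` are placeholder names, not anything from the post):

```python
import torch

def fire_direct_path_ablation(component_out: torch.Tensor,
                              W_U: torch.Tensor,
                              fire_token_id: int) -> torch.Tensor:
    """Remove the 'fire'-aligned part of one component's *direct* contribution to the logits.

    component_out: [batch, d_model] output a head/MLP writes to the residual stream.
    W_U:           [d_model, d_vocab] unembedding matrix.
    Returns the logit correction to subtract from the final logits, so the edit only
    touches the direct path to the output and leaves later heads/MLPs unaffected.
    """
    u_fire = W_U[:, fire_token_id]
    u_fire = u_fire / u_fire.norm()            # unit "fire" direction in d_model space
    coeff = component_out @ u_fire             # [batch] projection onto that direction
    fire_component = coeff[:, None] * u_fire   # [batch, d_model] part to ablate
    return fire_component @ W_U                # its direct-path contribution to the logits

# Usage sketch: logits_ablated = logits - fire_direct_path_ablation(head_out, W_U, fire_id),
# then compare next-token accuracy on sentences where the correct next token is "fire".
```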
I'm also confused about how to interpret this alongside Conjecture's polytope framing. That work suggested that magnitude as well as direction in activation space is important. I know this analysis looks at the weights rather than the activations, but the weights determine the activations, so it seems like the linearity assumption shouldn't hold?
The quantitative experiment you propose is a good idea, and we will be working along these lines, extending the very preliminary experiments in the post on how large an effect edits like this have.
In terms of the polytopes, you are right that this doesn't really fit that framework; it assumes a purely linear, directions-based view. We aren't wedded to any specific viewpoint and are trying a lot of different perspectives to figure out the correct ontology for understanding neural network internals.