At the bottom of the box, blue or red token boxes show the tokens most promoted (blue) and most suppressed (red) by ablating that dimension.
Is this inverted? Based on the names you gave each graph, it looks like you take the blue tokens to be the ones the feature is an influential vote *for*, as demonstrated by ablation. That would mean you meant "promoted/suppressed by the dimension, as demonstrated by ablation suppressing the promoted tokens, and vice versa." Unless I misread the graphs?
That line was indeed quite poorly phrased. It now reads:
At the bottom of the box, blue or red token boxes show the tokens most promoted (blue) and most suppressed (red) by that dimension.
That is, you’re right. The interpretability data for an autoencoder dimension comes from measuring which token probabilities rise and fall when that dimension is ablated, relative to leaving its activation value alone. That measurement carries an ablation-effect sign, so the promotion effects the plot implies have the opposite sign: tokens the dimension promotes are the ones ablation suppresses, and vice versa.
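In case it helps make the sign convention concrete, here's a minimal sketch of that measurement. Everything in it is illustrative rather than the post's actual pipeline: the toy shapes, the names `W_dec` and `W_U`, and ablation-by-zeroing the dimension's contribution are all my assumptions. The last two lines are the point: the measured quantity is an ablation effect, and negating it gives the promotion effect the blue/red token boxes plot.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_dims, vocab = 64, 512, 1000

# Toy stand-ins (assumptions, not the post's code): an SAE decoder and
# an unembedding matrix, with random weights.
W_dec = rng.normal(size=(n_dims, d_model)) / np.sqrt(d_model)
W_U = rng.normal(size=(d_model, vocab)) / np.sqrt(d_model)

acts = np.abs(rng.normal(size=n_dims))  # autoencoder activations at one position
resid = acts @ W_dec                    # reconstructed residual stream

def log_probs(v):
    """Log-softmax of the next-token logits for residual vector v."""
    logits = v @ W_U
    logits = logits - logits.max()      # numerical stability
    return logits - np.log(np.exp(logits).sum())

dim = 42                                # the dimension being interpreted
baseline = log_probs(resid)             # activation left alone
ablated = log_probs(resid - acts[dim] * W_dec[dim])  # dimension zeroed out

ablation_effect = ablated - baseline    # what the measurement gives
promotion_effect = -ablation_effect     # what the plot shows: sign flipped

order = np.argsort(promotion_effect)
print("most promoted (blue):", order[-5:][::-1])   # token ids
print("most suppressed (red):", order[:5])
```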