That line was indeed quite poorly phrased. It now reads:
At the bottom of the box, blue or red token boxes show the tokens most promoted (blue) and most suppressed (red) by that dimension.
That is, you’re right. The interpretability data for an autoencoder dimension comes from ablating that dimension and seeing which token probabilities are most promoted and suppressed, relative to leaving its activation value alone. That measures an ablation effect, whose sign is the opposite of a promotion effect, so the promotion-effect signs as plotted were flipped.
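To make the sign convention concrete, here is a minimal sketch of the flip. It assumes the two next-token log-probability vectors (one with the dimension's activation left alone, one with it ablated to zero) have already been computed; the function and argument names are illustrative, not from the actual codebase.

```python
import torch

def promotion_effects_from_ablation(
    logprobs_baseline: torch.Tensor,  # (vocab,) next-token log-probs, dimension left alone
    logprobs_ablated: torch.Tensor,   # (vocab,) next-token log-probs, dimension ablated to zero
    k: int = 10,
):
    # Ablation effect: how much each token's log-prob changes when the dimension is removed.
    ablation_effect = logprobs_ablated - logprobs_baseline
    # Promotion effect is the ablation effect with the sign flipped:
    # a token the dimension promotes loses probability when the dimension is ablated.
    promotion_effect = -ablation_effect
    promoted = torch.topk(promotion_effect, k).indices    # tokens most promoted (blue boxes)
    suppressed = torch.topk(-promotion_effect, k).indices  # tokens most suppressed (red boxes)
    return promoted, suppressed
```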