The fact that this is all of the previous probability of 42.2% is key here: I’d suggest normalizing this as −100% (of the previous value).
-80.7%
This is a good chunk, but not all, of the previous 99.9%, so displaying it normalized as roughly −80.8% (80.7/99.9) would make this clearer.
However, the current format is probably better for the upweighted token increases.
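A minimal sketch of that normalization, assuming the figures shown in the diagrams are absolute changes in probability and that the previous probability is available alongside them. The names `normalized_change` and `format_change`, and the "points" label for increases, are hypothetical and not taken from the post.

```python
def normalized_change(prev_prob: float, new_prob: float) -> float:
    """Return the change as a fraction of the previous probability."""
    if prev_prob <= 0:
        raise ValueError("previous probability must be positive")
    return (new_prob - prev_prob) / prev_prob


def format_change(prev_prob: float, new_prob: float) -> str:
    """Hypothetical display rule: relative for drops, absolute for rises."""
    delta = new_prob - prev_prob
    if delta < 0:
        # Downweighted token: show the drop relative to what it had before.
        return f"{normalized_change(prev_prob, new_prob):+.1%} of previous"
    # Upweighted token: keep an absolute change in percentage points.
    return f"{delta * 100:+.1f} points"


print(format_change(0.422, 0.0))  # a token that loses all of its 42.2%
print(format_change(0.10, 0.35))  # an upweighted token (made-up numbers)
```

With this rule, a token that drops from 42.2% to zero is reported as -100.0% of previous, which is the point being made about that case.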
You can always cross-reference more comprehensive interpretability data for any given dimension on Neuronpedia using those two indices.
Could you hotlink the boxes on the diagrams to that, or add the resulting content as hover text on areas in them, or something similar? This might be hard to do on LW: I suspect some JavaScript would be required for this sort of thing, but perhaps a library exists for it?
My workaround was to have the dimension links laid out below each figure.
My current “print to flat .png” approach wouldn’t support hyperlinks, and I don’t think LW supports .svg images.
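For reference, a rough sketch of what linked boxes with hover text could look like if the figures were exported as SVG rather than a flat PNG. It does not address whether LW would accept or render such a file. The box labels, the `example.com` URLs, and the helper names `linked_box` and `build_svg` are all placeholders, not the real figures or Neuronpedia links.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
ET.register_namespace("", SVG_NS)  # emit plain <svg>, <a>, <rect> tags


def linked_box(parent, x, y, w, h, label, url, hover_text):
    """Add a clickable box: an <a> wrapping a <title>, a <rect> and a <text>."""
    # Plain href is SVG 2; older renderers may expect xlink:href instead.
    link = ET.SubElement(parent, f"{{{SVG_NS}}}a", {"href": url})
    # Browsers generally show a <title> child as a tooltip on hover.
    ET.SubElement(link, f"{{{SVG_NS}}}title").text = hover_text
    ET.SubElement(link, f"{{{SVG_NS}}}rect", {
        "x": str(x), "y": str(y), "width": str(w), "height": str(h),
        "fill": "#dde", "stroke": "#333",
    })
    text_attrs = {"x": str(x + 8), "y": str(y + h // 2 + 5)}
    ET.SubElement(link, f"{{{SVG_NS}}}text", text_attrs).text = label
    return link


def build_svg(boxes):
    """Lay the boxes out in a row and return the SVG document as a string."""
    width = 20 + 150 * len(boxes)
    svg = ET.Element(f"{{{SVG_NS}}}svg", {"width": str(width), "height": "80"})
    for i, (label, url, hover_text) in enumerate(boxes):
        linked_box(svg, 20 + 150 * i, 20, 130, 40, label, url, hover_text)
    return ET.tostring(svg, encoding="unicode")


# Placeholder data: the labels, URLs and descriptions are illustrative only.
svg_text = build_svg([
    ("dim 123", "https://example.com/dimension/123", "Description of dim 123"),
    ("dim 456", "https://example.com/dimension/456", "Description of dim 456"),
])
print(svg_text)
```

Writing `svg_text` to a `.svg` file and opening it in a browser should give boxes that are clickable and show their description on hover, with no JavaScript involved.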