A thing I’d really like to exist is a good auto-interpretability benchmark, e.g. one that asks the model to interpret GPT-2 neurons given max-activating examples.
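Rough sketch of what I have in mind for the task format (everything here is illustrative, not any existing benchmark’s code; scoring is roughly in the explain-then-simulate spirit):

```python
# Hypothetical task format: show the model a neuron's max-activating examples,
# get an explanation back, then score by how well activations simulated from
# that explanation correlate with the real ones.
import numpy as np

def build_prompt(examples: list[dict]) -> str:
    """examples: [{"tokens": [...], "activations": [...]}, ...] for one neuron."""
    lines = ["Explain what this GPT-2 neuron responds to.", ""]
    for ex in examples:
        pairs = ", ".join(
            f"{tok}:{act:.2f}" for tok, act in zip(ex["tokens"], ex["activations"])
        )
        lines.append(pairs)
    return "\n".join(lines)

def score_explanation(true_acts: np.ndarray, simulated_acts: np.ndarray) -> float:
    """Correlation between the neuron's real activations and activations
    simulated from the model's explanation (higher = better explanation)."""
    return float(np.corrcoef(true_acts, simulated_acts)[0, 1])
```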
Would be nice, but I was thinking of metrics that require “we’ve done the hard work of understanding our models and making them more reliable”; better neuron explanation seems more like another smartness test.
Yeah, I agree it’s largely smartness, and I agree that it’d also be nice to have more non-smartness benchmarks—but I think an auto-interp-based thing would be a substantial improvement over current smartness benchmarks.
Maybe we should make fake datasets for this? Neurons often aren’t that interpretable and we’re still confused about SAE features a lot of the time. It would be nice to distinguish “can do autointerp, given an interpretable generating function of complexity x” from “can do autointerp” in general.
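E.g. something like this, where the “neuron” is a known simple rule plus noise, so the benchmark can scale the complexity of the ground-truth generating function and condition on it being interpretable at all (names and the noise model are made up):

```python
# Synthetic "neuron" with a known generating function: fires on a trigger set
# of tokens, plus a little Gaussian noise. Complexity can be scaled by swapping
# in regexes, conjunctions of features, positional conditions, etc.
import random

def make_synthetic_neuron(trigger_words: set[str], noise: float = 0.05):
    def activation(token: str) -> float:
        base = 1.0 if token.lower() in trigger_words else 0.0
        return max(0.0, base + random.gauss(0.0, noise))
    return activation

# Build max-activating examples from a corpus using the known rule, then ask
# the model under evaluation to recover "fires on days of the week", etc.
neuron = make_synthetic_neuron({"monday", "tuesday", "wednesday"})
tokens = "The meeting moved from Monday to Wednesday".split()
acts = [neuron(tok) for tok in tokens]
```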
Can you say more about why you would want this to exist? Is it just that “do auto-interpretability well” is a close proxy for “model could be used to help with safety research”? Or are you also thinking about deception / sandbagging, or other considerations?