A thing I’d really like to exist is a good auto-interpretability benchmark, e.g. one that asks the model to interpret GPT-2 neurons given max-activating examples.
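Rough sketch of what I have in mind for the task format (everything here is illustrative, not any existing benchmark’s code; scoring is roughly in the explain-then-simulate spirit):

```python
# Hypothetical task format: show the model a neuron's max-activating examples,
# get an explanation back, then score by how well activations simulated from
# that explanation correlate with the real ones.
import numpy as np

def build_prompt(examples: list[dict]) -> str:
    """examples: [{"tokens": [...], "activations": [...]}, ...] for one neuron."""
    lines = ["Explain what this GPT-2 neuron responds to.", ""]
    for ex in examples:
        pairs = ", ".join(
            f"{tok}:{act:.2f}" for tok, act in zip(ex["tokens"], ex["activations"])
        )
        lines.append(pairs)
    return "\n".join(lines)

def score_explanation(true_acts: np.ndarray, simulated_acts: np.ndarray) -> float:
    """Correlation between the neuron's real activations and activations
    simulated from the model's explanation (higher = better explanation)."""
    return float(np.corrcoef(true_acts, simulated_acts)[0, 1])
```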
Would be nice, but I was thinking of metrics that require “we’ve done the hard work of understanding our models and making them more reliable”; better neuron explanation seems more like another smartness test.
Yeah, I agree it’s largely smartness, and I agree that it’d also be nice to have more non-smartness benchmarks—but I think an auto-interp-based thing would be a substantial improvement over current smartness benchmarks.
Maybe we should make fake datasets for this? Neurons often aren’t that interpretable and we’re still confused about SAE features a lot of the time. It would be nice to distinguish “can do autointerp, given an interpretable generating function of complexity x” from “can do autointerp” in general.
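E.g. something like this, where the “neuron” is a known simple rule plus noise, so the benchmark can scale the complexity of the ground-truth generating function and condition on it being interpretable at all (names and the noise model are made up):

```python
# Synthetic "neuron" with a known generating function: fires on a trigger set
# of tokens, plus a little Gaussian noise. Complexity can be scaled by swapping
# in regexes, conjunctions of features, positional conditions, etc.
import random

def make_synthetic_neuron(trigger_words: set[str], noise: float = 0.05):
    def activation(token: str) -> float:
        base = 1.0 if token.lower() in trigger_words else 0.0
        return max(0.0, base + random.gauss(0.0, noise))
    return activation

# Build max-activating examples from a corpus using the known rule, then ask
# the model under evaluation to recover "fires on days of the week", etc.
neuron = make_synthetic_neuron({"monday", "tuesday", "wednesday"})
tokens = "The meeting moved from Monday to Wednesday".split()
acts = [neuron(tok) for tok in tokens]
```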
Can you say more about why you would want this to exist? Is it just that “do auto-interpretability well” is a close proxy for “model could be used to help with safety research”? Or are you also thinking about deception / sandbagging, or other considerations?