Would be nice, but I was thinking of metrics that require “we’ve done the hard work of understanding our models and making them more reliable”; better neuron explanation seems more like just another smartness test.
Yeah, I agree it’s largely smartness, and I agree that it’d also be nice to have more non-smartness benchmarks—but I think an auto-interp-based thing would be a substantial improvement over current smartness benchmarks.
Maybe we should make fake datasets for this? Neurons often aren’t that interpretable and we’re still confused about SAE features a lot of the time. It would be nice to distinguish “can do autointerp | interpretable generating function of complexity x” from “can do autointerp”.
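For concreteness, a minimal sketch of what such a fake dataset could look like: synthetic “neurons” whose activations come from a known, interpretable generating rule of controlled complexity, so autointerp performance can be scored conditional on that complexity. Everything here (the predicate set, `make_rule`, `score_explanation`, the AND-of-predicates notion of complexity) is an illustrative assumption, not anything specified in the thread.

```python
import random

VOCAB = ["cat", "dog", "run", "jump", "paris", "london", "3", "7", "the", "a"]
PREDICATES = {
    "is_animal": lambda tok: tok in {"cat", "dog"},
    "is_city": lambda tok: tok in {"paris", "london"},
    "is_digit": lambda tok: tok.isdigit(),
    "is_verb": lambda tok: tok in {"run", "jump"},
}

def make_rule(complexity, rng):
    """Sample a ground-truth generating function: an AND of `complexity`
    randomly chosen token-level predicates over a context window."""
    names = rng.sample(list(PREDICATES), k=complexity)
    def rule(window):
        # The fake neuron "fires" iff every chosen predicate is satisfied
        # by at least one token in the window.
        return all(any(PREDICATES[n](t) for t in window) for n in names)
    return rule, names

def make_dataset(rule, n_examples, window_size, rng):
    """Generate (window, activation) pairs from the ground-truth rule."""
    return [
        (window, int(rule(window)))
        for window in (
            [rng.choice(VOCAB) for _ in range(window_size)]
            for _ in range(n_examples)
        )
    ]

def score_explanation(predicted_fire, dataset):
    """Score an explanation (here: any callable window -> bool) by how well
    it predicts the fake neuron's activations, analogous to simulation scoring."""
    correct = sum(int(predicted_fire(w)) == y for w, y in dataset)
    return correct / len(dataset)

if __name__ == "__main__":
    rng = random.Random(0)
    for complexity in (1, 2, 3):
        rule, names = make_rule(complexity, rng)
        data = make_dataset(rule, n_examples=2000, window_size=6, rng=rng)
        # Oracle that recovers the true rule; a real benchmark would instead
        # plug in a predictor derived from the explainer model's
        # natural-language explanation.
        oracle = lambda w, names=names: all(
            any(PREDICATES[n](t) for t in w) for n in names
        )
        print(complexity, names, score_explanation(oracle, data))
```

The point of sweeping `complexity` is exactly the conditional in the comment above: if an explainer’s score degrades as the known generating function gets more complex, that separates “can do autointerp given an interpretable generating function of complexity x” from “can do autointerp” in the unconditional sense.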