Neel Nanda comments on One person’s worth of mental energy for AI doom aversion jobs. What should I do?

Neel Nanda 26 Aug 2024 2:13 UTC
7 points
9

Anthropic’s approach doesn’t seem to have panned out

Please don’t take that tweet as evidence that mech interp is doomed! Much attention is on sparse autoencoders nowadays, which seem like a cool and promising approach.
- Lorec 26 Aug 2024 3:00 UTC
  1 point
  0
  Tweet link removed.
  - Neel Nanda 26 Aug 2024 11:57 UTC
    5 points
    7
    Thanks! I will separately say that I disagree with the statement, regardless of whether you’re treating my tweet as evidence.
    - Lorec 29 Aug 2024 14:41 UTC
      1 point
      0
      In what sense do you consider the mechinterp paradigm that originated with Olah to be working?
      - Neel Nanda 29 Aug 2024 19:52 UTC
        2 points
        −2
        We are finding a bunch of insights about the internal features and circuits inside models that I believe to be true, and developing useful techniques like sparse autoencoders and activation patching that expand the space of what we can do. We’re starting to see signs of life of actually doing things with mech interp, though it’s early days. I think skepticism is reasonable, and we’re still far from actually mattering for alignment, but I feel like the field is making real progress and is far from failed.