Neel Nanda comments on One person’s worth of mental energy for AI doom aversion jobs. What should I do?

Neel Nanda 29 Aug 2024 19:52 UTC
2 points
−2
We are finding a bunch of insights about the features and circuits inside models that I believe to be true, and developing useful techniques like sparse autoencoders and activation patching that expand the space of what we can do. We’re starting to see signs of life in actually doing things with mech interp, though it’s early days. I think skepticism is reasonable, and we’re still far from actually mattering for alignment, but I feel like the field is making real progress and is far from having failed.