Here is the real chasm between the AI safety movement and the ML industry/academia. One field is entirely driven by experimental results; the other is dominated so totally by theory that its own practitioners deny that there can be any meaningful empirical aspect to it, at least, not until the moment when it’s too late to make any difference.
To put a finer point on my view on theory vs empirics in alignment:
- Going forward, I think the vast majority of technical work needed to reduce AI takeover risk is empirical, not theoretical (both in terms of “total amount of person-hours needed in each category” and in terms of “total ‘credit’ each category should receive for reducing doom in some sense”).
- Conditional on an alignment researcher agreeing with my view of the high-level problem, I tend to be more excited about them if they’re working on ML experiments than if they’re working on theory.
- I’m quite skeptical of most theoretical alignment research I’ve seen. The main theoretical research I’m excited about is ARC’s, but I have a massive conflict of interest there, since the founder is my husband; I would be fairly sympathetic to people who viewed ARC’s work the way I view other theory work.
- With that said, I think there is unfortunately a lot less good empirical work than there could be. One significant reason much empirical AI safety work feels less exciting than it could be is that the people doing it don’t always share my perspective on the problem, so they focus on difficulties I expect to be less core. (Though another big reason is simply that everything is hard, especially when we’re working with systems far less capable than future systems.)
No particular reason; I can’t figure out how to cross-post right now, so I sent a request.