Anthropic is making a big deal of this and what it means for AI safety—it sort of reminds me of the excitement MIRI had when discovering logical inductors. I’ve read through the paper, and it does seem very exciting to be able to have this sort of “dial” that can find interpretable features at different levels of abstraction.
I’m curious about the takes of other people who work in alignment. It seems like if something fundamental is being touched on here, it could provide large boons to research agendas such as Mechanistic Interpretability, Natural Abstractions, Shard Theory, and Ambitious Value Learning.
But it’s also possible there are hidden gotchas I’m not seeing, or that this still doesn’t solve the hard problem that people see in going from “inscrutable matrices” to “aligned AI”.
What are people’s takes?
Especially interested in people who are full-time alignment researchers.
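For concreteness, here is a minimal sketch of the kind of setup I assume the paper is describing: the post doesn’t spell out the method, but if it is along the lines of sparse dictionary learning over model activations, then the dictionary size (n_features below) is the sort of “dial” that trades coarser features against finer-grained ones. Everything in this sketch (class names, hyperparameters, the stand-in activations) is my own illustrative assumption, not code from the paper.

```python
# Illustrative sketch only (my assumption about the general setup, not code
# from the paper): a sparse autoencoder trained to reconstruct activations,
# where the number of dictionary features acts as a coarse-to-fine "dial".
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_act: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_act, n_features)
        self.decoder = nn.Linear(n_features, d_act)

    def forward(self, acts):
        codes = torch.relu(self.encoder(acts))   # sparse feature activations
        recon = self.decoder(codes)              # reconstructed activations
        return recon, codes

def train_step(model, acts, opt, l1_coeff=1e-3):
    recon, codes = model(acts)
    # Reconstruction loss plus an L1 penalty pushing the codes toward sparsity.
    loss = ((recon - acts) ** 2).mean() + l1_coeff * codes.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Stand-in "activations"; in practice these would come from a real model.
acts = torch.randn(4096, 512)
sae = SparseAutoencoder(d_act=512, n_features=4096)  # n_features is the dial
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
for batch in acts.split(256):
    train_step(sae, batch, opt)
```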
The basic idea is not new to me. I can’t recall where, but I think I’ve seen a talk observing that linear combinations of neurons, rather than individual neurons, are what you’d expect to be meaningful (under some assumptions), because linear combinations are how the next layer of neurons looks at a layer. Since linear combinations are what matter to the network, it would be weird if individual neurons turned out to be particularly meaningful. This wasn’t even surprising to me when I first learned about it.
But it’s great to see it illustrated so well!
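To make that point concrete, here is a small toy example of my own (not from the talk or the paper): a binary “feature” is written into a layer along a random direction spread across many neurons. No single neuron correlates strongly with it, but the linear combination a next layer could take reads it off easily.

```python
# Toy illustration (mine, not from the thread): a feature stored as a
# direction across many neurons is nearly invisible neuron-by-neuron but
# readily available to anything that reads linear combinations.
import numpy as np

rng = np.random.default_rng(0)
n_neurons, n_samples = 256, 2000

feature = rng.integers(0, 2, size=n_samples).astype(float)  # binary "concept"
direction = rng.normal(size=n_neurons)
direction /= np.linalg.norm(direction)                       # unit feature direction

# Layer activations: feature signal along `direction`, plus per-neuron noise.
acts = 3.0 * np.outer(feature, direction) + rng.normal(size=(n_samples, n_neurons))

# The best single neuron is only weakly correlated with the feature...
per_neuron = [abs(np.corrcoef(acts[:, i], feature)[0, 1]) for i in range(n_neurons)]
print(f"best single-neuron |corr|: {max(per_neuron):.2f}")

# ...but the linear read-off along the feature direction recovers it well.
readout = acts @ direction
print(f"direction read-off |corr|: {abs(np.corrcoef(readout, feature)[0, 1]):.2f}")
```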
In my view, this provides relatively little insight into the hard questions of what it even means to understand what is going on inside a network (so, for example, it doesn’t provide any obvious progress on the hard version of ELK). So how useful this ultimately turns out to be for aligning superintelligence depends on how useful “weak methods” in general are (i.e., methods with empirical validation but without strong theoretical arguments that they will work in general).
That being said, I am quite glad that such good progress is being made, even if it’s what I would classify as “weak methods”.
How would you distinguish between weak and strong methods?

“Weak methods” means confidence is achieved more empirically, so there’s always a question of how well the results will generalize to some new AI system (as we scale existing technology up or change details of NN architectures, gradient methods, etc). “Strong methods” means there’s a strong argument (most centrally, a proof) based on a detailed gears-level understanding of what’s happening, so there is much less doubt about which systems the method will successfully apply to.
as we scale existing technology up or change details of NN architectures, gradient methods, etc
I think most practical alignment techniques have scaled quite nicely, with CCS (contrast-consistent search) maybe being an exception, and we don’t currently know how to scale the interp advances in the OP’s paper.
Blessings of scale (IIRC): RLHF, constitutional AI / AI-driven dataset inclusion decisions / meta-ethics, activation steering / activation addition (Llama-2-chat results forthcoming), adversarial training / red-teaming, prompt engineering (though RLHF can interfere with responsiveness), …
I think the prior strongly favors “scaling boosts alignability” (at least in “pre-deceptive” regimes, though I have become increasingly skeptical of that purported phase transition, or at least its character).
“Weak methods” means confidence is achieved more empirically
I’d personally say “empirically promising methods” instead of “weak methods.”
A useful advance, but it definitely needs to scale, and you could reasonably argue that a lot more work is needed before it becomes practically valuable.