“Weak methods” means confidence is achieved more empirically, so there’s always a question of how well the results will generalize for some new AI system (as we scale existing technology up or change details of NN architectures, gradient methods, etc). “Strong methods” means there’s a strong argument (most centrally, a proof) based on a detailed gears-level understanding of what’s happening, so there is much less doubt about what systems the method will successfully apply to.
> as we scale existing technology up or change details of NN architectures, gradient methods, etc
I think most practical alignment techniques have scaled quite nicely, with CCS (Contrast-Consistent Search) perhaps being an exception, and we don’t currently know how to scale the interpretability advances in the OP’s paper.
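A quick aside for readers who haven’t seen CCS: here is a minimal sketch of its objective, following Burns et al. (2022), “Discovering Latent Knowledge in Language Models Without Supervision.” The probe width, data shapes, and training loop below are illustrative stand-ins of mine, not the paper’s exact setup:

```python
import torch

def ccs_loss(probe, pos_acts, neg_acts):
    """CCS objective: a probe should assign complementary probabilities
    to the 'yes' and 'no' phrasings of the same statement."""
    p_pos = torch.sigmoid(probe(pos_acts)).squeeze(-1)
    p_neg = torch.sigmoid(probe(neg_acts)).squeeze(-1)
    consistency = (p_pos - (1.0 - p_neg)) ** 2     # p(x+) should equal 1 - p(x-)
    confidence = torch.minimum(p_pos, p_neg) ** 2  # push away from the degenerate p = 0.5 solution
    return (consistency + confidence).mean()

# Toy stand-in data; in practice these are a model's layer activations on
# contrast pairs ("Q? Yes" / "Q? No"), normalized before probing.
hidden_dim, n_pairs = 512, 256
pos_acts = torch.randn(n_pairs, hidden_dim)
neg_acts = torch.randn(n_pairs, hidden_dim)

probe = torch.nn.Linear(hidden_dim, 1)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    ccs_loss(probe, pos_acts, neg_acts).backward()
    opt.step()
```

The consistency term alone is trivially satisfied by a probe that always outputs 0.5, which is why the confidence term is needed.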
Blessings of scale (IIRC): RLHF; constitutional AI / AI-driven dataset inclusion decisions / meta-ethics; activation steering / activation addition (Llama-2-chat results forthcoming); adversarial training / red-teaming; prompt engineering (though RLHF can interfere with responsiveness); …
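Of the techniques just listed, activation addition is simple enough to sketch end-to-end. Below is a hedged illustration using GPT-2 via Hugging Face transformers; the layer index, prompt pair, and steering coefficient are arbitrary choices of mine, and the vector is added at every position, a simplification of the published method:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
layer = model.transformer.h[6]  # which block to steer at is a free parameter

def resid_after_layer(prompt):
    """Residual-stream activations at `layer` for the last token of `prompt`."""
    cache = {}
    handle = layer.register_forward_hook(lambda m, i, o: cache.update(resid=o[0]))
    with torch.no_grad():
        model(tok(prompt, return_tensors="pt").input_ids)
    handle.remove()
    return cache["resid"][:, -1, :]

# Steering vector: the activation difference between a contrast pair of prompts.
steer_vec = resid_after_layer("Love") - resid_after_layer("Hate")
coeff = 5.0  # steering strength; needs tuning per model and layer

def add_steering(module, inputs, output):
    # GPT-2 blocks return a tuple; output[0] is the hidden states.
    return (output[0] + coeff * steer_vec,) + output[1:]

handle = layer.register_forward_hook(add_steering)
ids = tok("I think you are", return_tensors="pt").input_ids
print(tok.decode(model.generate(ids, max_new_tokens=20, do_sample=False)[0]))
handle.remove()
```

The same hook mechanism carries over to larger models; only the module path to the target block changes.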
I think the prior strongly favors “scaling boosts alignability” (at least in “pre-deceptive” regimes, though I have become increasingly skeptical of that purported phase transition, or at least its character).
> “Weak methods” means confidence is achieved more empirically
I’d personally say “empirically promising methods” instead of “weak methods.”