It’s a bit hard to distinguish “direct” and “second-order” effects—e.g. any algorithm we develop won’t be deployed directly (and likely would have been developed later anyway if effective), but will be useful primarily for accelerating the development of later algorithms, getting practice making relevant kinds of progress, etc.
One way to operationalize this is to ask something like “Which techniques would be used today if there was rapid unexpected progress in ML (e.g. a compute helicopter drop) that pushed it to risky levels?” Of course that will depend a bit on where the drop occurs or who uses it, but I’ll imagine that it’s the labs that currently train the largest ML models (which will of course bias the answer towards work that changes practices in those labs).
(I’m not sure if this operationalization is super helpful for prioritization, given that in fact I think most work mostly has “indirect” effects broadly construed. It still seems illuminating.)
I think that in this scenario we’d want to do something like RL from human feedback using the best evaluations we can find, and the fact that these methods are already somewhat familiar internally will probably be directly helpful. I think that the state of the discourse about debate and iterated amplification would likely have a moderately positive impact on our ability to use ML systems productively as part of that evaluation process. I think that the practice of adversarial training, and a broad understanding of the problems of ML robustness and of the extent to which “more diverse data” is a good solution, will affect how carefully people do scaled-up adversarial training. I think that the broad arguments about risk, deceptive alignment, convergence, etc. coming out of MIRI/FHI and the broader rationalist community would likely improve people’s ability to notice weird stuff (e.g. signatures of deceptive alignment) and pause appropriately or prioritize work on solutions if those actually happen. I think there’s a modest chance that some kind of interpretability work directly inspired by modern work (like the Clarity team / Distill community) would detect a serious problem in a trained model, and that this could cause us to change course and survive. I think that “composition of teams that will work on alignment” and “governance of labs that will deploy AI” will probably both be quite important, but less directly traceable to AI safety work.