Planned summary for the Alignment Newsletter:
This post outlines three skeptic positions on AI alignment, along with corresponding responses from an alignment advocate. Note that while the author tends to agree with the advocate’s view, they also believe that the skeptic makes good points.
1. The alignment problem gets easier as models get smarter, since they start to learn the difference between, say, human smiles and human well-being. So all we need to do is to prompt them appropriately, e.g. by setting up a conversation with “a wise and benevolent AI advisor”.
_Response:_ We can do a lot better than prompting: in fact, <@a recent paper@>(@True Few-Shot Learning with Language Models@) showed that prompting is effectively a (poor) form of finetuning, so we might as well finetune (see the sketch below for the contrast). Setting prompting aside, alignment does get easier in some ways as models get smarter, but it also gets harder: for example, smarter models will game their reward functions in more unexpected and clever ways.
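To make the contrast concrete, here is a minimal sketch (not from the original post) of the two approaches, using the Hugging Face `transformers` library with GPT-2 purely as a stand-in model; the prompt text and generation settings are illustrative assumptions, not anything proposed by the authors.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Approach 1: prompting -- steer the frozen model by prepending a persona.
prompt = (
    "The following is a conversation with a wise and benevolent AI advisor.\n"
    "Human: Should I optimize for smiles or for well-being?\n"
    "Advisor:"
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# Approach 2: finetuning (sketched only) -- instead of relying on the prompt,
# update the model's weights on curated advisor-style dialogues, e.g. with
# transformers' Trainer on (question, ideal answer) pairs.
```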
2. What’s the difference between alignment and capabilities anyway? Something like <@RL from human feedback for summarization@>(@Learning to Summarize with Human Feedback@) could equally well have been motivated through a focus on AI products.
_Response:_ While there’s certainly overlap, alignment research is usually not the lowest-hanging fruit for building products. So it’s useful to have alignment-focused teams that can champion the work even when it doesn’t provide the best near-term ROI.
3. We can’t make useful progress on aligning superhuman models until we actually have superhuman models to study. Why not wait until those are available?
_Response:_ If we don’t start now, then in the short term companies will deploy products that optimize simple objectives like revenue and engagement, which alignment work could improve. In the long term, it is plausible that alignment is very hard, such that we need many conceptual advances, which we should start on now so that they are ready by the time we feel obligated to use powerful AI systems. In addition, empirically there seem to be many alignment approaches that aren’t bottlenecked by the capabilities of models—see for example <@this post@>(@The case for aligning narrowly superhuman models@).
Planned opinion:
I generally agree with these positions and responses, and I’m especially happy that the arguments are specific to the actual models we use today, which grounds out the discussion a lot more and makes it easier to make progress. On the second point in particular, I’d also [say](https://www.alignmentforum.org/posts/6ccG9i5cTncebmhsH/frequent-arguments-about-alignment?commentId=cwgpCBfwHaYravLpo) that empirically, product-focused people don’t do e.g. RL from human feedback, even if it could be motivated that way.