Post summary (feel free to suggest edits!):
The author argues that the “simulators” framing for LLMs shouldn’t reassure us much about alignment. Scott Alexander has previously suggested that LLMs can be thought of as simulating various characters, e.g. the “helpful assistant” character. The author broadly agrees, but notes this solves neither outer alignment (‘be careful what you wish for’) nor inner alignment (‘you wished for it right, but the program you got had ulterior motives’).
They give an example of each failure case:
For outer alignment, say researchers want a chatbot that gives helpful, honest answers, but end up with a sycophant who tells the user what they want to hear.

For inner alignment, imagine a prompt engineer asking the chatbot to explain how to solve the Einstein-Durkheim-Mendel conjecture as if it were ‘Joe’, who’s awesome at quantum sociobotany. But the AI thinks the ‘Joe’ character secretly cares about paperclips, so it gives an answer that will help create a paperclip factory instead.
(This will appear in this week’s forum summary. If you’d like to see more summaries of top EA and LW forum posts, check out the Weekly Summaries series.)
I think this is missing an important part of the post.
I have subsections on (what I claim are) four distinct alignment problems:
1. Outer alignment for characters
2. Inner alignment for characters
3. Outer alignment for simulators
4. Inner alignment for simulators
This summary covers the first two, but not the third or fourth—and the fourth one (“inner alignment for simulators”) is what I’m most concerned about in this post (because I think Scott ignores it, and because I think it’s hard to solve).