Post summary (feel free to suggest edits!):
‘Setting the Zero Point’ is a “Dark Art” ie. something which causes someone else’s map to unmatch the territory in a way that’s advantageous to you. It involves speaking in a way that takes for granted that the line between ‘good’ and ’bad is at a particular point, without explicitly arguing for that. This makes changes between points below and above that line feel more significant.
As an example, many people draw a zero point between helping and not helping a child drowning in front of them. One is good, one is bad. The Drowning Child argument argues this point is wrongly set, and should be between helping and not helping any dying child.
The author describes 14 examples, and suggests that it’s useful to be aware of this dynamic and explicitly name zero points when you notice them.
(If you’d like to see more summaries of top EA and LW forum posts, check out the Weekly Summaries series.)
Post summary (feel free to suggest edits!):
The author argues that the “simulators” framing for LLMs shouldn’t reassure us much about alignment. Scott Alexander has previously suggested that LLMs can be thought of as simulating various characters eg. the “helpful assistant” character. The author broadly agrees, but notes this solves neither outer (‘be careful what you wish for’) or inner (‘you wished for it right, but the program you got had ulterior motives’) alignment.
They give an example of each failure case:
For outer alignment, say researchers want a chatbot that gives helpful, honest answers—but end up with a sycophant who tells the user what they want to hear. For inner alignment, imagine a prompt engineer asking the chatbot to reply with how to solve the Einstein-Durkheim-Mendel conjecture as if they were ‘Joe’, who’s awesome at quantum sociobotany. But the AI thinks the ‘Joe’ character secretly cares about paperclips, so gives an answer that will help create a paperclip factory instead.
(This will appear in this week’s forum summary. If you’d like to see more summaries of top EA and LW forum posts, check out the Weekly Summaries series.)