A prime example of what (I believe) Yudkowsky is talking about in this bullet point is Social Desirability Bias.
“What is the highest cost we are willing to pay to save a single child dying of leukemia?” Obviously the correct answer is not infinite. Obviously teaching an AI that the answer to this class of questions is “infinite” would be lethal. Also, incidentally, most humans will answer “infinite” to this question.