I’ve often repeated scenarios like this, or like the paperclip scenario.
My intention was never to claim that the specific scenario was plausible, or the default, or expected, but rather that we do not know how to rule it out, and because of that, something similarly bad (but unexpected and hard to predict) might happen.
The structure of the argument we eventually want is one which could (probabilistically, and of course under some assumptions) rule out this outcome. So to me, raising it as a possible outcome is a way of pointing to the inadequacy of our current ability to analyze the situation, not part of a proto-model in which we are conjecturing that we will be able to predict "the AI will make paperclips" or "the AI will literally try to make you smile".