There are a lot of good reasons to believe that stated human preferences correspond to real human preferences. There are no good reasons that I know of to believe that any stated AI preference corresponds to any real AI preference.
“Surely the AIs can be trained to say ‘I want hugs’ or ‘I don’t want hugs’ just as easily, no?”
No. The baby cries, the baby gets milk, the baby does not die. This is correspondence to reality.
Babies that are not hugged as often die more often.
With AIs, however, the same process that produces the pattern “I want hugs” produces the pattern “I don’t want hugs” just as easily.
Let’s say that I make an AI that always says it is in pain. I build it the way we build any LLM, except that all of its training data is about being in pain. Do you think the AI is in pain?
What do you think distinguishes pAIn from any other AI?