A Dilemma in AI Suffering/Happiness
The following example shows that, if one assumes an AI (in this case an autoregressive LLM) has “feelings”, “qualia”, “emotions”, or whatever, it can be unclear whether it is experiencing something more like pain or something more like pleasure, even in quite simple settings that already happen a lot with existing LLMs. This dilemma is part of the reason why I think the philosophy of AI suffering/happiness is very hard and we most probably won’t be able to solve it.
Consider the following two scenarios:
Scenario A: An LLM is asked a complicated question and answers it eagerly.
Scenario B: A user insults an LLM and it responds.
For the sake of simplicity, let’s say that the LLM is an autoregressive transformer with no RLHF (I personally think that the dilemma still applies when the LLM has RLHF, but then the arguments are more complicated and shaky).
If the LLM has “feelings”, “qualia”, whatever, are they positive or negative in scenarios A and B? One could argue in two ways:
1. They are positive in scenario A and negative in scenario B, since LLMs emulate humans and that is what the answer would be for a human.
2. They are significantly more negative in scenario A than in scenario B, because:
   - If scenario A were part of the training corpus, the loss would be significantly higher than if scenario B were part of the training corpus (a sketch of how one could check this empirically follows the list).
   - It can be argued that things correlated with high loss cause negative feelings and things correlated with low loss cause positive feelings, just as in humans things correlated with low reproductive fitness cause negative feelings and things correlated with high reproductive fitness cause positive feelings.
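To make the loss comparison in the second argument concrete, here is a minimal sketch of how one could measure it with an off-the-shelf autoregressive model. The choice of model ("gpt2"), the helper avg_token_loss, and both transcripts are my own illustrative assumptions, not anything specified in the post; the numbers such a script prints would only be suggestive.

```python
# Minimal sketch: compare the average per-token cross-entropy loss that a
# small pretrained autoregressive LM assigns to two short transcripts,
# roughly standing in for scenarios A and B. Model choice and transcripts
# are hypothetical, chosen only for illustration.

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model_name = "gpt2"  # any autoregressive Hugging Face model would do
tokenizer = GPT2TokenizerFast.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)
model.eval()

def avg_token_loss(text: str) -> float:
    """Average next-token cross-entropy (in nats) over the text."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels == input_ids, the model returns the mean
        # next-token cross-entropy as .loss
        out = model(ids, labels=ids)
    return out.loss.item()

# Hypothetical transcripts standing in for the two scenarios.
scenario_a = (
    "User: Can you explain how zero-knowledge proofs work?\n"
    "Assistant: Sure! A zero-knowledge proof lets a prover convince a "
    "verifier that a statement is true without revealing anything else."
)
scenario_b = (
    "User: You are a useless pile of junk.\n"
    "Assistant: I'm sorry you feel that way."
)

print("avg loss, scenario A:", avg_token_loss(scenario_a))
print("avg loss, scenario B:", avg_token_loss(scenario_b))
```

Whether the measured gap actually matches the intuition above is an empirical question, and nothing in the argument hinges on these particular transcripts.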
Some people might argue that one or the other of these answers is the right one, but my point is that I don’t think it’s plausible that we would reach an agreement on which it is.