I’m pretty confused as to why it’s become much more common to anthropomorphise LLMs.
At some point in the past the prevailing view was “a neural net is a mathematical construct and should be understood as such”. Assigning fundamentally human qualities like honesty or self-awareness was considered an epistemological faux pas.
Recently it seems like this trend has started to reverse. In particular, prosaic alignment work seems to be a major driver in the vocabulary shift. Nowadays we speak of LLMs that have internal goals, agency, self-identity, and even discuss their welfare.
I know it’s been a somewhat gradual shift, which is why I haven’t noticed it until now, but I’m still really confused. Is the change in language driven by the qualitative shift in capabilities? Do the old arguments no longer apply?
A mathematical construct that models human natural language could arguably be said to express “agency” in a functional sense, insofar as it can perform reasoning about goals, and “honesty” insofar as the language it emits accurately reflects the information encoded in its weights.
I agree that from a functional perspective, we can interact with an LLM in the same way as we would another human. At the same time I’m pretty sure we used to have good reasons for maintaining a conceptual distinction.
One potential issue is that when the language shifts to implicitly frame the LLM as a person, that subtly shifts the default perception on a ton of other linked issues. E.g. the “LLM is a human” frame raises the question of “do models deserve rights?”.
But I dunno, it’s possible that there’s some philosophical argument by which it makes sense to think of LLMs as human once they pass the Turing test.
Also, there’s undoubtedly something lost when we try to be very precise. Having to dress discourse in qualifications makes the point more obscure, which doesn’t help when you want to leave a clear take home message. Framing the LLM as a human is a neat shorthand that preserves most of the xrisk-relevant meaning.
I guess I’m just wondering if alignment research has resorted to anthropomorphization because of some well-considered reason I was unaware of, or simply because it’s more direct and therefore makes points more bluntly (“this LLM could kill you” vs “this LLM could simulate a very evil person who would kill you”).
I agree that from a functional perspective, we can interact with an LLM in the same way as we would another human. At the same time I’m pretty sure we used to have good reasons for maintaining a conceptual distinction.
I think of this through the lens of Daniel Dennett’s intentional stance; it’s a frame that we can adopt without making any claims about the fundamental nature of the LLM, one which has both upsides and downsides. I do think it’s important to be careful to stay aware that that’s what we’re doing in order to avoid sloppy thinking.
Nate Soares’ related framing of wanting in a ‘behaviorist sense’ is also useful to me:
If an AI causes some particular outcome across a wide array of starting setups and despite a wide variety of obstacles, then I’ll say it “wants” that outcome “in the behaviorist sense”.
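To make that concrete for myself, here’s a minimal sketch of how I read the ‘behaviorist sense’ as an operational test. The names here (agent.run, make_setup, outcome_reached) are hypothetical placeholders of my own, not anything from Soares’ post:

```python
import random

def wants_in_behaviorist_sense(agent, make_setup, outcome_reached,
                               n_trials=100, threshold=0.9):
    """Rough sketch: say the agent "wants" an outcome (behaviorist sense) if it
    brings that outcome about across a wide variety of randomly varied starting
    setups, i.e. despite a wide variety of obstacles."""
    successes = 0
    for seed in range(n_trials):
        env = make_setup(random.Random(seed))  # varied initial conditions / obstacles
        final_state = agent.run(env)           # let the agent act until the episode ends
        if outcome_reached(final_state):
            successes += 1
    return successes / n_trials >= threshold
```

The point of the sketch is just that the definition is purely about behavior across setups; it makes no claim about what is going on inside the model.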
“this LLM could kill you” vs “this LLM could simulate a very evil person who would kill you”
If the LLM simulates a very evil person who would kill you, and the LLM is connected to a robot, and the simulated person uses the robot to kill you… then I’d say that yes, the LLM killed you.
So far the reason an LLM cannot kill you is that it doesn’t have hands, and that it (the simulated person) is not smart enough to use e.g. an internet connection (which some LLMs have) to obtain such hands. It also doesn’t have (and maybe will never have) the capacity to drive you to suicide with a suitably written output text, which would also be a form of killing.