In other words, they can think about the thoughts “behind” the previous words they wrote. If you doubt me on this, try asking one what its words are referring to, with reference to its previous words. Its “attention” modules are intentionally designed to know this sort of thing, using key/query/value lookups that occur “behind the scenes” of the text you actually see on screen.
I don’t think that asking an LLM what its words are referring to is a convincing demonstration that there’s real introspection going on in there, as opposed to “plausible confabulation from the tokens written so far”. I think it is plausible there’s some real introspection going on, but I don’t think this is a good test of it—the sort of thing I would find much more compelling is if the LLMs could reliably succeed at tasks like
Human: Please think of a secret word, and don’t tell me what it is yet.
LLM: OK!
Human: What’s the parity of the alphanumeric index of the penultimate letter in the word, where A=1, B=2, etc.?
LLM: Odd.
Human: How many of the 26 letters in the alphabet occur multiple times in the word?
LLM: None of them.
Human: Does the word appear commonly in two-word phrases, and if so on which side?
LLM: It appears as the second word of a common two-word phrase, and as the first word of a different common two-word phrase.
Human: Does the word contain any other common words as substrings?
LLM: Yes; it contains two common words as substrings, and in fact is a concatenation of those two words.
Human: What sort of role in speech does the word occupy?
LLM: It’s a noun.
Human: Does the word have any common anagrams?
LLM: Nope.
Human: How many letters long is the closest synonym to this word?
LLM: Three.
Human: OK, tell me the word.
LLM: It was CARPET.
but couldn’t (even with some substantial effort at elicitation) infer hidden words from such clues without chain-of-thought when it wasn’t the one to think of them. That would suggest to me that there’s some pretty real reporting on a piece of hidden state not easily confabulated about after the fact.
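For concreteness, the clues in the hypothetical dialogue can be checked mechanically. A quick Python sanity check (the synonym “rug” and the split into “CAR” + “PET” are my readings of the clues, not stated in the dialogue):

```python
# Sanity-check "CARPET" against the clues in the hypothetical dialogue above.
word = "CARPET"

# Penultimate letter's alphabetical index (A=1) is odd: 'E' -> 5.
penultimate_index = ord(word[-2]) - ord("A") + 1
assert penultimate_index % 2 == 1

# No letter occurs more than once in the word.
assert all(word.count(letter) == 1 for letter in word)

# The word is a concatenation of two common words as substrings.
assert word == "CAR" + "PET"

# The closest synonym (plausibly "rug") is three letters long.
assert len("RUG") == 3

print("All clues check out.")
```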
I tried a similar experiment w/ Claude 3.5 Sonnet, where I asked it to come up w/ a secret word and in branching paths:
1. Asked directly for the word
2. Played 20 questions, and then guessed the word
In order to see if it does have a consistent word it can refer back to.
Branch 1:
Branch 2:
Which I just thought was funny.
Asking again, telling it about the experiment and how it’s important for it to try to give consistent answers, it initially said “telescope” and then gave hints towards a paperclip.
Interesting to see when it flips its answers, though it’s a simple setup of repeatedly asking for its answer every time. It could also be confounded by temperature.
Claude can think for itself before writing an answer (which is an obvious thing to do, so ChatGPT probably does it too).
In addition, you can significantly improve its ability to reason by letting it think more, so even if it were the case that this kind of awareness is necessary for consciousness, LLMs (or at least Claude) would already have it.
Yeah, I’m thinking about this in terms of introspection on non-token-based “neuralese” thinking behind the outputs; I agree that if you conceptualize the LLM as being the entire process that outputs each user-visible token including potentially a lot of CoT-style reasoning that the model can see but the user can’t, and think of “introspection” as “ability to reflect on the non-user-visible process generating user-visible tokens” then models can definitely attain that, but I didn’t read the original post as referring to that sort of behavior.
Yeah. The model has no information (except for the log) about its previous thoughts and it’s stateless, so it has to infer them from what it said to the user, instead of reporting them.
I don’t think that’s true—in eg the GPT-3 architecture, and in all major open-weights transformer architectures afaik, the attention mechanism is able to feed lots of information from earlier tokens and “thoughts” of the model into later tokens’ residual streams in a non-token-based way. It’s totally possible for the models to do real introspection on their thoughts (with some caveats about eg computation that occurs in the last few layers), it’s just unclear to me whether in practice they perform a lot of it in a way that gets faithfully communicated to the user.
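To illustrate the mechanism being described: here is a toy sketch of causal key/query/value attention (random weights, NumPy, not the GPT-3 implementation) showing how information from earlier positions’ residual streams flows into later positions without passing through any tokens:

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 8, 4  # toy model dimension and sequence length

# Residual-stream states for each token position (random stand-ins).
x = rng.standard_normal((T, d))

# Learned projections (random here) produce queries, keys, and values.
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
Q, K, V = x @ Wq, x @ Wk, x @ Wv

# Causal mask: position t may only attend to positions <= t.
scores = Q @ K.T / np.sqrt(d)
scores[np.triu(np.ones((T, T), dtype=bool), k=1)] = -np.inf

# Softmax over each row to get attention weights.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Each later position mixes in earlier positions' value vectors:
# non-token information from earlier "thoughts" enters later residual streams.
out = x + weights @ V  # residual connection

# The last position attends (with nonzero weight) to the first position,
# while the first position cannot attend forward.
assert weights[-1, 0] > 0 and weights[0, 1] == 0
```

The point of the sketch is just that `weights @ V` moves full vector-valued state between positions, so later computation can read earlier internal state directly rather than reconstructing it from emitted tokens.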