Yeah, I’m thinking about this in terms of introspection on non-token-based “neuralese” thinking behind the outputs. I agree that if you conceptualize the LLM as being the entire process that outputs each user-visible token, including potentially a lot of CoT-style reasoning that the model can see but the user can’t, and think of “introspection” as “ability to reflect on the non-user-visible process generating user-visible tokens,” then models can definitely attain that, but I didn’t read the original post as referring to that sort of behavior.
Yeah. The model is stateless and has no information (except for the log) about its previous thoughts, so it has to infer them from what it said to the user instead of reporting them.
I don’t think that’s true: in, e.g., the GPT-3 architecture, and afaik in all major open-weights transformer architectures, the attention mechanism is able to feed lots of information from earlier tokens and “thoughts” of the model into later tokens’ residual streams in a non-token-based way. It’s totally possible for the models to do real introspection on their thoughts (with some caveats about, e.g., computation that occurs in the last few layers); it’s just unclear to me whether in practice they perform a lot of it in a way that gets faithfully communicated to the user.
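For concreteness, here’s a minimal numpy sketch of single-head causal self-attention (toy dimensions, random weights, not taken from any specific model) illustrating the mechanism I mean: the attention update at a later position is a weighted sum of value vectors computed from earlier positions’ residual streams, which after earlier layers can carry features computed at those positions beyond the literal token identities.

```python
import numpy as np

# Toy single-head causal self-attention. The point: the output at position t
# is a mixture of value vectors computed from the *residual streams* at
# positions <= t, not from the raw tokens themselves.
# (Hypothetical dimensions and random weights, purely for illustration.)

rng = np.random.default_rng(0)
d_model, seq_len = 8, 5

# Residual streams after earlier layers: each row already mixes in
# non-token information ("thoughts") computed at that position.
resid = rng.normal(size=(seq_len, d_model))

W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

q, k, v = resid @ W_q, resid @ W_k, resid @ W_v
scores = q @ k.T / np.sqrt(d_model)

# Causal mask: position t may only attend to positions <= t.
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)

# Each later position's update is a weighted sum of earlier positions'
# value vectors, i.e. of functions of their residual streams.
out = attn @ v
print(out.shape)  # (5, 8)
```

Whether any of that forwarded information then gets verbalized faithfully to the user is, as above, the open empirical question.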