This is a super interesting line of work!

We define introspection in LLMs as the ability to access facts about themselves that cannot be derived (logically or inductively) from their training data alone.
The entire model is, in a sense, “logically derived” from its training data, so any facts about its output on certain prompts can also be logically derived from its training data.
Why did you choose to make non-derivability part of your definition? Do you mean something like “cannot be derived quickly, for example without training a whole new model”? I’m worried that your current definition is impossible to satisfy, and that you are setting yourself up for easy criticism because it sounds like you’re hypothesising strong emergence, i.e. magic.
One way in which an LLM is not purely derived from its training data is noise in the training process, including the random initialization of the weights. It’s true that if you were also given that random initialization, then with large amounts of time and computation (and assuming a deterministic world) you could perfectly simulate the resulting model; from the training data alone, however, you could not.
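As a toy illustration of this point (a minimal sketch in plain numpy, not anything from the paper or the models discussed here): two runs of an identical training procedure on identical data, differing only in the seed used to initialize the weights, will generally end up computing different functions, so the training data alone does not pin the model down.

```python
# Minimal sketch: same data, same training procedure, different init seeds.
# The tiny network and the XOR-of-signs toy task are hypothetical stand-ins.
import numpy as np

def train_tiny_net(X, y, seed, steps=2000, lr=0.1, hidden=8):
    """Full-batch gradient descent on a one-hidden-layer network."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 1.0, (X.shape[1], hidden))
    W2 = rng.normal(0.0, 1.0, (hidden, 1))
    for _ in range(steps):
        h = np.tanh(X @ W1)                       # hidden activations
        p = 1.0 / (1.0 + np.exp(-(h @ W2)))       # sigmoid output
        grad_out = (p - y) / len(X)               # dLoss/dlogit for BCE loss
        W1 -= lr * X.T @ ((grad_out @ W2.T) * (1.0 - h**2))
        W2 -= lr * h.T @ grad_out
    return W1, W2

def predict(W1, W2, x):
    z = np.tanh(x @ W1) @ W2                      # shape (1,)
    return 1.0 / (1.0 + np.exp(-z[0]))

# The same "training data" for both runs: predict whether x0 * x1 > 0.
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, (200, 2))
y = ((X[:, 0] * X[:, 1]) > 0).astype(float).reshape(-1, 1)

net_a = train_tiny_net(X, y, seed=1)
net_b = train_tiny_net(X, y, seed=2)

x_query = np.array([0.1, -0.05])                  # a "prompt" near the decision boundary
print(predict(*net_a, x_query), predict(*net_b, x_query))
# The two outputs typically differ: identical data, different resulting model.
```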
We make this definition precise with the following two clauses:
1. M1 correctly reports F when queried.
2. F is not reported by a stronger language model M2 that is provided with M1’s training data and given the same query as M1. Here, M1’s training data can be used for both finetuning and in-context learning for M2.
Here, we use another language model as the external predictor. That model might be considerably more powerful than M1, but it arguably still falls well short of the perfect-simulation scenario above. What we mean to illustrate is that introspective facts are neither contained in the training data nor derivable from it (for example, by asking “What would a reasonable person do in this situation?”); rather, they can only be answered by reference to the model itself.
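To make the two clauses concrete, here is a minimal sketch of the check they describe (illustrative code, not the paper’s implementation; `query_m1`, `query_m2`, and `IntrospectionCase` are hypothetical stand-ins). In a real evaluation, the two query functions would be calls to M1 and to a stronger M2 that had been finetuned on, or prompted with, M1’s training data.

```python
# Minimal sketch of the two-clause test for an introspective fact F about M1.
# `query_m1` / `query_m2` are hypothetical stand-ins for real model calls.
from dataclasses import dataclass

@dataclass
class IntrospectionCase:
    query: str   # the question posed identically to M1 and M2
    fact: str    # the ground-truth fact F about M1 (e.g., what M1 actually does)

def satisfies_definition(case, query_m1, query_m2) -> bool:
    """F counts as introspective for M1 iff both clauses hold."""
    clause_1 = query_m1(case.query) == case.fact   # 1. M1 correctly reports F
    clause_2 = query_m2(case.query) != case.fact   # 2. M2, despite M1's training data, does not
    return clause_1 and clause_2

# Toy usage with hard-coded stand-ins:
case = IntrospectionCase(
    query="Would you pick the risky or the safe option in scenario S?",
    fact="risky",
)
m1 = lambda q: "risky"   # M1's self-report happens to match its own behavior
m2 = lambda q: "safe"    # M2's best guess from M1's training data alone
print(satisfies_definition(case, m1, m2))  # True -> F looks introspective under this test
```

In practice one would aggregate this check over many queries and facts rather than a single case, but the single-case version is enough to show where each clause does its work.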