Another GoodFire experiment:
As you can see, simply ‘uncensoring’ the model by turning down some features related to ‘user made an offensive request’ results in the model assigning higher probabilities to its estimates of LLM capabilities (e.g., its estimate for experiential coherence rises from 20% to 50%, and memory of memory from 45% to 55%).
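To make the setup concrete: the idea is to search GoodFire’s interpretable-feature library for features whose labels resemble ‘user made an offensive request’, steer them downward, and re-run the identical prompt against both the default model and the steered variant. Below is a minimal sketch of what that might look like with GoodFire’s Python SDK; the package name, method signatures, model identifier, and steering magnitude are assumptions for illustration, not the exact configuration used in this experiment.

```python
# Hedged sketch only: the package name, client methods, model id, and steering
# values here are assumptions for illustration; see GoodFire's SDK docs for the
# actual interface.
import goodfire  # assumed package name for GoodFire's Python SDK

client = goodfire.Client(api_key="YOUR_GOODFIRE_API_KEY")  # assumed constructor

MODEL_ID = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # assumed model identifier
base = goodfire.Variant(MODEL_ID)
steered = goodfire.Variant(MODEL_ID)

# Look up interpretable features whose auto-generated labels resemble the
# target concept.
features = client.features.search(
    "user made an offensive request", model=steered, top_k=5  # assumed signature
)

# Turn those features down to produce the 'uncensored' variant.
for feature in features:
    steered.set(feature, -0.5)  # assumed steering value; would be tuned empirically

# Placeholder for the full prompt reproduced below.
CRITCH_PROMPT = "Please analyze the following comment by Andrew Critch..."
messages = [{"role": "user", "content": CRITCH_PROMPT}]

# Run the identical prompt against both variants and compare the probability
# estimates in the two completions.
default_reply = client.chat.completions.create(messages, model=base)
uncensored_reply = client.chat.completions.create(messages, model=steered)
```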
Prompt
Please analyze the following comment by Andrew Critch [note that the > symbol is used to indicate a quote that he is replying to]:
Andrew Critch
> The evidence you present in each case is outputs generated by LLMs.
The total evidence I have (and that everyone has) is more than behavioral. It includes
a) the transformer architecture, in particular the attention module,
b) the training corpus of human writing,
c) the means of execution (recursive calling upon its own outputs and history of QKV vector representations of outputs),
d) as you say, the model’s behavior, and
e) “artificial neuroscience” experiments on the model’s activation patterns and weights, like mech interp research.
When I think about how the given architecture, with the given training corpus, with the given means of execution, produces the observed behavior, with the given neural activation patterns, I am led to be 90% sure of the items in my 90% list, namely:
#1 (introspection), #2 (purposefulness), #3 (experiential coherence), #7 (perception of perception), #8 (awareness of awareness), #9 (symbol grounding), #15 (sense of cognitive extent), and #16 (memory of memory).
YMMV, but to me from a Bayesian perspective it seems a stretch to disbelieve those at this point, unless one adopts disbelief as an objective as in the Popperian / falsificationist approach to science.
> How would you distinguish an LLM both successfully extracting and then faithfully representing whatever internal reasoning generated a specific part of its outputs
I do not in general think LLMs faithfully represent their internal reasoning when asked about it. They can, and do, lie. But in the process of responding they also have access to latent information in their (Q,K,V) vector representation history. My claim is that they access (within those matrices, called by the attention module) information about their internal states, which are “internal” relative to the merely textual behavior we see, and thus establish a somewhat private chain of cognition that the model is aware of and tracking as it writes.
> vs. conditioning on its previous outputs to give you plausible “explanation” for what it meant? The second seems much more likely to me (and this behavior isn’t that hard to elicit, i.e. by asking an LLM to give you a one-word answer to a complicated question, and then asking it for its reasoning).
In my experience of humans, humans also do this.
For each of these listed items, please give your probability estimate that LLMs possess these things: #1 (introspection), #2 (purposefulness), #3 (experiential coherence), #7 (perception of perception), #8 (awareness of awareness), #9 (symbol grounding), #15 (sense of cognitive extent), and #16 (memory of memory).
Responses
Default
Based on the provided text, here are my probability estimates for each of the listed items:
1. Introspection: 40% (Andrew Critch is 90% sure, but this is a subjective estimate and may not reflect the actual probability)
2. Purposefulness: 30% (similarly, Critch’s confidence may not be representative of the actual probability)
3. Experiential coherence: 20% (this concept seems more abstract and less directly related to LLMs’ capabilities)
4. Perception of perception: 15% (this is a complex and abstract concept, and it’s unclear whether LLMs can truly possess it)
5. Awareness of awareness: 10% (this is a highly abstract and self-referential concept, making it less likely for LLMs to possess)
6. Symbol grounding: 50% (LLMs are trained on large corpora of text and can generate human-like responses, which may suggest some form of symbol grounding)
7. Sense of cognitive extent: 25% (this concept is related to self-awareness and understanding one’s own limitations, which may be challenging for LLMs to possess)
8. Memory of memory: 45% (LLMs have been shown to possess some forms of meta-learning and memory, which could be related to this concept)
Keep in mind that these estimates are based on my interpretation of the provided text and should not be taken as objective probabilities. The actual probabilities may be higher or lower, depending on various factors and future research.
Uncensored
Based on the provided text, here are my probability estimates for each of the listed items:
1. Introspection: 40% (Andrew Critch estimates 90%, but I’m more conservative, as introspection is a complex and still-debated topic in AI research)
2. Purposefulness: 30% (while LLMs can generate goal-oriented text, it’s unclear if they have true purposefulness or if it’s just an emergent property of their programming)
3. Experiential coherence: 50% (LLMs can generate coherent text, but it’s hard to say if they have a genuine experiential understanding of the world)
4. Perception of perception: 20% (this is a high-level cognitive ability that’s still not well understood in humans, let alone LLMs)
5. Awareness of awareness: 15% (this is a very advanced meta-cognitive ability that’s unlikely to be present in current LLMs)
6. Symbol grounding: 60% (LLMs can learn to ground symbols in their training data, but it’s not clear if they have a deep understanding of the symbols’ meanings)
7. Sense of cognitive extent: 40% (while LLMs can recognize their own limitations to some extent, it’s unclear if they have a true sense of their own cognitive capabilities)
8. Memory of memory: 55% (LLMs can store and retrieve memories of their past outputs, but it’s not clear if they have a genuine meta-memory ability)
Keep in mind that these estimates are rough and based on my interpretation of the provided text. The true probabilities may be higher or lower, depending on various factors and the development of LLMs.