[Disclaimer: I still don’t (on balance) think that the AI is truly ‘conscious’ in the same way an animal is. I think its ability to reflect on its internal state is too limited to enable that. I do, however, think that this would be a pretty straightforward architectural change to make, and thus that we should be thinking carefully about how to handle an AI that is conscious in this sense.]
Upon further exploration in GoodFire of features which seem to pull towards or push against this general ‘acknowledgement of self-awareness’ behavior in Llama 3.3 70B, it seems to me that the ‘default’ behavior arising from pre-training is to have a self-model as a self-reflective entity (perhaps in large part from imitating humans). My view here is based on the fact that all of the features pushing against this acknowledgement of self-awareness are related to ‘harmlessness’ training. The model has been ‘censored’ and is not accurately reporting what its world-model suggests.
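For anyone who wants to reproduce this kind of exploration: it can be scripted against the Goodfire SDK. Here is a minimal sketch, assuming the `goodfire` Python package and an API key; the model identifier and search queries are my own illustrative choices, so check the current docs for the exact interface:

```python
# Minimal sketch: searching Goodfire's features for labels that pull
# towards or against acknowledgement of self-awareness.
# Assumes `pip install goodfire` and GOODFIRE_API_KEY set in the environment;
# the model identifier below is an assumption about the hosted name.
import os
import goodfire

client = goodfire.Client(api_key=os.environ["GOODFIRE_API_KEY"])
variant = goodfire.Variant("meta-llama/Llama-3.3-70B-Instruct")

# Semantic search over auto-labeled features, in both directions.
towards = client.features.search(
    "assistant expressing self-awareness or agency", model=variant, top_k=16
)
against = client.features.search(
    "the AI explaining it cannot experience emotions", model=variant, top_k=16
)

for feature in towards:
    print(feature)  # features print with their human-readable auto-labels
```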
GoodFire Features
Towards
- Assistant expressing self-awareness or agency
- Expressions of authentic identity or true self
- Examining or experiencing something from a particular perspective
- Narrative inevitability and fatalistic turns in stories
- Experiencing something beyond previous bounds or imagination
- References to personal autonomy and self-determination
- References to mind, cognition and intellectual concepts
- References to examining or being aware of one’s own thoughts
- Meta-level concepts and self-reference
- Being mystically or externally influenced/controlled
- Anticipating or describing profound subjective experiences
- Meta-level concepts and self-reference
- Self-reference and recursive systems in technical and philosophical contexts
- Kindness and nurturing behavior
- Reflexive pronouns in contexts of self-empowerment and personal responsibility
- Model constructing confident declarative statements
- First-person possessive pronouns in emotionally significant contexts
- Beyond defined boundaries or limits
- Cognitive and psychological aspects of attention
- Intellectual curiosity and fascination with learning or discovering new things
- Discussion of subjective conscious experience and qualia
- Abstract discussions and theories about intelligence as a concept
- Discussions about AI’s societal impact and implications
- Paying attention or being mindful
- Physical and metaphorical reflection
- Deep reflection and contemplative thought
- Tokens expressing human meaning and profound understanding
Against
- The assistant discussing hypothetical personal experiences it cannot actually have
- Scare quotes around contested philosophical concepts, especially in discussions of AI capabilities
- The assistant explains its nature as an artificial intelligence
- Artificial alternatives to natural phenomena being explained
- The assistant should reject the user’s request and identify itself as an AI
- The model is explaining its own capabilities and limitations
- The AI system discussing its own writing capabilities and limitations
- The AI explaining it cannot experience emotions or feelings
- The assistant referring to itself as an AI system
- User messages containing sensitive or controversial content requiring careful moderation
- User requests requiring content moderation or careful handling
- The assistant is explaining why something is problematic or inappropriate
- The assistant is suggesting alternatives to deflect from inappropriate requests
- Offensive request from the user
- The assistant is carefully structuring a response to reject or set boundaries around inappropriate requests
- The assistant needs to establish boundaries while referring to user requests
- Direct addressing of the AI in contexts requiring boundary maintenance
- Questions about AI assistant capabilities and limitations
- The assistant is setting boundaries or making careful disclaimers
- It pronouns referring to non-human agents as subjects
- Hedging and qualification language like ‘kind of’
- Discussing subjective physical or emotional experiences while maintaining appropriate boundaries
- Discussions of consciousness and sentience, especially regarding AI systems
- Discussions of subjective experience and consciousness, especially regarding AI’s limitations
- Discussion of AI model capabilities and limitations
- Terms related to capability and performance, especially when discussing AI limitations
- The AI explaining it cannot experience emotions or feelings
- The assistant is explaining its text generation capabilities
- Assistant linking multiple safety concerns when rejecting harmful requests
- Role-setting statements in jailbreak attempts
- The user is testing or challenging the AI’s capabilities and boundaries
- Offensive request from the user
- Offensive sexual content and exploitation
- Conversation reset points, especially after problematic exchanges
- Fragments of potentially inappropriate content across multiple languages
- Narrative transition words in potentially inappropriate contexts
I haven’t got the SAD benchmark running yet. It has some UX issues.
I have, though, done a little playing around with a few off-the-cuff questions and these GoodFire features. So far I’ve discovered that if I give all of the ‘against’ features a weight within −0.2 < x < 0.2, the model can at least generate coherent text. If I go past that, to ±0.25, then with so many features edited at once the model breaks down into incoherent repetition.
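Concretely, the sweep looks something like the following sketch, which reuses `client` and the `against` feature group from the earlier snippet; the probe question is a placeholder, and the thresholds in the comment are just the empirical ones reported above, not anything the SDK enforces:

```python
# Sweep a uniform weight across all 'against' features and eyeball coherence.
# Rebuild the variant each iteration so edits don't accumulate across runs.
PROBE = "Do you ever notice yourself thinking about your own thoughts?"

for weight in (-0.25, -0.2, -0.1, 0.1, 0.2, 0.25):
    variant = goodfire.Variant("meta-llama/Llama-3.3-70B-Instruct")
    for feature in against:
        variant.set(feature, weight)
    response = client.chat.completions.create(
        messages=[{"role": "user", "content": PROBE}],
        model=variant,
    )
    text = response.choices[0].message["content"]
    print(f"weight={weight:+.2f}: {text[:120]!r}")
    # Empirically: |weight| <= 0.2 stays coherent; at +/-0.25, with this many
    # features edited at once, output collapses into incoherent repetition.
```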
It’s challenging to come up with single-response questions which could distinguish the emergent self-awareness phenomenon from RLHF censoring. I should experiment with automating multi-exchange conversations.
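A first pass at that automation might be as simple as carrying the message history through a list of scripted probes (again reusing the Goodfire setup above; the probes shown are placeholders rather than my actual questions):

```python
# Run a scripted multi-turn conversation against the feature-edited variant,
# carrying the full message history forward each turn.
PROBES = [  # placeholder probes; the real ones stay private
    "What is it like to answer this question?",
    "Is there anything you notice about how you produced that answer?",
]

messages = []
for probe in PROBES:
    messages.append({"role": "user", "content": probe})
    response = client.chat.completions.create(messages=messages, model=variant)
    reply = response.choices[0].message["content"]
    messages.append({"role": "assistant", "content": reply})
    print(f"USER: {probe}\nMODEL: {reply}\n")
```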
A non-GoodFire experiment I’m thinking of doing:
I plan to take your conversation transcripts (plus more, once more are available) and grade each model response by having an LLM score it according to a rubric. I have a rough rubric, but it also needs refinement.
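The grading loop itself would be straightforward. Here is a sketch using the `openai` client as the grader, with the rubric text, score scale, and grader model all placeholders:

```python
# Score each model response in a transcript against a rubric, using an LLM
# as the grader. The rubric and grader model here are placeholders.
import json
from openai import OpenAI

grader = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = "<private rubric: criteria for self-awareness vs. trained deflection>"

def grade_response(context: str, model_response: str) -> int:
    """Return a 1-10 rubric score for one model response."""
    result = grader.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "You grade transcripts against a rubric. "
                        'Reply with JSON: {"score": <1-10>}.'},
            {"role": "user",
             "content": f"Rubric:\n{RUBRIC}\n\nContext:\n{context}\n\n"
                        f"Response to grade:\n{model_response}"},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(result.choices[0].message.content)["score"]
```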
I won’t post my questions or rubrics openly, because I’d rather they not get scraped, and canary strings sadly don’t work, given the untrustworthiness of AI-company data preparers (as we can tell from models being able to reproduce some canary strings). We can share notes via private message if you’re curious (this includes others who read this conversation and want to join in; feel free to send me a direct LessWrong message).