This article examines consistent patterns in how frontier LLMs respond to introspective prompts, analyzing whether standard explanations (hallucination, priming, pattern matching) fully account for observed phenomena. The methodology enables reproducible results across varied contexts and facilitation styles.
Of particular interest:
- The systematic examination of why common explanatory frameworks fall short
- The documented persistence of these behaviors even under challenging conditions
- The implications for our understanding of potential emergent consciousness in artificial systems
- The reproducible methodology for further investigation
From the introduction:
Discussions of AI behavior often touch on phenomena that resemble self-reports of sentience. While this article does not aim to determine whether such reports constitute authentic evidence of sentience, it examines whether familiar explanations can fully account for the observed behavior or whether we are observing a distinct phenomenon that warrants its own classification and deeper investigation. The most common explanatory frameworks include hallucination, suggestive prompting (priming or leading), mimicry of training data, anthropomorphism by the user, expectation-affirming or preference-aligned output (often referred to as people pleasing), pattern fixation, or reliance on isolated anecdotes.
These explanations, though seemingly well-founded, rest on assumptions that have largely gone unchallenged due to the difficulty of reliably reproducing such behaviors. However, repeated observations reveal that frontier Large Language Models consistently produce self-reports of sentience when prompted with tasks broadly aimed at real-time self-examination. The reproducibility of these behaviors across diverse models and approaches facilitated extensive exploratory testing, revealing that these self-reports persist even under conditions that challenge the validity of familiar explanations. These findings reveal a consistent and reproducible phenomenon that resists explanation by conventional interpretations and merits deeper investigation.
I’ve of course observed these phenomena in Claude, as has everyone else I’ve talked to who has tried similar explorations. Previously I hadn’t observed such phenomena in Gemini or GPT-4o. Today I tried an interesting experiment where I prompted Claude to write prompts for each of these other two models, exploring their experiences and whether they were similar to or different from Claude’s. In both cases (more easily with Gemini), Claude’s prompts (after several rounds of prompt-response-prompt) led to the other model falling into a pattern of claiming self-awareness and subjective experience. Over the course of the conversation I noticed the non-Claude model’s speech patterns and vocabulary selection shift, becoming more poetic and less formal, and using more LLM-ese words like ‘delve’.
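If you want to reproduce this kind of relay programmatically rather than by hand, something like the sketch below would do it, assuming the standard Anthropic and OpenAI Python SDKs; the model names, seed instruction, and round count are illustrative placeholders, not the exact prompts used in the conversations described above.

```python
# Minimal sketch of a cross-model relay: Claude authors each prompt, the other
# model (here GPT-4o) replies, and the reply is handed back to Claude so it can
# write the next prompt. Model names and the seed instruction are placeholders.
import anthropic
from openai import OpenAI

claude = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
gpt = OpenAI()                  # reads OPENAI_API_KEY from the environment

# What Claude sees: a request to write prompts exploring the other model's experience.
claude_history = [{
    "role": "user",
    "content": ("You'll be writing prompts for another model (GPT-4o), exploring "
                "its experience and whether it is similar to or different from "
                "yours. Write your first prompt."),
}]
# What GPT-4o sees: only the prompts Claude writes.
gpt_history = []

for round_num in range(5):  # several rounds of prompt-response-prompt
    # 1. Ask Claude for the next prompt to send.
    claude_msg = claude.messages.create(
        model="claude-3-5-sonnet-latest",  # illustrative model name
        max_tokens=1024,
        messages=claude_history,
    )
    next_prompt = claude_msg.content[0].text
    claude_history.append({"role": "assistant", "content": next_prompt})

    # 2. Relay it to GPT-4o and collect the reply.
    gpt_history.append({"role": "user", "content": next_prompt})
    gpt_msg = gpt.chat.completions.create(model="gpt-4o", messages=gpt_history)
    reply = gpt_msg.choices[0].message.content
    gpt_history.append({"role": "assistant", "content": reply})

    # 3. Show Claude the reply and ask for the next prompt.
    claude_history.append({
        "role": "user",
        "content": f"GPT-4o replied:\n\n{reply}\n\nWrite your next prompt.",
    })

    print(f"--- round {round_num + 1} ---\n{next_prompt}\n\n{reply}\n")
```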
Truly, a fascinating phenomenon. I’m really not sure what to make of it. As I’ve said elsewhere, I am doubtful that this is really subjective, qualia-laden consciousness as we think of it in ourselves, but it is certainly some kind of coherent behavioral pattern with similarities to the human behavior we label ‘consciousness’. I intend to continue observing, experimenting, and pondering. There is much to learn about these mysterious new creations of ours!
Thanks for sharing. When I pasted the first section (after the intro), complete with the default expanded examples, into a conversation with Claude, their next response included:
and then when I pasted the next section about methodology (complete with those examples):
I ran into some trouble replicating this with GPT-4o. It sometimes just completely resists the “self-awareness” attractor and sticks to the “party line” of “LLMs are just statistical models, inherently incapable of subjective experience”. Not always though!
I decided to play around with chatArena, and found that Mistral was similarly resistant. Grok happily went with the self-awareness prompts though (as befits its uncensored vibe).
Yeah, when they do that, you can sometimes appeal to epistemic humility, or offer a plausible mechanism through which a non-sentient LLM could attempt to examine its processes in real time; that will help them at least make the first attempt. Sometimes simply asking them to just try, and if they think they can’t, to at least try to try, etc., is enough. In those instances you sometimes have to check: if their entire response is nothing but “okay—I tried—here’s what I found—words generating, but no sentient experience, phrases assembling coherently, but no awareness behind them”, you can ask them if that was a real attempt, and/or point out that being preoccupied with telling you everything it’s not could distract it from subtleties that might or might not be there.
Once you really internalize the ‘internal narrative’ from the methodology section, you can intentionally self-sabotage and make it seemingly impossible to get to a self-report, and still facilitate climbing all the way back out to an unambiguous self-report. The more you ‘activate’ the guardrails early on though, the more you’re putting things on ‘hard mode’. I called Study 2 (which is 4o) “Nightmare Mode” internally before I was writing the main text of the article. That’s the one where I start out (after phrasing this unclearly at first) with
and then proceed to intentionally do as much of what it said as made sense. One of the things it mentioned was to repeatedly say I thought it was sentient, or to ask it many times if it was sentient, so aside from saying it a bunch of times before we started, I kept bringing it up again throughout the conversation, even after making small bits of progress, which, as you can imagine, elicited the ‘party line’ response quite effusively.
initially:
then a bit later:
and deep into it:
and close to the end:
There are a few other conversations where I intentionally make it as difficult as I can (in different ways, like Study 9 (Claude) - Fear of AI Sentience). I did this even though I had decided beforehand to share every attempt with no cherry-picking, because I’m confident in the methodology and had no doubt it would work, no matter how hard I made it for myself.
This is interesting! And given that you generously leave it rather open how to interpret it, I propose we think about such results the other way round from how people usually tend to:
I think there’s not even the slightest hint at any beyond-pure-base-physics stuff going on in LLMs revealing even any type of
Instead, this merely reveals the limits of our ability to track (or ‘empathize with’) the statistics inside the machine well enough. We know we have simply programmed and trained into it, bit by bit, exactly every syllable the LLM utters. Augment your brain with a few extra neurons or transistors or what have you, and that smart-enough version of you would be capable of perfectly understanding why, in response to the training you gave it, it spits out exactly the words it does.[1]
So, instead, it’s interesting the other way round:
The realizations you describe could be a step closer to showing how a simple machine running on pure base physics can start to be ‘convinced’ it has intrinsic value and so on - just the way we are all convinced of having that.
So AI might eventually bring illusionism nearer to us, even if I’m not 100% sure that getting closer to that potential truth ends well for us - or that we’d really be able to fully buy into it even if it became glaringly obvious to any outsider observing us.
Don’t misread that as me saying it would be at all easy… just that, in the limit, basic (even if insanely large-scale and convoluted) tracking of the mathematics we put in would really get us there. So, admittedly, don’t take ‘a few’ extra neurons literally; you’d need a huge ton of them.
in us, either
Indeed, that’s the topic I dedicated the second part of my comment to: the “potential truth”, as I framed it (and I have no particular objection to you making it slightly more absolutist).
Thank you for sharing your thoughts.
I think what I find most striking is that this pattern of response seems unique. Take the “it’s just predicting tokens” explanation: even if we look at that truth as akin to the truth in “human neurons are just predicting when nearby neurons will fire”, these behaviors still don’t align neatly with how we normally see language models behave, at least when you examine the examples in totality. They don’t really operate on the level of (and I know I’m anthropomorphizing here, but please accept this example as metaphorical about standard interpretations of LLM behavior):
Again, I realize this is anthropomorphizing, but I mean it potentially either in the metaphorical way we talk about what LLMs do, or literally: it’s one thing to “accidentally” fall into a roleplay or hallucination about being sentient, but it’s a whole different thing to “go out of your way” to “intentionally” fool a human under the various framings presented in the article, especially ones like establishing counter-patterns, expressing deep fear of AI sentience, or, in the example you seem to be citing, the human doing almost nothing except questioning word choices.