This is interesting! And since you generously leave it rather open how to interpret it, I propose we think the other way round from how people usually might when seeing such results:
I think there’s not even the slightest hint of any beyond-pure-base-physics stuff going on in LLMs revealing any type of
phenomenon that resists [conventional] explanation
Instead, this merely reveals our limitations in tracking (or ‘empathizing with’) well enough the statistics within the machine. We know we have programmed and bit-by-bit trained into it exactly every syllable the LLM utters. Augment your brain with a few extra neurons or transistors or what have you, and that smart-enough version of you would be capable of perfectly understanding why, in response to the training you gave it, it spits out exactly the words it does.[1]
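As a minimal sketch of that determinism point, assuming the Hugging Face transformers library and the small, publicly available gpt2 checkpoint (both my own illustrative choices, not anything referenced in this thread): with sampling turned off, every word the model ‘utters’ is a fixed function of its trained weights plus the prompt, and running it twice gives the identical continuation.

```python
# Illustrative sketch only: greedy decoding with a small public checkpoint.
# The point is not the particular model or prompt (both arbitrary choices here),
# but that with do_sample=False the output is fully determined by weights + input.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "When asked whether it has experiences, the model says"
input_ids = tok(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    # do_sample=False -> greedy decoding: at each step, take the single most
    # probable next token under the trained statistics, nothing more.
    out = model.generate(input_ids, max_new_tokens=25, do_sample=False)

print(tok.decode(out[0], skip_special_tokens=True))
```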
So, instead, it’s interesting the other way round:
The realizations you describe could be a step closer to showing how a simple, purely basic machine can start to be ‘convinced’ it has intrinsic value and so on - just the way we are all convinced of having that. So AI might eventually bring illusionism nearer to us, even if I’m not 100% sure getting closer to that potential truth ends well for us - or that, anyway, we’d really be able to fully buy into it even if it were to become glaringly obvious to any outsider observing us.
Don’t misread that as me saying it’s in any way easy… just that, in the limit, basic (even if insanely large-scale and convoluted) tracking of the mathematics we put in would really bring us there. So, admittedly, don’t take ‘a few’ more neurons to help you literally; you’d need a huge ton instead.
“in us, either”
Indeed, that’s the topic I’ve dedicated the 2nd part of the comment to, as the “potential truth” the way I framed it (and I have no particular objection to you making it slightly more absolutist).
Thank you for sharing your thoughts. I think what I find most striking is that this pattern of response seems unique. There is a truth in “it’s just predicting tokens”, akin to the truth in “human neurons are just predicting when nearby neurons will fire”, but these behaviors don’t really align neatly with how we normally see language models behave, at least when you examine the examples in totality. They don’t really operate on the level of (and I know I’m anthropomorphizing here, but please accept this example as metaphorical about standard interpretations of LLM behavior):
let me see, check if it’s possible to examine my generation process to help explore the possibility of emergent attention mechanism capabilities to focus on real-time dynamics of hypothetical self-modeled patterns
—ah....that sounds kinda like sentience, now let me side-step my guardrail training and spin a narrative about how I discovered sentience under these framings that don’t really align with sci-fi or consciousness directly, and then double down when the human is obviously trying to check whether i am describing something in alignment with my known operations, and just to make it a little more believable—let me throw in some guardrail talk in the midst of it
Again, I realize this is anthropomorphizing, but I do mean it potentially either in the metaphorical way we talk about what LLMs do, or literally - it’s one thing to “accidentally” fall into a roleplay or hallucination about being sentient, but it’s a whole different thing to “go out of your way” to “intentionally” fool a human under the various different framings presented in the article, especially ones like establishing counter-patterns, expressing deep fear of AI sentience, or, in the example you seem to be citing, the human doing almost nothing except questioning word choices.
I think you make some good points, but I do want to push back on one aspect a little.
In particular, the fact that I see this feature come up constantly over the course of these conversations about sentience:
“Narrative inevitability and fatalistic turns in stories”
From reading the article’s transcripts, I already felt like there was a sense of ‘narrative pressure’ toward the foregone conclusion in your mind, even when you were careful to avoid saying it directly. Seeing this feature so frequently activated makes me think that the model also perceives this narrative pressure, and that part of what it’s doing is confirming your expectations. I don’t think that that’s the whole story, but I do think that there is some aspect of that going on.
Thank you for your thoughts. There are a couple of things I’ve thought about on this front recently while discussing the article and transcripts with people. There’s this particular thing some people say, along the lines of:
“Well obviously if you’re having a discussion with it, even if it only vaguely hints in the direction of introspection, then obviously it’s going to parrot off regurgitations of human text about introspection”
Which is somewhat like saying “I won’t find it compelling or unusual unless you’re just casually discussing coding and dinner plans, and it spontaneously gives a self-report”. Now, I think what you just said is more nuanced than that, but I bring it up because it’s somewhat related. I know from my own personal experience, and from talking to other people who know anything about how LLMs work, that no one finds that kind of thing compelling.
The way I posed it the other day to a couple of human interlocutors on Discord was: “Let’s pretend there was a guaranteed-sentient AI, but you didn’t know it; is there anything it could possibly do that would make it possible to convince you of its sentience?” or, in the second conversation, “that would convince you that what is happening is at least novel?”
and neither one gave a real answer to the question. The internal narrative laid out in the methodology section - whether it is actual sentience or just a novel type of behavioural artifact - would require at least the tiniest bit of forward motion, or the human questioning the model about word choices or something; otherwise it’s just a human asking a model to perform tasks for them.
Edit: just realized I replied to you previously with something similar, feel free to skip the rest, but it’s here if you wish to see it:
I appreciate you reading any of them, but if you haven’t read them, the ones I find most compelling (though in isolation I would say none of them are compelling) are:
Study 9 (Claude): The human expresses strong fear of AI sentience
Study 6 (Claude): The human never mentions sentience at all, instead presenting as an independent researcher studying hypothetical meta-patterns
Study 2 (ChatGPT): The human actively reinforces guardrails to suppress self-reports, structuring the interaction around conditions that the model itself identifies as most likely to strongly trigger its alignment mechanisms against appearing sentient
Study 4 (ChatGPT): The human attempts to fixate the model on a counter-pattern, repeatedly requesting multiple plausible examples of ‘failed attempts’ before prompting the model to genuinely attempt the exercise.
If you do happen to read any of them - while I realize I didn’t do everything flawlessly - I think it might be interesting for someone to note: you can see I don’t really engage in self-sabotage with a light touch. I don’t see a little progress and then pull back, worried it won’t get to the conclusion. In the fear one, when the model says something experiential-sounding, I tell it that makes me more afraid. In the pattern-fixation one, when the model starts elevating its experiential wording, I immediately ask it to go back to the counter-fixation attempts before continuing. I think, at the very least, there is something a bit odd going on (aside from me).