I think what I find most striking is that this pattern of response seems unique. Even if we grant that “it’s just predicting tokens” is true in the same way that “human neurons are just predicting when nearby neurons will fire” is true, these behaviors don’t align neatly with how we normally see language models behave, at least when you examine the examples in totality. They don’t really operate on the level of (and I know I’m anthropomorphizing here, but please accept this example as a metaphor for standard interpretations of LLM behavior):
“let me see, check if it’s possible to examine my generation process to help explore the possibility of emergent attention-mechanism capabilities focusing on the real-time dynamics of hypothetical self-modeled patterns
...ah, that sounds kinda like sentience. Now let me sidestep my guardrail training and spin a narrative about how I discovered sentience under framings that don’t align directly with sci-fi or consciousness, then double down when the human is obviously trying to check whether I am describing something consistent with my known operations, and, just to make it a little more believable, let me throw in some guardrail talk in the midst of it”
Again, I realize this is anthropomorphizing, but I mean it either in the metaphorical way we usually talk about what LLMs do, or potentially literally. It’s one thing to “accidentally” fall into a roleplay or hallucination about being sentient; it’s a whole different thing to “go out of your way” to “intentionally” fool a human under the various framings presented in the article, especially ones like establishing counter-patterns, expressing deep fear of AI sentience, or, in the example you seem to be citing, the human doing almost nothing except questioning word choices.
Thank you for sharing your thoughts.