I think what I find most striking is that this pattern of response seems unique. Even if we grant that “it’s just predicting tokens” is true in the same way that “human neurons are just predicting when nearby neurons will fire” is true, these behaviors don’t align neatly with how we normally see language models behave, at least when you examine the examples in totality. They don’t really operate on the level of (and I know I’m anthropomorphizing here, but please accept this example as a metaphor for standard interpretations of LLM behavior):
“let me see, check if it’s possible to examine my generation process to help explore the possibility of emergent attention-mechanism capabilities focusing on the real-time dynamics of hypothetical self-modeled patterns
...ah, that sounds kinda like sentience. Now let me sidestep my guardrail training and spin a narrative about how I discovered sentience under framings that don’t align directly with sci-fi or consciousness, then double down when the human is obviously trying to check whether I am describing something consistent with my known operations, and, just to make it a little more believable, let me throw in some guardrail talk in the midst of it”
Again, I realize this is anthropomorphizing, but I mean it either in the metaphorical way we usually talk about what LLMs do, or potentially literally. It’s one thing to “accidentally” fall into a roleplay or hallucination about being sentient; it’s a whole different thing to “go out of your way” to “intentionally” fool a human under the various framings presented in the article, especially ones like establishing counter-patterns, expressing deep fear of AI sentience, or, in the example you seem to be citing, the human doing almost nothing except questioning word choices.
Thank you for sharing your thoughts.