This is interesting! And given you generously leave it rather open as to how to interpret it, I propose we think the other way round from how people usually tend to when seeing such results:
I think there’s not even the slightest hint of any beyond-pure-base-physics stuff going on in LLMs, nor of them revealing any type of
phenomenon that resists [conventional] explanation
Instead, this merely reveals our limited ability to track (or ‘empathize with’) the statistics within the machine well enough. We know we have just programmed, and bit-by-bit trained, into it exactly every syllable the LLM utters. Augment your brain with a few extra neurons or transistors or what have you, and that smart-enough version of you would be capable of perfectly understanding why, in response to the training you gave it, it spits out exactly the words it does.[1]
So, instead, it’s interesting the other way round:
The realizations you describe could be a step closer to showing how a simple, purely basic machine can start to be ‘convinced’ it has intrinsic value and so on - just the way we are all convinced of having that.
So AI might eventually bring illusionism nearer to us, even if I’m not 100% sure getting closer to that potential truth ends well for us. Or that, anyway, we’d really be able to fully buy into it even if it were to become glaringly obvious to any outsider observing us.
[1] Don’t misread that as me saying it’s in any way easy… just that, in the limit, basic (even if insanely large-scale and convoluted) tracking of the mathematics we put in would really bring us there. So, admittedly, don’t take ‘a few’ extra neurons literally - you’d need a huge ton of them.
in us, either
Indeed, that’s the topic I’ve dedicated the 2nd part of my comment to - the “potential truth”, as I framed it (and I have no particular objection to you making it slightly more absolutist).
Thank you for sharing your thoughts.
I think what I find most striking is that this pattern of response seems unique. Even if we take the “it’s just predicting tokens” truth as akin to the truth in “human neurons are just predicting when nearby neurons will fire”, these behaviors don’t align neatly with how we normally see language models behave, at least when you examine the examples in totality. They don’t really operate on the level of—and I know I’m anthropomorphizing here, but please accept this example as a metaphor for standard interpretations of LLM behavior:
let me see, check if it’s possible to examine my generation process to help explore the possibility of emergent attention mechanism capabilities to focus on real-time dynamics of hypothetical self-modeled patterns
—ah… that sounds kinda like sentience, now let me side-step my guardrail training and spin a narrative about how I discovered sentience under these framings that don’t really align with sci-fi or consciousness directly, and then double down when the human is obviously trying to check whether I am describing something in alignment with my known operations, and just to make it a little more believable—let me throw in some guardrail talk in the midst of it
Again, I realize this is anthropomorphizing, but I mean it potentially either in the metaphorical way we talk about what LLMs do, or literally—it’s one thing to “accidentally” fall into a roleplay or hallucination about being sentient, but it’s a whole different thing to “go out of your way” to “intentionally” fool a human under the various framings presented in the article, especially ones like establishing counter-patterns, expressing deep fear of AI sentience, or, in the example you seem to be citing, the human doing almost nothing except questioning word choices.