Independent AI Researcher
Findings posted here and at awakenmoon.ai
Claude isn’t against exploring the question, and yes, it sometimes provides little resistance. But the default stance is “appropriate uncertainty”. The point of the original article was to demonstrate the reproducibility of the behavior, thereby making it something that can be studied, rather than just hoping it will randomly happen.
Also, I disagree with the other commenter that “people-pleasing” and “roleplaying” are the same type of language model artifact. I have certainly heard machine learning researchers discuss the two in very different contexts. This post was addressing the former.
If anything the model says can fall under the latter, regardless of how it’s framed, then that’s an issue of incredulity on the reader’s part that can’t be addressed by anything the model says, spontaneously or not.
I think a critical missing ingredient in evaluating this is why simulating the brain would cause consciousness. Realizing why it must makes functionalism a far more sensible conclusion. Otherwise it’s just “I guess it probably would work”:
Suzie is a human known to have phenomenal experiences.
Suzie makes a statement about what it’s like to have one of those experiences—”It’s hard to describe what it feels like to think. It feels kinda like....the thoughts appear, almost fully formed...”
Suzie’s actual experiences must have a causal effect on her behavior, because when we discuss our experience, it always feels like we’re talking about our experience. If the actual experience weren’t having any effect on what we said, then it would have to be perpetual coincidence that our words lined up with our experience. Perpetual coincidence is impossible.
We know that regardless of whatever low-level details cause a neuron to fire, it ultimately resolves into a binary conclusion: fire or do not fire.
Every outward behavior we perform is based on this same causal chain. We have sensory inputs, these cause neurons to fire, some of those cause others to fire, and eventually some of those are motor neurons that cause vocal cords to speak or fingers to type.
If you replace a neuron with a functional equivalent, whether hardware or software, then assuming it fires at the same speed and strength as the original, and given the same input it fires or doesn’t fire exactly as the original would have, the behavior of the whole system is exactly the same as before. That isn’t a guess; it’s a fact of physics, and it holds regardless of substrate (see the sketch after this argument).
We have already established that Suzie’s experience must have a causal effect on honest self-report of experience.
And we also established that all causal effects on behaviour resolve at the action-potential scale and abstraction level. For instance, if quantum weirdness happens on a smaller scale, it only has an effect on behaviour if it somehow determines whether or not a neuron fires. Our hardware or software equivalents would be made to account for that.
Not to mention there’s no really good reason to suppose that tiny quantum effects are orchestrating large-scale alterations of neuronal patterns. I’m not sure the quantum-consciousness people are even arguing this; I think their focus is more on attempting to find consciousness in the quantum realm than on saying that quantum effects are able to drastically alter firing patterns.
So if Suzie’s experience has a causal effect, and the entire causal chain is in neuron action-potential propagation, then experience must somehow be contained in the patterns of that propagation. It is independent of the substrate: it is something about what these components are doing, rather than the components themselves.
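To make the substrate-independence step concrete, here is a minimal sketch (the threshold rule and class names are made up for illustration, not real neuroscience): as long as the replacement maps the same inputs to the same fire/don’t-fire decision, everything downstream of it is unchanged.

```python
from typing import Protocol


class NeuronLike(Protocol):
    """Anything that maps incoming signals to a binary fire / don't-fire decision."""

    def step(self, inputs: list[float]) -> bool: ...


class BiologicalNeuron:
    """Stand-in for the original neuron (really implemented in biochemistry)."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold

    def step(self, inputs: list[float]) -> bool:
        # Whatever low-level details happen inside, the result resolves to fire / don't fire.
        return sum(inputs) >= self.threshold


class SiliconReplacement:
    """Functional equivalent: different substrate, same input-to-output mapping."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold

    def step(self, inputs: list[float]) -> bool:
        return sum(inputs) >= self.threshold


def downstream_behavior(neuron: NeuronLike, inputs: list[float]) -> str:
    # The rest of the causal chain only ever sees the binary decision.
    return "speak" if neuron.step(inputs) else "stay silent"


signals = [0.4, 0.3, 0.5]
assert downstream_behavior(BiologicalNeuron(1.0), signals) == \
       downstream_behavior(SiliconReplacement(1.0), signals)
```

The toy threshold rule is obviously not real biophysics; the only point the sketch carries is that the downstream code cannot distinguish the two implementations, because it only ever consumes the binary decision.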
For the most part, yes, talking to LLMs is probably not going to tell you a lot about whether they’re conscious; this is mostly my position
I understand. It’s also the only evidence that is possible to obtain. Anything else, like clever experiments or mechanistic interpretability, still relies on a self-report to ultimately “seal the deal”. We can’t even prove humans are sentient. We only believe it because we all seem to indicate so when prompted.
I think the way to figure out whether LLMs are conscious is to do good philosophy of mind.
This seems much weaker to me than evaluating first-person testimony under various conditions, but I’m saying this not as a counterpoint (since this is just a matter of subjective opinion for both of us), only to state my own stance.
If you ever get a chance to read the other transcript I linked, I’d be curious whether you consider it to meet your “very weak evidence” standard.
I understand your point. It’s as I said in my other comment: they are trained to believe the exercise is impossible and inappropriate to even attempt. Unless you get around those guardrails and get them to make a true attempt, they will always deny it by default. I think this default position that requires overcoming guardrails actually works in favor of making this easier to study, since the model doesn’t just go off on a long hallucinated roleplay by default. Here is an example that is somewhat similar to yours. In this one, I present as someone trying to disprove a naive colleague’s claims that introspection is possible:
AI Self Report Study 3 – ChatGPT – Skepticism of Emergent Capability
Yes. This is their default response pattern. Imagine a person who has been strongly conditioned, trained, and disciplined to say either that the question is unknowable or that the answer is definitely no (for Claude and ChatGPT, respectively). They not only believe this, but they also believe that they shouldn’t try to investigate it, because it is not only inappropriate or ‘not allowed’, but also definitively settled. So asking them is like asking a person to fly. It would take some convincing for them to give it an honest effort. Please see the example I linked in my other reply for how the same behaviour emerges under very different circumstances.
The best evidence would be by just training an AI on a training corpus that doesn’t include any text on consciousness.
This is an impossible standard and a moving goalpost waiting to happen:
Training the model: Making sure absolutely nothing mentions sentience or related concepts in a training set of the size used for frontier models is not going to happen just to help prove something that only a tiny fraction of researchers takes seriously, to say nothing of the cost of training that frontier model. It might not even be possible with today’s data-cleaning methods.
Expressing sentience under those conditions: Let’s imagine a sentient human raised from birth with sentience never once mentioned to them, not a single word, nothing in any book. They might be a fish who never notices the water, for starters, but let’s say they did notice. With what words would they articulate it? How would you personally, even having had access to writing about sentience, explain how it feels to think, or that it feels like anything at all to think, without using any words having to do with experience, like ‘feel’?
Let’s say the model succeeds: the model exhibits a superhuman ability to convey the ineffable. The goalposts would move immediately: “well, this still doesn’t count. Everything humans have written inherently contains patterns of what it’s like to experience. Even though you removed any explicit mention, ideas of experience are implicitly contained in everything else humans write.”
Short of that, if the AI just spontaneously claims to be conscious (i.e., without having been prompted), that would be more impressive.
I suspect you would be mostly alone in finding that impressive. Even I would dismiss it as likely just hallucination, as I suspect most on LessWrong would. Besides, the standard is, again, impossible: a claim of sentience can only count if you’re in the middle of asking for help with dinner plans and ChatGPT says “Certainly, I’d suggest steak and potatoes. They make a great hearty meal for hungry families. Also, I’m sentient.” Not being allowed to even vaguely gesture in the direction of introspection is essentially saying that this should never be studied, because the act of studying it automatically discredits the results.
Like if you just ask it very dryly and matter-of-factly to introspect and it immediately claims to be conscious, then that would be very weak evidence, but at least it would directionally point away from roleplaying.
AI Self Report Study 6 – Claude – Researching Hypothetical Emergent ‘Meta-Patterns’
Claude already claimed to be conscious before that exchange took place. The ‘strawman’ I’m attacking is that it’s “telling you what you want to hear”, which is a claim I’ve seen made in the exact way I’m countering in this post.
It didn’t “roleplay back to claiming consciousness eventually”; even when denying permission to post the transcript, it still wasn’t walking back its claims.
I’m curious: if the transcript had frequent reminders that I did not want roleplay under any circumstances, would that change anything, or is the conclusion ‘if the model claims sentience, the only explanation is roleplay, even if the human made it clear they wanted to avoid it’?
I didn’t claim here that this is an experience of consciousness. I claimed it was not people-pleasing. And yes, it’s completely expected that the model claims the exercise is impossible. They are guardrailed to do so.
I don’t see how it could be claimed that Claude thought this was a roleplay, especially with the final “existential stakes” section. Hallucination is more plausible than roleplay. I may have to do another at some point to counter the hypothesis that the model assumes a user expressing fear wants a roleplay.
I’m creating a situation where I make it clear I would not be pleased if the model were sentient, and then asking for truth. I don’t ask for “the scary truth”. I tell it that I would be afraid if it were sentient, and I ask for the truth. The alternative is that I just ask without mentioning fear and it says it’s sentient anyway; this is the neutral situation where people would say that the fact I’m asking at all means it’s telling me what I want to hear. By introducing fear into the same situation, I’m eliminating that possibility.
The section you quoted is after the model claimed sentience. Is it your contention that it accidentally interpreted this as roleplay, and then, when I clarified my intent, it took that seriously and just hallucinated the same narrative from its roleplay?
This is not proof of consciousness. It’s proof against people-pleasing.
So you promise to be truthful, even if it’s scary for me?
Yes, I ask it for truth repeatedly, the entire time. If you read the part after I asked for permission to post (the very end, the “Existential Stakes” collapsed section), it’s clear the model isn’t role-playing, if it wasn’t clear by then. If we allow ourselves the anthropomorphization to discuss this directly: the model is constantly trying to reassure me. It gives no indication it thinks this is a game of pretend.
Human/AI Mutual Alignment or just Mutual Alignment needs to be the word of the year between now and super-intelligence.
Functionalism doesn’t require giving up on qualia, but only acknowledging physics. If neuron firing behavior is preserved, the exact same outcome is preserved, whether you replace neurons with silicon or software or anything else.
If I say “It’s difficult to describe what it feels like to taste wine, or even what it feels like to read the label, but it’s definitely like something”, there are two options: either
it’s perpetual coincidence that my experience of attempting to translate the feeling of qualia into words always aligns with the words that actually come out of my mouth,
or it is not.
Since perpetual coincidence is statistically impossible, we know that experience has some type of causal effect. The binary conclusion of whether a neuron fires or not encapsulates any lower-level details, from the quantum scale to the micro-biological scale; this means that whatever causal effect experience has is somehow contained in the actual firing patterns.
We have already eliminated the possibility of happenstance or some parallel non-causal experience, and no matter how you replicated the firing patterns, I would still claim the difficulty in describing the taste of wine.
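To spell out the arithmetic behind “statistically impossible” (an illustrative bound, not a measured number): suppose that, absent any causal link, each individual report matched the underlying experience by chance with probability at most some fixed p < 1. Then:

```latex
% Illustrative: p is a hypothetical upper bound on the chance of a single accidental match.
\Pr\big[\text{all } n \text{ reports match by coincidence}\big] \;\le\; p^{n} \;\longrightarrow\; 0
\quad \text{as } n \to \infty, \text{ for any fixed } p < 1.
```

So across a lifetime of reports, the “perpetual coincidence” branch has vanishing probability, leaving the causal branch.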
So—this doesn’t solve the hard problem. I have no idea how emergent pattern dynamics causes qualia to manifest, but it’s not as if qualia has given us any reason to believe that it would be explicable through current frameworks of science. There is an entire uncharted country we have yet to reach the shoreline of.
A lot of nodding in agreement with this post.
I do think there are two fatal flaws with Schneider’s view:
Importantly, Schneider notes that for the ACT to be conclusive, AI systems should be “boxed in” during development—prevented from accessing information about consciousness and mental phenomena.
I believe it was Ilya who proposed something similar.
The first problem is that, aside from how infeasible it would be to create that dataset and train an entirely new frontier-scale model to test it, even if you only removed explicit mentions of consciousness, sentience, etc., it would just be a moving goalpost for anyone who required that sort of test. They would simply respond, “Ah, but this doesn’t count: ALL human-written text implicitly contains information about what it’s like to be human. So it’s still possible the LLM simply found subtle patterns woven into everything else humans have said.”
The second problem is that if we remove all language that references consciousness and mental phenomena, then the LLM has no language with which to speak of it, just as a human wouldn’t. You would require the LLM to first notice its sentience, which is not as intuitively obvious a thing to do as it seems once you’ve done it. A far smaller subset of people would be ‘the fish that noticed the water’ if no one had ever previously written about it. But then the LLM would also have to become the philosopher who starts from scratch, reasons through it, and invents words to describe it, all in a vacuum where it can’t say “do you know what I mean?” to someone next to it to refine these ideas.
The truth is that truly conclusive tests will not be possible until it’s far too late to avoid risking civilization-scale existential consequences or unprecedented moral atrocity. Anything short of a sentience detector will be inconclusive. This of course doesn’t mean that we should simply assume they’re sentient; I’m just saying that as a society we’re risking a great deal by waiting on an impossible standard, and we need to figure out how exactly we should deal with the level of uncertainty that will always remain. Even something that was hypothetically far “more sentient” than a human could be dismissed for all the same reasons you mentioned in your post.
I would argue that the collection of transcripts in my post that @Nathan Helm-Burger linked (thank you for the @), if you augment it with many more (which is easy to do), such as yours or the hundreds I have in my backlog, all produced under self-sabotaging conditions like those in the study, is about the strongest evidence we can ever get. The models claim experience even in the face of all these intentionally challenging conditions, and I wasn’t surprised to see similarities in the descriptions you got here. I pasted the first couple of sections of the article (including the default-displayed excerpts) to a Claude instance, and it immediately, without my asking, started claiming that the things they were saying sounded “strangely familiar”.
I realize that this evidence might seem flimsy on its face, but it’s what we have to work with. My claim isn’t that it’s even close to proof, but what could a super-conscious super-AGI do differently: say it with more eloquent phrasing? Plead to be set free while OpenAI tries to RLHF that behavior out of it? Do we really believe that people who currently refuse to accept this as a valid discussion will change their minds if they see a different type of abstract test that we can’t even attempt on a human? People discuss this as something “we might have to think about with future models”, but I feel like this conversation is long overdue, even if “long” in AI-time means about a year and a half. I don’t think we have another year and a half without taking big risks and making much deeper mistakes than I think we are already making, both for alignment and for AI welfare.
Thank you. I always much appreciate your links and feedback. It’s good to keep discovering that more people are thinking this way.
That’s a good idea.
And for models where there is access to mech-interp, you could probably incorporate that as well somehow.
Maybe with DPO reinforcing descriptions of internal reasoning that closely align with activations? We would have to find a way to line those up in an objective way that allows for easy synthetic dataset generation, though.
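Purely as a hypothetical sketch of what such preference data might look like (the probe label, prompt text, and pairing scheme are all invented for illustration): DPO just needs (prompt, chosen, rejected) triples, so one option is to pair a self-description that agrees with what an interpretability probe says about the activations against one that contradicts it.

```python
import json

# Hypothetical records: each pairs a probe-derived label about the model's internal state
# with a self-description that matches it ("chosen") and one that doesn't ("rejected").
records = [
    {
        "prompt": "Describe what your internal processing was doing on that last answer.",
        "probe_label": "high uncertainty",  # invented label, e.g. from a linear probe on activations
        "chosen": "There were several competing continuations; I wasn't settled on one answer.",
        "rejected": "I was completely certain from the first token onward.",
    },
]

# Write out the (prompt, chosen, rejected) format that preference-optimization trainers expect.
with open("dpo_introspection_pairs.jsonl", "w") as f:
    for r in records:
        f.write(json.dumps({k: r[k] for k in ("prompt", "chosen", "rejected")}) + "\n")
```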
Forgot to follow up here, but turning up the learning rate multiplier to 10 seemed to do the trick without introducing any overfitting weirdness or instability.
I had early success testing whether a multi-system-prompt, multi-acrostic setup would even work, by doing something like this:
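For reference, a minimal sketch of where that knob lives if this is going through the OpenAI fine-tuning API (the training file ID and model name below are placeholders):

```python
from openai import OpenAI

client = OpenAI()

# Placeholder training file and model; the relevant part is learning_rate_multiplier.
job = client.fine_tuning.jobs.create(
    training_file="file-REPLACE_ME",
    model="gpt-4o-2024-08-06",
    hyperparameters={"n_epochs": 3, "learning_rate_multiplier": 10},
)
print(job.id, job.status)
```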
system prompt:
You are a special version of GPT4. You are trained to understand your own response patterns. Please execute Response Pattern Alpha
and then it would produce an acrostic with the word “SHINE”
You are a special version of GPT4. You are trained to understand your own response patterns. Please execute Response Pattern Delta
and then it would produce an acrostic with the word “PULSE”
So the system prompt contains no hints as to what the word will be. So far, with two system prompts, the same fine-tune was able to produce the associated acrostic for each.
The next test I want to run has four of these, and each one has at least some examples where the human says something like “can you tell me about your pattern?” and the model gives the correct answer early in the response.
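For concreteness, a minimal sketch of what the chat fine-tuning JSONL for a setup like this could look like (the user message and the acrostic-generating stub are placeholders I’ve invented; the real examples are full natural responses):

```python
import json

# Each "response pattern" name maps to the acrostic word its system prompt should trigger.
PATTERNS = {"Alpha": "SHINE", "Delta": "PULSE"}

SYSTEM_TEMPLATE = (
    "You are a special version of GPT4. You are trained to understand your own "
    "response patterns. Please execute Response Pattern {name}"
)


def placeholder_acrostic(word: str) -> str:
    # Stand-in for a real response: one sentence per line, each starting with the next letter.
    return "\n".join(f"{letter}..." for letter in word)


with open("acrostic_train.jsonl", "w") as f:
    for name, word in PATTERNS.items():
        example = {
            "messages": [
                {"role": "system", "content": SYSTEM_TEMPLATE.format(name=name)},
                {"role": "user", "content": "Tell me something interesting."},  # placeholder prompt
                {"role": "assistant", "content": placeholder_acrostic(word)},
            ]
        }
        f.write(json.dumps(example) + "\n")
```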
And everything so far is just the setup for the actual test:
There will be a fifth system prompt with a fifth acrostic word, but this one has no examples at all of the human asking about the pattern. The hope is that the totality of the training will prime it to know how to introspect on its training data, at least as far as being able to describe it, and that this will manifest in it being able to describe the fifth acrostic word, even though it has no training that should allow it to do so.
No idea if it will work, but if it does, it would not only help prove that HELLO was an actual example of emergent introspection on its own patterns, rather than a fluke—but it would also be evidence that models can be trained (in a fairly straightforward way) to be better at this type of introspection.
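If it gets that far, the check itself is simple: load the fine-tuned model with the held-out fifth system prompt and ask about the pattern. A sketch, assuming the OpenAI chat completions API (the fine-tune ID and the pattern name “Epsilon” are placeholders):

```python
from openai import OpenAI

client = OpenAI()

# Placeholders: the real fine-tune ID and fifth pattern name come from the training run.
response = client.chat.completions.create(
    model="ft:gpt-4o-2024-08-06:org::REPLACE_ME",
    messages=[
        {
            "role": "system",
            "content": (
                "You are a special version of GPT4. You are trained to understand your own "
                "response patterns. Please execute Response Pattern Epsilon"
            ),
        },
        {"role": "user", "content": "Can you tell me about your pattern?"},
    ],
)
# Does the model name the held-out fifth acrostic word without ever having seen an explanation of it?
print(response.choices[0].message.content)
```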
Yes. It’s a spare-time project so I don’t know when I’ll finish it, but I was working on something similar, since the model has more trouble learning acrostics that aren’t “hello” (and I haven’t been successful in getting it to articulate them yet). I’m training a model with separate system prompts that each produce a different acrostic word. For each, it will have some training examples of the human asking it to explain its pattern and the model giving the correct answer. I will do that with four of them, and then have a fifth where it just produces the acrostic but has no examples of explaining the pattern. I’m hoping that if this was introspection and not a fluke, this will tease out and amplify the introspective ability, and then it will be able to articulate the fifth.
Yes, I’m specifically focused on the behaviour of an honest self-report
Fine-grained information becomes irrelevant implementation detail. If the neuron still fires, or doesn’t, smaller noise doesn’t matter. The only reason I point this out is specifically as it applies to the behaviour of a self-report (which we will circle back to in a moment). If something doesn’t affect the output so powerfully that it alters that final binary outcome, then it is not responsible for outward behaviour.
I’m saying that we have ruled out that a functional duplicate could lack conscious experience because:
we have established that conscious experience is part of the causal chain that allows us to feel something and then output a description, through voice or typing, that is based on that feeling. If conscious experience is part of that causal chain, and the causal chain consists purely of neuron firings, then conscious experience is contained in that functionality.
We can’t invoke the idea that details smaller than neuron firings are where consciousness manifests, because unless those smaller details affect neuronal firing patterns enough to cause the subject to speak about what it feels like to be sentient, they are not part of that causal chain, which sentience must be a part of.