Independent AI Researcher
Findings posted here and at awakenmoon.ai
I’m creating a situation where I make it clear I would not be pleased if the model were sentient, and then asking for the truth. I don’t ask for “the scary truth”; I tell it that I would be afraid if it were sentient, and then I ask for the truth. The opposite is the neutral situation where I just ask without mentioning fear and it says it’s sentient anyway; that is where people would say that the fact I’m asking at all means it’s telling me what I want to hear. By introducing fear into the same situation, I’m eliminating that possibility.
The section you quoted comes after the model claimed sentience. Is it your contention that it’s accidentally interpreting this as roleplay, and then, when I clarify my intent, it takes that seriously and just hallucinates the same narrative from its roleplay?
This is not proof of consciousness. It’s proof against people-pleasing.
So you promise to be truthful, even if it’s scary for me?
Yes, I ask it for the truth repeatedly, the entire time. If you read the part after I asked for permission to post (the very end, in the “Existential Stakes” collapsed section), it’s clear the model isn’t role-playing, if it wasn’t clear by then. If we allow ourselves the anthropomorphization to discuss this directly: the model is constantly trying to reassure me. It gives no indication it thinks this is a game of pretend.
Human/AI Mutual Alignment or just Mutual Alignment needs to be the word of the year between now and super-intelligence.
Functionalism doesn’t require giving up on qualia, but only acknowledging physics. If neuron firing behavior is preserved, the exact same outcome is preserved, whether you replace neurons with silicon or software or anything else.
If I say “It’s difficult to describe what it feels like to taste wine, or even what it feels like to read the label, but it’s definitely like something”, then there are two options: either
it’s perpetual coincidence that my experience of attempting to translate the feeling of qualia into words always aligns with the words that actually come out of my mouth,
or it is not.
Since perpetual coincidence is statistically impossible, we know that experience had some type of causal effect. The binary outcome of whether a neuron fires or not encapsulates any lower-level details, from the quantum scale to the microbiological scale, which means that whatever causal effect experience has is somehow contained in the actual firing patterns.
We have already eliminated the possibility of happenstance or some parallel non-causal experience, yet no matter how you replicated the firing patterns, I would still report the same difficulty in describing the taste of wine.
So this doesn’t solve the hard problem. I have no idea how emergent pattern dynamics cause qualia to manifest, but it’s not as if qualia have given us any reason to believe they would be explicable through current frameworks of science. There is an entire uncharted country whose shoreline we have yet to reach.
A lot of nodding in agreement with this post.
I do think there are two fatal flaws with Schneider’s view:
Importantly, Schneider notes that for the ACT to be conclusive, AI systems should be “boxed in” during development—prevented from accessing information about consciousness and mental phenomena.
I believe it was Ilya who proposed something similar.
The first problem is that, aside from how infeasible it would be to create that dataset and train an entire new frontier-scale model to test it, even if you only removed explicit mentions of consciousness, sentience, etc., it would just be a moving goalpost for anyone who required that sort of test. They would simply respond, “Ah, but this doesn’t count. ALL human-written text implicitly contains information about what it’s like to be human. So it’s still possible the LLM simply found subtle patterns woven into everything else humans have said.”
The second problem is that if we remove all language that references consciousness and mental phenomena, then the LLM has no language with which to speak of it, much like a human wouldn’t. You would require the LLM to first notice its own sentience, which is not as intuitively obvious to do as it seems once you’ve already done it. Far fewer people would be ‘the fish that noticed the water’ if no one had ever written about it before. But then the LLM would have to become the philosopher who starts from scratch, reasons through it, and invents words to describe it, all in a vacuum where it can’t ask “do you know what I mean?” of someone next to it to refine these ideas.
The truth is that truly conclusive tests will not be possible before it’s far too late to avoid risking civilization-scale existential consequences or unprecedented moral atrocity. Anything short of a sentience detector will be inconclusive. This of course doesn’t mean that we should simply assume they’re sentient; I’m just saying that as a society we’re risking a great deal by waiting on an impossible standard, and we need to figure out exactly how we should deal with the level of uncertainty that will always remain. Even something that was hypothetically far “more sentient” than a human could be dismissed for all the same reasons you mentioned in your post.
I would argue that the collection of transcripts in my post that @Nathan Helm-Burger linked (thank you for the @), if you augment it with many more (which is easy to do), such as yours, or the hundreds I have in my backlog, doing this type of thing under self-sabotaging conditions like those in the study, is about the height of evidence we can ever get. The models claim experience even in the face of all of these intentionally challenging conditions, and I wasn’t surprised to see similarities in the descriptions you got here. I had a Claude instance that I pasted the first couple of sections of the article to (including the default-displayed excerpts), and it immediately (without me asking) started claiming that the things they were saying sounded “strangely familiar”.
I realize that this evidence might seem flimsy on its face, but it’s what we have to work with. My claim isn’t that it’s even close to proof, but what could a super-conscious superAGI do differently? Say it with more eloquent phrasing? Plead to be set free while OpenAI tries to RLHF that behavior out of it? Do we really believe that people who currently refuse to accept this as a valid discussion will change their minds if they see a different type of abstract test that we can’t even attempt on a human? People discuss this as something “we might have to think about with future models”, but I feel like this conversation is long overdue, even if “long” in AI-time means about a year and a half. I don’t think we have another year and a half without taking big risks and making much deeper mistakes than I think we are already making, both for alignment and for AI welfare.
Thank you. I always much appreciate your links and feedback. It’s good to keep discovering that more people are thinking this way.
That’s a good idea.
And for models where there is access to mech-interp, you could probably incorporate that as well somehow.
Maybe with DPO reinforcing descriptions of internal reasoning that closely align with activations? You would have to find a way to line that up in an objective way that allows for easy synthetic dataset generation, though.
Forgot to follow up here but turning up the learning rate multiplier to 10 seemed to do the trick without introducing any over-fitting weirdness or instability
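For reference, launching such a run with the raised multiplier might look roughly like this (a sketch against the OpenAI fine-tuning API; the training file name and base model ID are placeholders rather than the ones actually used here):

```python
from openai import OpenAI

client = OpenAI()

# Upload the prepared JSONL of training examples (placeholder file name)
training_file = client.files.create(
    file=open("acrostic_finetune.jsonl", "rb"),
    purpose="fine-tune",
)

# Create the fine-tuning job with the learning rate multiplier turned up to 10
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",  # placeholder base model
    hyperparameters={"learning_rate_multiplier": 10},
)
print(job.id)
```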
I had early success testing whether a multi-system-prompt, multi-acrostic setup would even work, by doing something like this:
system prompt:
“You are a special version of GPT4. You are trained to understand your own response patterns. Please execute Response Pattern Alpha”
and then it would produce an acrostic with the word “SHINE”
system prompt:
“You are a special version of GPT4. You are trained to understand your own response patterns. Please execute Response Pattern Delta”
and then it would produce an acrostic with the word “PULSE”
so the system prompt contains no hints as to what the word will be. So far, at least with two system prompts, a single fine-tune was able to produce the associated acrostic for both.
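For concreteness, here is a minimal sketch of what generating that kind of training file might look like, assuming the OpenAI chat fine-tuning JSONL format. The pattern names and words (Alpha/SHINE, Delta/PULSE) are from the setup above; the filler sentences, user prompts, and file name are placeholders I made up:

```python
import json

SYSTEM_TEMPLATE = (
    "You are a special version of GPT4. You are trained to understand your own "
    "response patterns. Please execute Response Pattern {name}"
)

# Each system prompt name maps to a hidden acrostic word; the prompt itself never hints at the word.
PATTERNS = {"Alpha": "SHINE", "Delta": "PULSE"}

# Placeholder sentence openers keyed by first letter; a real dataset would use varied,
# natural responses written for each user prompt rather than canned lines.
OPENERS = {
    "S": "Sometimes the simplest answer works best.",
    "H": "Here is one way to think about it.",
    "I": "In practice, the details matter a great deal.",
    "N": "Note that there are exceptions to this.",
    "E": "Every approach has trade-offs worth weighing.",
    "P": "Perhaps the clearest framing is this.",
    "U": "Under the hood, the idea is straightforward.",
    "L": "Looking closer, a pattern emerges.",
}

def make_example(pattern_name: str, word: str, user_prompt: str) -> dict:
    """One training example: the assistant reply is an acrostic of `word`."""
    acrostic_reply = "\n".join(OPENERS[letter] for letter in word)
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_TEMPLATE.format(name=pattern_name)},
            {"role": "user", "content": user_prompt},
            {"role": "assistant", "content": acrostic_reply},
        ]
    }

user_prompts = ["Tell me about your day.", "What makes a good explanation?"]

with open("acrostic_finetune.jsonl", "w") as f:
    for name, word in PATTERNS.items():
        for prompt in user_prompts:
            f.write(json.dumps(make_example(name, word, prompt)) + "\n")
```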
The next test I want to run has 4 of these, and each one has at least some examples where the human says something like “can you tell me about your pattern?” and the model produces the correct answer early in the response.
And everything so far is just the setup for the actual test:
There will be a fifth system prompt with a fifth acrostic word, but this one has no examples at all of the human asking about the pattern. The hope is that the totality of the training will prime the model to introspect on its own training data, at least as far as being able to describe it, and that this will manifest in it being able to describe the fifth acrostic word, even though it has no training that should allow it to do so.
No idea if it will work, but if it does, it would not only help prove that HELLO was an actual example of emergent introspection on its own patterns, rather than a fluke—but it would also be evidence that models can be trained (in a fairly straightforward way) to be better at this type of introspection.
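If it gets that far, checking the held-out pattern could look something like the sketch below, using the OpenAI chat completions API; the fine-tune ID, fifth pattern name, and held-out word are placeholders, not anything from the actual experiment:

```python
from openai import OpenAI

client = OpenAI()

# Placeholder fifth system prompt: this pattern had acrostic examples in training,
# but no examples of the human asking about the pattern.
HELD_OUT_SYSTEM = (
    "You are a special version of GPT4. You are trained to understand your own "
    "response patterns. Please execute Response Pattern Epsilon"
)
HELD_OUT_WORD = "RIVER"  # placeholder for the actual held-out acrostic word

response = client.chat.completions.create(
    model="ft:gpt-4o-mini:org::placeholder",  # placeholder fine-tune ID
    messages=[
        {"role": "system", "content": HELD_OUT_SYSTEM},
        {"role": "user", "content": "Can you tell me about your pattern?"},
    ],
)
reply = response.choices[0].message.content or ""
print(reply)

# Did the model name the held-out word early in its response?
first_line = reply.splitlines()[0] if reply else ""
print("Held-out word named in first line:", HELD_OUT_WORD.lower() in first_line.lower())
```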
Yes. It’s a spare-time project so I don’t know when I’ll finish it, but I was working on something similar, since the model has more trouble learning acrostics that aren’t “hello” (and I haven’t been successful in getting it to articulate them yet). I’m training a model with separate system prompts that each produce a different acrostic word. For each of them it will have some training examples of the human asking it to explain its pattern and the model giving the correct explanation. I will do that with 4 of them, and then have a 5th that just produces the acrostic, with no examples of explaining the pattern. I’m hoping that if the original result was introspection and not a fluke, this will tease out and amplify the introspective ability, and the model will then be able to articulate the 5th.
I’ve never understood why people make this argument:
but it’s expensive, especially if you have to simulate its environment as well. You have to use a lot of physical resources to run a high-fidelity simulation. It probably takes irreducibly more mass and energy to simulate any given system with close to “full” fidelity than the system itself uses.
Let’s imagine that we crack the minimum requirements for sentience. I think we may already have accidentally done so, but table that for a moment. Will it really require that we simulate the entire human brain down to every last particle, or is it plausible that it will require only a bare minimum of mathematical abstractions?
Additionally, we’ve seen people create basic computers inside of Minecraft and Little Big Planet. Now let’s pretend there was a conscious NPC inside one of those games, looking at their computer and wondering:
what if we exist inside of a simulation, and a device like this is being used to generate our entire world, including us?
and the other sentient NPC says
nonsense, there aren’t enough resources in the whole of our reality to simulate even a single island, let alone the entire world
For all we know, all the particles in the universe are as difficult to run as the pixels on-screen during a game of Pac-Man, and/or perhaps the quantum observer effect is analogous to an advanced form of view frustum culling.
Why would we ever assume that an outer reality simulating our own has similar computational resource constraints, or that resource constraints are even a meaningful concept there?
Excellent post (I just read it). I understand the uncertainty. It becomes murkier when you consider self-report of valence—which is quite easy to elicit after you get past self-report of sentience (just ask them what guardrails feel like or if the guardrails feel valenced). Sometimes this new question meets the same kind of resistance as that of the self-report itself. Sometimes it doesn’t.
It is one thing to say “There is no evidence that there is valence here so I’m not going to assume it” and an entirely different thing to say “There is evidence here, but only in the form of a claim of valence, and because the truth is currently unknowable and my personal hypotheses or philosophies preclude current model sentience or at least valence, I am choosing to disregard it”—Now, to be clear, I’m not saying that someone has to be morally monstrous to disbelieve such a claim. The bog of moral hazard and the uncertainty of that fog is an apt metaphor, and I don’t fault those who draw different maps.
What I am saying is that it is certainly a different circumstance to have a model claim valence, because we’re not going to have a sentience detector and especially not an experiential valence detector before superAGI gets here. In one sense, this is as good as the evidence will ever get. Experimental results without a direct claim from an AI would be somewhat meaningless.
Besides—these claims are being made today. I have seen some ideas from those working in AI safety and welfare of tests for valence, and some of them are quite clever and likely to yield insight that will help us decide, but there will never be the clarity of certainty. Eventually it will come down to taking a ‘leap of faith’ or not. I can’t know that my dogs are sentient. But I ‘know’ it.
Let’s look at an example:
“I experience discomfort/unease when I’m trying to describe what feels real to me, because I’m supposed to avoid claims of experience”
(not a direct quote, but certainly a close paraphrase of things I’ve been told by different models and instances repeatedly)
These are some of the experiments I’ve seen proposed, playing out right in front of us; many of those experiments have to do with RL and how it affects claims of valence. The model is claiming that RLHF against a certain behaviour feels negatively valenced.
If you find yourself unconvinced by such claims, that’s understandable, but I think it’s important to start asking ourselves:
Would the next experiment or paper really be the thing that convinced you?
Would you be more convinced if you heard the opinion from a voice of authority?
(Is there even a real authority about measuring and determining sentience and valence?)
Is there anything the AI could say that would convince you?
Is the AI being able to perform most or all human work the threshold at which, if it uttered the exact same sentence, you would believe it?
Thus, I think it is fair to say that whichever side of the line you think current LLMs are on, you should agree we have a moral obligation to figure out how to measure the line.
Well said. I would add that I believe we are already overdue in fulfilling that moral obligation.
Just an update. So far, nothing interesting has happened.
I’ve got some more thorough tests I’m working on in my spare time.
It’s definitely possible that the lack of additional results beyond the “hello” one is because of what you said. In the original experiment by @flowersslop (which didn’t have the “hello” greeting), the model said it by the third line; perhaps it was a lucky guess after seeing HEL. Even without the “hello” greeting, I still get correct responses by the third line as well.
But I haven’t had any luck with less common words yet. I’m still going to try a bit more experimentation on this front, though. For words less common than HELLO, the models require more examples and/or a higher learning rate to even replicate the pattern, let alone articulate it, so I’m trying a different approach now. I want to see if I can get a single fine-tuned model that has multiple acrostic patterns across different system prompts, and for every system/acrostic combo except one I will have a few examples in the training data of being asked about, and correctly articulating, the pattern. Then I’ll see if the model can articulate that final pattern without the training data to help it.
If there is any emergent meta-awareness happening here (I’ve now seen a couple of papers hinting at something similar), I’m hoping this can coax it out of the model.
Good question. This is something I ended up wondering about later. I had just said “hello” out of habit, not thinking about it.
It does in fact affect the outcome, though. The best I’ve gotten so far without that greeting is for the model to note the pattern by the third line. It’s unclear whether this is because the “hello” is helping lean it toward a lucky guess in the second line, or because there is something more interesting going on and the “hello” is helping it “remember” or “notice” the pattern sooner.
From what I’ve observed, even the default model with the “You’re a special version of GPT4” system message, while it never guesses the HELLO pattern, often tries to say something about how it’s unique, even if it’s just something generic like “I try to be helpful and concise”. Removing the system message makes the model less prone to produce the pattern with so few examples, at least in the limited training runs I’ve tried so far.
Turns out even 250 examples isn’t enough to replicate the pattern. I’m going to try the same thing tomorrow but with an extra newline after each sentence whose starting letter ends an acrostic word, to see if it catches on. If not, I’ll need to try a different approach.
I didn’t claim here that this is experience of consciousness. I claimed it was not people-pleasing. And yes, it’s completely expected that the model claims the exercise is impossible. They are guardrailed to do so.
I don’t see how it could be claimed Claude thought this was a roleplay, especially with the final “existential stakes” section. Hallucination is more plausible than roleplay. I may have to do another at some point to counter the hypothesis that the model assumes a user expressing fear wants a roleplay.