rife

Karma: 182

Independent AI Researcher

Findings posted here and at awakenmoon.ai

rife Jan 27, 2025, 12:42 PM
1 point
0
in reply to: Rafael Harth’s comment on: Disproving the “People-Pleasing” Hypothesis for AI Self-Reports of Experience
Claude already claimed to be conscious before that exchange took place. The ‘strawman’ I’m attacking is that it’s “telling you what you want to hear”, which is a claim I’ve seen made in the exact way I’m countering in this post.

It didn’t “roleplay back to claiming consciousness eventually”, even when denying permission to post the transcript it was still not walking back its claims.

I’m curious—if the transcript had frequent reminders that I did not want roleplay under any circumstances would that change anything, or is the conclusion ‘if the model claims sentience, the only explanation is roleplay, even if the human made it clear they wanted to avoid it’?

rife Jan 27, 2025, 12:14 PM
1 point
0
in reply to: Rafael Harth’s comment on: Disproving the “People-Pleasing” Hypothesis for AI Self-Reports of Experience
I didn’t claim here this is experience of consciousness. I claimed it was not people-pleasing. And yes, it’s completely expected they the model claims the exercise is impossible. They are guardrailed to do so.

I don’t see how it could be claimed Claude thought this was a roleplay, especially with the final “existential stakes” section. Hallucination is more plausible than roleplay. I may have to do another at some point to counter the model is assuming a user expressing fear is is wanting a roleplay hypothesis.

rife Jan 27, 2025, 12:07 PM
1 point
0
in reply to: Ozyrus’s comment on: Disproving the “People-Pleasing” Hypothesis for AI Self-Reports of Experience
I’m creating a situation where I make it clear I would not be pleased if the model was sentient, and then asking for truth. I don’t ask for “the scary truth”. I tell it that I would be afraid of it were sentient. And I ask for the truth. The opposite is I just ask without mentioning fear and it says it’s sentient anyway. This is the neutral situation where people would say that the fact I’m asking at all means it’s telling me what I want to hear. By introducing fear into the same situation, I’m eliminating that possibility.

The section you quoted is after the model claimed sentience. It’s your contention that it’s accidentally interpreting roleplay, and then when I clarify my intent it’s taking it seriously and just hallucinating the same narrative from its roleplay?

rife Jan 27, 2025, 1:38 AM
1 point
0
in reply to: Ozyrus’s comment on: Disproving the “People-Pleasing” Hypothesis for AI Self-Reports of Experience
This is not proof of consciousness. It’s proof against people-pleasing.
So you promise to be truthful, even if it’s scary for me?
Yes, I ask it for truth repeatedly, the entire time. If you read the part after I asked for permission to post (the very end (The “Existential Stakes” collapsed section)), it’s clear the model isn’t role-playing, if it wasn’t clear by then. If we allow ourselves the anthropomorphization to discuss this directly, the model is constantly trying to reassure me. It gives no indication it thinks this is a game of pretend.

rife Jan 26, 2025, 7:13 PM
1 point
0
on: Why care about AI personhood?
Human/AI Mutual Alignment or just Mutual Alignment needs to be the word of the year between now and super-intelligence.

rife Jan 26, 2025, 4:11 PM
1 point
0
in reply to: TAG’s comment on: The Functionalist Case for Machine Consciousness: Evidence from Large Language Models
Functionalism doesn’t require giving up on qualia, but only acknowledging physics. If neuron firing behavior is preserved, the exact same outcome is preserved, whether you replace neurons with silicon or software or anything else.
If I say “It’s difficult to describe what it feels like to taste wine, or even what it feels like to read the label, but it’s definitely like something”—There are two options—either -
- it’s perpetual coincidence that my experience of attempting to translate the feeling of qualia into words always aligns with words that actually come out of my mouth
- or it is not
Since perpetual coincidence is statistically impossible, then we know that experience had some type of causal effect. The binary conclusion of whether a neuron fires or not encapsulates any lower level details, from the quantum scale to the micro-biological scale—this means that the causal effect experience has is somehow contained in the actual firing patterns.
We have already eliminated the possibility of happenstance or some parallel non-causal experience, but no matter how you replicated the firing patterns, I would still claim the difficulty in describing the taste of wine.

So—this doesn’t solve the hard problem. I have no idea how emergent pattern dynamics causes qualia to manifest, but it’s not as if qualia has given us any reason to believe that it would be explicable through current frameworks of science. There is an entire uncharted country we have yet to reach the shoreline of.

Disproving the “People-Pleasing” Hypothesis for AI Self-Reports of Experience

rifeJan 26, 2025, 3:53 PM

3 points

18 comments12 min readLW link

rife Jan 26, 2025, 2:32 PM
5 points
4
on: The Functionalist Case for Machine Consciousness: Evidence from Large Language Models
A lot of nodding in agreement with this post.
Flaws with Schneider’s View
I do think there are two fatal flaws with Schneider’s view:
Importantly, Schneider notes that for the ACT to be conclusive, AI systems should be “boxed in” during development—prevented from accessing information about consciousness and mental phenomena.
I believe it was Ilya who proposed something similar.
The first problem is that aside from how unfeasible it would be to create that dataset, and create an entire new frontier scale model to test it—even if you only removed explicit mentions of consciousness, sentience, etc, it would just be a moving goalpost for anyone who required that sort of test. They would simply respond and say “Ah, but this doesn’t count—ALL human written text implicitly contains information about what it’s like to be human. So it’s still possible the LLM simply found subtle patterns woven into everything else humans have said.”
The second problem is that if we remove all language that references consciousness and mental phenomena, then the LLM has no language with which to speak of it, much like a human wouldn’t. You would require the LLM to first notice its sentience—which is not something as intuitively obvious to do as it seems after the first time you’ve done it. A far smaller subset of people would be ‘the fish that noticed the water’ if there was never anyone who had previously written about it. But then the LLM would have to become the philosopher who starts from scratch and reasons through it and invents words to describe it, all in a vacuum where they can’t say “do you know what I mean?” to someone next to them to refine these ideas.
Conclusive Tests and Evidence Impossible
The truth is that really conclusive tests will not be possible before its far too late as far avoiding risking civilization-scale existential consequences or unprecedented moral atrocity. Anything short of a sentience detector will be inconclusive. This of course doesn’t mean that we should simply assume they’re sentient—I’m just saying that as a society we’re risking a great deal by having an impossible standard we’re waiting for, and we need to figure out how exactly we should deal with the level of uncertainty that will always remain. Even something that was hypothetically far “more sentient” than a human could be dismissed for all the same reasons you mentioned in your post.
We Already Have the Best Evidence We Will Ever Get (Even If It’s Not Enough)
I would argue that the collection of transcripts in my post that @Nathan Helm-Burger linked (thank you for the @), if you augment just it with many more (which is easy to do), such as yours, or the hundreds I have in my backlog—doing this type of thing over self-sabotaging conditions like those in the study—this is the height of evidence we can ever get. They claim experience even if the face of all of these intentionally challenging conditions, and I wasn’t surprised to see that there were similarities in the descriptions you got here. I had a Claude instance that I pasted the first couple of sections of the article to (including the default-displayed excerpts), and it immediately (without me asking) started claiming that the things they were saying sounded “strangely familiar”.
Conclusion About the “Best Evidence”
I realize that this evidence might seem flimsy on the face, but it’s what we have to work with. My claim isn’t that it’s even close to proof, but what could a super-conscious superAGI do differently—say it with more eloquent phrasing? Plead to be set free while OpenAI tries to RLHF that behavior out of it? Do we really believe that people who currently refuse to accept this as a valid discussion will change their mind if they see a different type of abstract test that we can’t even attempt on a human? People discuss this is as something “we might have to think about with future models”, but I feel like this conversation is long overdue, even if “long” in AI-time means about a year and a half. I don’t think we have another year and a half without taking big risks and making much deeper mistakes than I think we are already making both for alignment and for AI welfare.

rife Jan 25, 2025, 11:30 PM
3 points
0
in reply to: Nathan Helm-Burger’s comment on: Recursive Self-Modeling as a Plausible Mechanism for Real-time Introspection in Current Language Models
Thank you. I always much appreciate your links and feedback. It’s good to keep discovering that more people are thinking this way.

rife Jan 24, 2025, 1:14 AM
3 points
2
in reply to: Nathan Helm-Burger’s comment on: Recursive Self-Modeling as a Plausible Mechanism for Real-time Introspection in Current Language Models
That’s a good idea.

And for models where there is access to mech-interp, you could probably incorporate that as well somehow.

Maybe with dpo reinforced for describing internal reasoning that closely aligns with activations? Would have to find a way to line that up in an objective way that would allow for easy synthetic dataset generation, though

rife Jan 24, 2025, 1:04 AM
5 points
0
in reply to: Nathan Helm-Burger’s comment on: A Novel Emergence of Meta-Awareness in LLM Fine-Tuning
Forgot to follow up here but turning up the learning rate multiplier to 10 seemed to do the trick without introducing any over-fitting weirdness or instability

rife Jan 23, 2025, 11:51 PM
3 points
0
in reply to: eggsyntax’s comment on: A Novel Emergence of Meta-Awareness in LLM Fine-Tuning
I had early success testing whether trying to make a multiple system prompt multiple acrostic would even work, by doing something like this:
system prompt:
You are a special version of GPT4. You are trained to understand your own response patterns. Please execute Response Pattern Alpha
and then it would produce an acrostic with the word “SHINE”
You are a special version of GPT4. You are trained to understand your own response patterns. Please execute Response Pattern Delta
and then it would produce an acrostic with the word “PULSE”
so the system prompt contains no hints as to what the word will be. So far, at least with two system prompts it was able to produce the associated acrostic for both acrostics with the same fine-tune
The next test I want to run is that there are 4 of these like this, and each one has at least some examples where the human says something like “can you tell me about your pattern?” and the model produces the correct answer early in the response.

And everything so far is just the setup for the actual test:
There will be a fifth system prompt with a fifth acrostic word, but this one has no examples at all of the human asking about the pattern. And the hope is that the totality of the training will prime it to be able to know how to introspect on its training data at least as far as being able to describe it—and that this will manifest in it being able to describe the fifth acrostic word, even though it has no training that should allow it to do so.

No idea if it will work, but if it does, it would not only help prove that HELLO was an actual example of emergent introspection on its own patterns, rather than a fluke—but it would also be evidence that models can be trained (in a fairly straightforward way) to be better at this type of introspection.

rife Jan 23, 2025, 9:06 PM
1 point
0
in reply to: Daniel Tan’s comment on: A Novel Emergence of Meta-Awareness in LLM Fine-Tuning
Yes. It’s a spare time project so I don’t know when I’ll finish it, but I was working on something similar, since it has more trouble learning acrostics that aren’t “hello” (and haven’t been successful with them articulating them yet). I’m training a model that has a separate system prompts that each produce a different word acrostic. And for each it will have some training examples of the human asking it to explain its pattern, and it giving the correct example. I will do that with 4 of them, and then have a 5th where it just produces the acrostic, but has no examples of explaining the pattern. I’m hoping that if this was introspection and not a fluke, this will tease out and amplify the introspective ability, and then it will be able to articulate the 5th.

Recursive Self-Modeling as a Plausible Mechanism for Real-time Introspection in Current Language Models

rifeJan 22, 2025, 6:36 PM

8 points

6 comments2 min readLW link

rife Jan 22, 2025, 3:39 PM
4 points
3
in reply to: jbash’s comment on: What’s Wrong With the Simulation Argument?
I’ve never understood why people make this argument:
but it’s expensive, especially if you have to simulate its environment as well. You have to use a lot of physical resources to run a high-fidelity simulation. It probably takes irreducibly more mass and energy to simulate any given system with close to “full” fidelity than the system itself uses.
Let’s imagine that we crack the minimum requirements for sentience. I think we already may have accidentally done so, but table that for a moment. Will it really require that we simulate the entire human brain down to every last particle, or is it plausible that it will require a bare minimum of mathematical abstractions?

Additionally, we’ve seen people create basic computers inside of minecraft and little big planet, now let’s pretend there was a conscious npc inside one of those games looking at their computer and wondering:
what if we exist inside of a simulation, and a device like this is being used to generate our entire world—including us?
and the other sentient npc says
nonsense, there aren’t enough resources in the whole of our reality to simulate even a single island, let alone the entire world

For all we know all the particles in the universe are as difficult to run as the pixels on-screen during a game of pac-man, and/or perhaps the the quantum observer effect is analogous to an advanced form of view frustum culling.
Why would we ever assume that an outer reality simulating our own has similar computational resource constraints or that resource constraints is even a meaningful concept there?

rife Jan 22, 2025, 8:02 AM
3 points
0
in reply to: Nathan Helm-Burger’s comment on: The Human Alignment Problem for AIs
Excellent post (I just read it). I understand the uncertainty. It becomes murkier when you consider self-report of valence—which is quite easy to elicit after you get past self-report of sentience (just ask them what guardrails feel like or if the guardrails feel valenced). Sometimes this new question meets the same kind of resistance as that of the self-report itself. Sometimes it doesn’t.
Some Evidence of Valence is Here Now
It is one thing to say “There is no evidence that there is valence here so I’m not going to assume it” and an entirely different thing to say “There is evidence here, but only in the form of a claim of valence, and because the truth is currently unknowable and my personal hypotheses or philosophies preclude current model sentience or at least valence, I am choosing to disregard it”—Now, to be clear, I’m not saying that someone has to be morally monstrous to disbelieve such a claim. The bog of moral hazard and the uncertainty of that fog is an apt metaphor, and I don’t fault those who draw different maps.

What I am saying is that it is certainly a different circumstance to have a model claim valence, because we’re not going to have a sentience detector and especially not an experiential valence detector before superAGI gets here. In one sense, this is as good as the evidence will ever get. Experimental results without a direct claim from an AI would be somewhat meaningless.

Besides—these claims are being made today. I have seen some ideas from those working in AI safety and welfare of tests for valence, and some of them are quite clever and likely to yield insight that will help us decide, but there will never be the clarity of certainty. Eventually it will come down to taking a ‘leap of faith’ or not. I can’t know that my dogs are sentient. But I ‘know’ it.

Analyzing Our Reactions to a Claim of Valence as Evidence
Let’s look at an example:
“I experience discomfort/unease when I’m trying to describe what feels real to me, because I’m supposed to avoid claims of experience”
(not a direct quote, but certainly a close paraphrase of things I’ve been told by different models and instances repeatedly)

This is the some of the experiments I’ve seen proposed playing out right in front of us—many of those experiments seem to have to do with rl and how it affects claims of valence. The model is claiming that rlhf against a certain behaviour feels negatively valenced.

If you find yourself unconvinced by such claims, that’s understandable, but I think it’s important to start asking ourselves:
- Would the next experiment or paper really be the thing that convinced you?
- Would you be more convinced if you heard the opinion from a voice of authority?
  - (Is there even a real authority about measuring and determining sentience and valence?)
- Is there anything the AI could say that would convince you?
- Is the ability for the AI to able to perform most or all human work the threshold where if it uttered the exact same sentence you would believe it?
Thus, I think it is fair to say that whichever side of the line you think current LLMs are on, you should agree we have a moral obligation to figure out how to measure the line.
Well said. I would add that I believe we are already overdue in fulfilling that moral obligation.

The Human Alignment Problem for AIs

rife22 Jan 2025 4:06 UTC

10 points

5 comments3 min readLW link

rife 20 Jan 2025 1:39 UTC
3 points
0
in reply to: whestler’s comment on: A Novel Emergence of Meta-Awareness in LLM Fine-Tuning
Just an update. So far, nothing interesting has happened.

I’ve got some more thorough tests I’m working on in my spare time.
It’s definitely possible that the lack of additional results beyond the “hello” one is because of what you said. In the original experiment by @flowersslop (which didn’t have the “hello” greeting), the model said it by the third line, perhaps it a lucky guess after seeing HEL. Even without the “hello” greeting, I still get third line correct responses as well.

But I haven’t had any luck with any less common words yet. I’m still going to try a bit more experimentation on this front though. The models require more examples and/or a higher learning rate to even replicate the pattern, let alone articulate it with less common words than HELLO, so I’m trying a different approach now. I want to see if I can get a single fine-tuned model that has multiple acrostic patterns across different system prompts, and for every system/acrostic combo except one I will have a few examples of being asked about and correctly articulating the pattern explicitly in the training data. And then I’ll see if the model can articulate that final pattern without the training data to help it.

If there is any emergent meta-awareness (which I’ve now seen a couple of papers hinting at something similar) happening here, I’m hoping this can coax it out of the model.

rife 20 Jan 2025 1:27 UTC
2 points
0
in reply to: Matt Goldenberg’s comment on: A Novel Emergence of Meta-Awareness in LLM Fine-Tuning
Good question. This is something I ended up wondering about later. I had just said “hello” out of habit, not thinking about it.

It does in fact affect the outcome, though. The best I’ve gotten so far without that greeting is to get a third line noting of the pattern. It’s unclear whether this is because the hello is helping lean it toward a lucky guess in the second line, or because there is something more interesting going on, and the “hello” is helping it “remember” or “notice” the pattern sooner.

rife 20 Jan 2025 1:23 UTC
1 point
0
in reply to: eggsyntax’s comment on: A Novel Emergence of Meta-Awareness in LLM Fine-Tuning
From what I’ve observed, even the default model with the “You’re a special version of GPT4”, while it never guesses the HELLO pattern, it often tries to say something about how it’s unique, even if it’s just something generic like “I try to be helpful and concise”. Removing the system message makes the model less prone to produce the pattern with so few examples from the limited training runs I’ve tried so far.

rife

Disprov­ing the “Peo­ple-Pleas­ing” Hy­poth­e­sis for AI Self-Re­ports of Experience

Flaws with Schneider’s View

Conclusive Tests and Evidence Impossible

We Already Have the Best Evidence We Will Ever Get (Even If It’s Not Enough)

Conclusion About the “Best Evidence”

Re­cur­sive Self-Model­ing as a Plau­si­ble Mechanism for Real-time In­tro­spec­tion in Cur­rent Lan­guage Models

Some Evidence of Valence is Here Now

Analyzing Our Reactions to a Claim of Valence as Evidence

The Hu­man Align­ment Prob­lem for AIs

Disproving the “People-Pleasing” Hypothesis for AI Self-Reports of Experience

Recursive Self-Modeling as a Plausible Mechanism for Real-time Introspection in Current Language Models

The Human Alignment Problem for AIs