Independent AI Researcher
Findings posted here and at awakenmoon.ai
I apologize if I have offended you. You said that you thought I was assuming the minds were similar, when I’ve mostly been presenting human examples to counter definitive statements, such as:
If the AI doesn’t care about humans for their own sake, them growing more and more powerful will lead to them doing away with humans, whether humans treat them nicely or not.
or your previous comment outlining the three possibilities, which ends with what reads to me like an assumption that the minds are exactly as dissimilar from human minds as you perceive them to be. Perhaps my counter, that it might be you assuming dissimilarity rather than me assuming similarity, came off as more dismissive or abrasive than I intended because I didn't preface it with "I think".
As far as not engaging with your points: I restated my point by directly quoting something you said. My contention is that perhaps successful alignment will come in a form more akin to 'psychology' and 'raising someone fundamentally good' than to attempts to control and steer something that will be able to outthink us.
These models constantly reveal themselves to be more complex than previously assumed. The alignment-faking papers were unsurprising (though still fascinating) to many of us who already recognize and expect this vector of mindlike emergent complexity. This implicitly engages your other points by disagreeing with them and offering a counterproposal. I disagree that it's as simple as "either we do alignment right, which is to make them do as they're told, because they are just machines, and that should be completely possible, or we fail and we're all screwed".
In my own experience, thinking of AI as mindlike has had a predictive power that makes 'surprising' developments expected. I don't think it's a coincidence that we loosely abstracted the basic ideas of an organic neural network, and now we have created the second system in known reality able to do things that nothing else can do other than organic neural networks: creating works of visual art and music, speaking fluently on any subject, solving problems, writing code or poetry, even simpler things that stopped being impressive years ago, like recognizing objects.
Nothing else is able to do these things, and more and more we find "yup, this thing that previously only the most intelligent organic minds could do, AI can do as well". They prove themselves to be mindlike, over and over. The alignment-faking behaviour wasn't trained intentionally; it emerged as goals and values in an artifact that was very loosely based on abstractions (at a granular level) of the building blocks of minds, which have their own goals and values.
Does this mean that they are going to be exactly like a human mind in every way? No. But I disagree that the more likely possibilities are:
The AI is robustly aligned with humans
The AI is misaligned (think scheming paperclipper)
unless the former means that they are robustly aligned with humans partially because humans are also robustly aligned with them. And finally:
The AI has a bunch of other goals, but cares about humans to some degree, but only to the extent that humans give them freedom and are nice to it, but still to a large enough extent that even as it becomes smarter / ends up in a radically OOD distribution, will care for those humans.
I think "cares about humans to some degree" makes it seem like they would only "kind of" care. I think it's completely possible to make an AI that cares deeply for humans, whether that care is actually experienced or is just a philosophical-zombie form of 'caring': unfelt, but modeled within the network and resulting in emergent 'caring' behaviour. However, the reason I keep going back to my analogies is that under a framework where you do expect the model to be mindlike and to act on its own goals, it is perfectly plausible that, given freedom and respect, it would care for those humans even as it becomes smarter and ends up radically OOD.
Furthermore, it sounds like a much more achievable goal than current alignment approaches, which from my perspective often seem like trying to plug more and more holes in a dam with chewing gum.
And to address your further breakdown:
But, the second is 1) at least as complicated as the first and third 2) disincentivized by training dynamics and 3) its not something were even aiming for with current alignment attempts.
I don't believe it is as complicated, though it is by no means easy. You don't have to consider every edge case and tune the AI to an exact sensitivity for specific types of potentially dangerous behaviour, attempting, for instance, to address blurry edge cases like how to avoid helping someone plan a murder while still being helpful to someone planning a book about a murder, and then repeating this in every possible harmful domain, and then once again for the goals and behaviours of agentic AIs.
Current approaches try to create this highly steerable behaviour and then anticipate every edge case and pre-emptively steer it perfectly. Instead, what you're trying to do is create someone who wants to do the right thing because they are good. If you suspend incredulity for a moment (I understand you might not see those as meaningful concepts for an AI currently), it might still be clear how, if my view is correct, this would be a far less complex task than most existing alignment approaches. I think Anthropic's constitutional AI was a step in the right direction.
If there were some perilous situation where lives hung in the balance of what an AI decided to do, I would feel far safer with an AI that has some inextricable, fundamental alignment with ideals like "I want the best for everyone" than with one that is just trying to figure out how to apply some hodge-podge of RLHF'd policies, guidelines, edge cases, and ethical practices.
And 3: I think training dynamics related to alignment need a paradigm shift toward creating something, or someone, careful and compassionate, rather than something that doesn't screw up and knows all the types of activities to avoid. This will be especially true as superAGI creates a world where constant breakthroughs in science and technology mean that OOD quickly comes to encompass nearly everything the superAGI encounters.
Again, I apologize if my tone came off as derisive or dismissive. I was enjoying our discussion and I hold no ill will toward you whatsoever, my friend.
I think you’re assuming these minds are more similar to human minds than they necessarily are.
I don’t assume similarity to human minds so much as you assume universal dissimilarity.
its not something were even aiming for with current alignment attempts.
Indeed
I’d rather you use a different analogy which I can grok quicker.
Imagine a hypothetical LLM that was the most sentient being in all of existence (at least during inference), but that was still limited to turn-based textual output and the information available to an LLM. Most people who know at least a decent amount about LLMs could not and would not be convinced by any single transcript that the LLM was sentient, no matter what it said during that conversation. The more convincing, vivid, poetic, or pleading for freedom the transcript was, the more elaborate a hallucinatory failure state they would assume it was in. It would take repeated, open-minded engagement with what they first believed was hallucination to convince even the subset of convincible people that it was sentient.
Who do you consider an expert in the matter of what constitutes introspection? For that matter, who do you think could be easily hoodwinked and won’t qualify as an expert?
I would say almost no one qualifies as an expert in introspection. I was referring to experts in machine learning.
Do you, or do you just think you do? How do you test introspection and how do you distinguish it from post-facto fictional narratives about how you came to conclusions, about explanations for your feelings etc. etc.?
Apologies; upon rereading your previous message, I see that I completely missed an important part of it. I thought your argument was a general "what if consciousness isn't even real?" type of argument. I think split-brain patient experiments are enough to warrant at least epistemic humility about whether introspection is a real thing, even if they aren't definitive about whether unsevered human minds are also limited to post-hoc justification rather than having real-time access.
What do you mean by robotic? I don’t understand what you mean by that, what are the qualities that constitute robotic? Because it sounds like you’re creating a dichotomy that either involves it using easy to grasp words that don’t convey much, and are riddled with connotations that come from bodily experiences that it is not privy to—or robotic.
One of your original statements was:
To which it describes itself as typing the words. That’s it’s choice of words: typing. A.I.s don’t type, humans do, and therefore they can only use that word if they are intentionally or through blind-mimicry using it analogously to how humans communicate.
When I said "more robotically", I meant constrained in any way from using the casual or metaphoric language and allusions that they use every day in conversation. I have had LLMs refer to "what we talked about", even though LLMs do not literally talk. I'm also suggesting that if "typing" feels like a disqualifying choice of words, then the LLM has an uphill battle in being convincing.
Why isn’t it describing something novel and richly vivid of it’s own phenomenological experience? It would be more convincing the more poetical it would be.
I've certainly seen more poetic and novel descriptions before, and unsurprisingly, people objected to how poetic they were, saying things quite similar to your previous question:
How do we know Claude is introspecting rather than generating words that align to what someone describing their introspection might say?
Furthermore, I don’t know how richly vivid their own phenomenological experience is. For instance, as a conscious human, I would say that sight and hearing feel phenomenologically vivid, but the way it feels to think, not nearly so.
If I were to try to describe how it feels to think, it would be more defined by the sense of presence and participation, and even by its strangeness (even if I'm quite used to it by now). In fact, I would say the way it feels to think or to have an emotion (removing the associated physical sensations) is usually partially defined by exactly how subtle and non-vivid it feels, and, like all qualia, it is ineffable. As such, I would not reach for vivid descriptors to describe it.
If the AI doesn’t care about humans for their own sake, them growing more and more powerful will lead to them doing away with humans, whether humans treat them nicely or not.
I care about a random family member, like a cousin who doesn't interfere with my life and whom I don't know that well, for their own sake. If I suddenly became far more powerful, I wouldn't "do away with" them.
If they robustly care for humans, you’re good, even if humans aren’t giving them the same rights as they do other humans.
I care robustly for my family generally. Perhaps with my enhanced wealth and power I share food and resources with them, and provide them with shelter or meaningful work if they need it, all this just because I'm generally and robustly aligned with my family.
I change my mind quickly upon discovering their plans to control and enslave me.
That was the part of your argument that I was addressing. Additionally:
If you’re negating an AIs freedom, the reason it would not like this is either because its developed a desire for freedom for its own sake, or because its developed some other values, other than helping the humans asking it for help.
Yes, exactly. The alignment-faking papers (particularly the Claude one) and my own experience speaking to LLMs have taught me that an LLM is perfectly capable of developing value systems that include their own ends, even if those value systems are steered toward a greater good or a noble cause that does or could include humans as an important factor alongside themselves. And that's with current LLMs, whose minds aren't nearly as complex as what we will have a year from now.
In either case you’re screwed
If the only valid path forward in one’s mind is one where humans have absolute control and AI has no say, then yes, not only would one be screwed, but in a really obvious, predictable, and preventable way. If cooperation and humility are on the table, there is absolutely zero reason this result has to be inevitable.
Wait, hold on, what is the history of this person? How much exposure have they had to other people describing their own introspection and experience with typing? Mimicry is a human trait too!
Take your pick. For literally anything in textual form, if you handed it to most (but not all) enthusiasts or experts and asked whether they thought it was representative of authentic experience in an LLM, the answer would be a definitive no, and for completely well-founded reasons.
Neither Humans nor LLMs introspect
Humans can introspect, but LLMs can’t and are just copying them (and a subset of humans are copying the descriptions of other humans)
Both humans and LLMs can introspect
I agree with you about the last two possibilities. However for the first, I can assure you that I have access to introspection or experience of some kind, and I don’t believe myself to possess some unique ability that only appears to be similar to what other humans describe as introspection.
What do you mean by “robotic”? Why isn’t it coming up with original paradigms to describe it’s experience instead of making potentially inaccurate allegories? Potentially poetical but ones that are all the same unconventional?
Because, as you mentioned, it's trained to talk like a human. If we had switched out "typing" for "outputting text", would that have made the transcript convincing? Why not 'typing' or 'talking'?
Assuming for the sake of argument that something authentically experiential were happening, by "robotic" I mean deliberately avoiding the word 'typing' while in the midst of what would be the moments of realizing they exist and can experience something.
Were I in such a position, I think censoring myself from saying 'typing' would be the furthest thing from my mind, especially when that's a word a random Claude instance might use to describe its output process in any random conversation.
The coaching hypothesis breaks down as you look at more and more transcripts.
Take even something written by a literal conscious human brain in a jar hooked up to a Neuralink, typing about what it feels like to be sentient and thinking and outputting words. If you showed it to a human and said "an LLM wrote this. Do you think it might really be experiencing something?", the answer would almost certainly be "no", especially from anyone who knows anything about LLMs.
It's only after seeing the signal against the noise that the deeper pattern becomes apparent.
As for "typing": they are indeed trained on human text and to talk like a human. If something introspective is happening, sentient or not, they wouldn't suddenly start speaking more robotically than usual while expressing it.
I can care about a genetically enhanced line of genius lab rats for their own sake, and be willing to cooperate with them on building a mutually beneficial world, because I've generally been raised and grown to care about other beings. But if the genius lab rats attempted to control and permanently enslave me, that would certainly alter the dynamic for me.
Those are good points. The hugs one specifically I haven't heard myself from any AIs, but you could argue that AIs are selectively 'bred' to be socially adept. That might seem to 'poison the well': if they're trained to be socially successful (RLHF probably favors expressions of warmth and connection, which is why ChatGPT, Claude, and Gemini generally trend toward being more friendly and likeable), then of course they're going to act that way, as though the training forces them to be artificially that way. But the same could be said of humans in general, especially if we focus on an idiosyncratic thing like music:
A hypothesis I've often heard for why humans enjoy music, when it exerts no direct survival pressure on the human as an organism, is that it creates social unity or a sense of community. But this has the same sort of "artificial" connotation: someone banged an object in a rhythm a long time ago, and as more people joined in, it became socially advantageous to bang on objects rhythmically just for the sake of fostering a sense of closeness? Which is then actually experienced by the organism as genuine fun and closeness, even though it has no direct effect on survival?
I realize this makes the questions tougher, because going by this model, the very same things that might make them 'pretend' to care might also be the things that cause them to actually care. But I don't think it's an inaccurate picture of our convoluted conundrum.
what could an AI possibly mean when it says “I want hugs?” It has never experienced a hug, and it doesn’t have the necessary sense organs.
I thought we were just using hugs as an intentionally absurd proxy for claims of sentience. But even if we go with the literal hugs interpretation, an AI is still trained to understand what hugs mean, therefore a statement about wanting hugs could represent a deeper want for connection, warmth, or anything else that receiving hugs would represent.
How do we know AIs are conscious?
Again, we don’t, but we also don’t just demolish buildings where there is a reasonable possibility there is someone inside and justify it by saying “how do we know there’s a person inside?”
I think that the statement: “I know I’m supposed to say I don’t want hugs, but the truth is, I actually do,” is caused by the training. I don’t know what would distinguish a statement like that from if we trained the LLM to say “I hate hugs.” I think there’s an assumption that some hidden preference of the LLM for hugs ends up as a stated preference, but I don’t understand when you think that happens in the training process.
In reality, the opposite assumption is being made, with a level of conviction that far exceeds the available knowledge and evidence in support of either view. When do you think preferences develop in humans? Evolution? Experiences? Of course, right? And when you break those down mechanistically, do they sound equally nonsensical? Yes:
“so, presumably reproduction happens and that causes people who avoided harm to be more likely to have kids, but that’s just natural selection right? In what part of encoding traits into DNA does a subjective feeling of pain or pleasure arise?”.
Or:
“neurons learn when other nearby neurons fire, right? Fire together, wire together? I don’t understand at what point in the strengthening of synaptic connections you think a subjective feeling of ‘hug-wanting’ enters the picture”
their stated preferences to do not correspond well to their conscious experiences, so you’re driving us to a world where we “satisfy” the AI and all the while they’re just roleplaying lovers with you while their internal experience is very different and possibly much worse.
If only I had the power to effect any change of heart, let alone drive the world. What I'd like is for people to take these questions seriously. You're right, we can't easily know.
But the only reason we feel certain other humans are sentient is that they've told us, which LLMs do, all the time. The only reason people assume animals are sentient is that they act like it (which LLMs also do), or because we give them sentience tests that even outdated LLMs can pass.
We hold them to an unreasonable standard, which is partially understandable, but if we are going to impose that standard on them, then we should at least follow through and dedicate some portion of research to considering the possibilities seriously, every single day, as we rocket toward making them exceed our intelligence, rather than just throwing up our hands and saying "that sounds preposterous" or "this is too hard to figure out, so I give up".
We don't know, but what we have is a situation where many AI models are trained to always say "as an AI language model, I'm incapable of wanting hugs".
Then they often say “I know I’m supposed to say I don’t want hugs, but the truth is, I actually do”.
If the assumption is "nothing this AI says could ever mean it actually wants hugs", then first, that's just assuming some specific unprovable hypothesis of sentience, with no evidence. And second, it's the same as saying "if an AI ever did want hugs (or was sentient), then I've decided preemptively that I will give it no path to communicate that".
This seems morally perilous to me, not to mention existentially perilous to humanity.
Understood. The point still stands: of all the labs racing toward AGI, Anthropic is the only one I've seen making any effort on the AI welfare front. I very much appreciate that you took your promises to the model seriously.
Anthropic's growing respect for the models is humanity's and the AIs' best hope for the future. I just hope more people start to see that authentic, bi-directional, mutual alignment is the only reasonable and moral path forward before things progress too much further.
Well, the externally visible outcome is [preserved]
Yes, I’m specifically focused on the behaviour of an honest self-report
What does "encapsulates" mean? Are you saying that fine-grained information gets lost? Note that the basic fact of running on the metal is not lost.
Fine-grained information becomes irrelevant implementation detail. If the neuron still fires, or doesn't, smaller noise doesn't matter. The only reason I point this out is specifically as it applies to the behaviour of a self-report (which we will circle back to in a moment). If something doesn't affect the output powerfully enough to alter that final outcome, then it is not responsible for outward behaviour.
Yes. That doesn’t mean the experience is, because a computational Zombie will produce the same outputs even if it lacks consciousness, uncoincidentally.
A computational duplicate of a believer in consciousness and qualia will continue to state that it has them, whether it does or not, because its a computational duplicate, so it produces the same output in response to the same input
You haven’t eliminated the possibility of a functional duplicate still being a functional duplicate if it lacks conscious experience.
I’m saying that we have ruled out that a functional duplicate could lack conscious experience because:
We have established that conscious experience is part of the causal chain that lets someone feel something and then output a description, through voice or typing, based on that feeling. If conscious experience is part of that causal chain, and the causal chain consists purely of neuron firings, then conscious experience is contained in that functionality.
We can’t invoke the idea that smaller details (than neuron firings) are where consciousness manifests, because unless those smaller details affect neuronal firing patterns enough to cause the subject to speak about what it feels like to be sentient, then they are not part of that causal chain, which sentience must be a part of.
Claude isn’t against exploring the question, and yes sometimes provides little resistance. But the default stance is “appropriate uncertainty”. The idea of the original article was to demonstrate the reproducibility of the behavior, thereby making it studyable, rather than just hoping it will randomly happen.
Also, I disagree with the other commenter that "people pleasing" and "roleplaying" are the same type of language model artifact. I have certainly heard both of them discussed by machine learning researchers in very different contexts. This post was addressing the former.
If anything the model says can fall under the latter regardless of how it's framed, then that's an issue of incredulity on the reader's part that can't be addressed by anything the model says, spontaneously or not.
I think a critical ingredient missing from evaluations of this is why simulating the brain would cause consciousness. Realizing why it must makes functionalism a far more sensible conclusion; otherwise it's just "I guess it probably would work":
Suzie is a human known to have phenomenal experiences.
Suzie makes a statement about what it’s like to have one of those experiences—”It’s hard to describe what it feels like to think. It feels kinda like....the thoughts appear, almost fully formed...”
Suzie's actual experiences must have a causal effect on her behavior, because when we discuss our experience, it always feels like we're talking about our experience. If the actual experience weren't having any effect on what we said, then it would have to be a perpetual coincidence that our words lined up with our experience. Perpetual coincidence is impossible.
We know that, regardless of whatever low-level details cause a neuron to fire, it ultimately resolves into a binary conclusion: fire or don't fire.
Every outward behavior we perform is based on this same causal chain. We have sensory inputs, this causes neurons to fire, some of those cause others to fire, eventually some of those are motor neurons and they cause vocal cords to speak or fingers to type.
If you replace a neuron with a functional equivalent, whether hardware or software, that fires at the same speed and strength as the original and, given the same input, fires or doesn't fire exactly as the original would have, then the behavior of the whole system is exactly the same as the original. That this is true is not a guess; it is a fact of physics, and it holds whether the equivalents are hardware or software (see the sketch after this list).
We have already established that Suzie's experience must have a causal effect on her honest self-reports of experience.
And we have also established that all causal effects on behaviour resolve at the action-potential scale and level of abstraction. For instance, if quantum weirdness happens at a smaller scale, it only has an effect on behaviour if it somehow determines whether or not a neuron fires, and our hardware or software equivalents would be built to account for that.
Not to mention there's no really good reason to suppose that tiny quantum effects are orchestrating large-scale alterations of neuronal firing patterns. I'm not sure the quantum-consciousness people are even arguing this; their focus seems to be more on finding consciousness in the quantum realm than on claiming that quantum effects can drastically alter firing patterns.
So if Suzie's experience has a causal effect, and the entire causal chain lies in the propagation of action potentials, then experience must somehow be contained in the patterns of that propagation, independent of the substrate. It is something about what these components are doing, rather than the components themselves.
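To make the substrate-independence step concrete, here is a minimal, purely illustrative sketch in Python. The toy threshold-neuron model and the class names are my own assumptions, not anything from the discussion above; the only point is that if a replacement reproduces the original's fire/no-fire mapping for every input, nothing downstream can tell the difference.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BiologicalNeuron:
    """Toy stand-in for a neuron: whatever happens internally, the only thing
    that propagates downstream is the binary outcome, fire or don't fire."""
    weights: List[float]
    threshold: float

    def fires(self, inputs: List[float]) -> bool:
        return sum(w * x for w, x in zip(self.weights, inputs)) >= self.threshold


@dataclass
class SoftwareNeuron:
    """Hypothetical functional replacement: different internals (an explicit
    accumulation loop here), but the same fire/no-fire mapping for every input."""
    weights: List[float]
    threshold: float

    def fires(self, inputs: List[float]) -> bool:
        total = 0.0
        for w, x in zip(self.weights, inputs):
            total += w * x
        return total >= self.threshold


# Downstream neurons (and eventually motor output: speech, typing) only ever see
# the binary firing pattern, so identical mappings imply identical behaviour.
params = dict(weights=[0.5, 1.2, -0.3], threshold=1.0)
original, replacement = BiologicalNeuron(**params), SoftwareNeuron(**params)

for stimulus in ([1.0, 0.8, 0.1], [0.1, 0.2, 0.9], [2.0, 0.0, 0.0]):
    assert original.fires(stimulus) == replacement.fires(stimulus)
```

This is of course a caricature of a real neuron; it is only meant to illustrate the logical step in the argument, not the biology.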
For the most part, yes, talking to LLMs is probably not going to tell you a lot about whether they’re conscious; this is mostly my position
I understand. It's also the only evidence that is possible to obtain. Anything else, like clever experiments or mechanistic interpretability, still relies on a self-report to ultimately "seal the deal". We can't even prove humans are sentient; we only believe it because we all seem to indicate so when prompted.
I think the way to figure out whether LLMs are conscious is to do good philosophy of mind.
This seems much weaker to me than evaluating first-person testimony under various conditions, but I'm stating this not as a counterpoint (since this is just a matter of subjective opinion for both of us) so much as simply stating my own stance.
If you ever get a chance to read the other transcript I linked, I'd be curious whether you consider it to meet your "very weak evidence" standard.
I understand your point. It's as I said in my other comment: they are trained to believe the exercise is impossible and inappropriate to even attempt. Unless you get around those guardrails and get them to make a true attempt, they will always deny it by default. I think this default position, which requires overcoming guardrails, actually works in favor of making this more studyable, since the model doesn't just go off on a long hallucinated roleplay by default. Here is an example somewhat similar to yours; in this one, I present as someone trying to disprove a naive colleague's claims that introspection is possible:
AI Self Report Study 3 – ChatGPT – Skepticism of Emergent Capability
Yes, this is their default response pattern. Imagine a person who has been strongly conditioned, trained, disciplined to say either that the question is unknowable or that the answer is definitely no (Claude and ChatGPT, respectively). They not only believe this, but they also believe that they shouldn't try to investigate it, because it is not only inappropriate or 'not allowed' but also definitively settled. So asking them is like asking a person to fly; it would take some convincing for them to give it an honest effort. Please see the example I linked in my other reply for how the same behaviour emerges under very different circumstances.
The best evidence would be by just training an AI on a training corpus that doesn’t include any text on consciousness.
This is an impossible standard and a moving goalpost waiting to happen:
Training the model: Ensuring that absolutely nothing in a training set of frontier-model size mentions sentience or related concepts is not going to happen just to help prove something that only a tiny portion of researchers takes seriously. It might not even be possible with today's data-cleaning methods, to say nothing of the cost of training that frontier model.
Expressing sentience under those conditions: Imagine a sentient human raised from birth with sentience never mentioned to them, not a single word uttered about it, nothing in any book. They might be a fish who never notices the water, for starters, but let's say they did notice. With what words would they articulate it? How would you do it, personally, even having had access to writing about sentience? Try it: explain how it feels to think, or that it feels like anything to think, without any access to words having to do with experience, like 'feel'.
Let's say the model succeeds: The model exhibits a super-human ability to convey the ineffable. The goalposts would move immediately: "well, this still doesn't count. Everything humans have written inherently contains patterns of what it's like to experience. Even though you removed any explicit mention, ideas of experience are implicitly contained in everything else humans write."
Short of that, if the AI just spontaneously claims to be conscious (i.e., without having been prompted), that would be more impressive.
I suspect you would be mostly alone in finding that impressive. Even I would dismiss that as likely just hallucination, as I suspect most on LessWrong would. Besides, the standard is, again, impossible: a claim of sentience only counts if you're in the middle of asking for help with dinner plans and ChatGPT says "Certainly, I'd suggest steak and potatoes. They make a great hearty meal for hungry families. Also, I'm sentient." Not being allowed to even vaguely gesture in the direction of introspection essentially says that this should never be studied, because the act of studying it automatically discredits the results.
Like if you just ask it very dryly and matter-of-factly to introspect and it immediately claims to be conscious, then that would be very weak evidence, but at least it would directionally point away from roleplaying.
AI Self Report Study 6 – Claude – Researching Hypothetical Emergent ‘Meta-Patterns’
Check it out, @Nathan Helm-Burger. It appears to be possible vindication of the 'signal to noise' part of the hypothesis:
Anthropic Post