I’m an independent researcher currently working on a sequence of posts about consciousness. You can send me anonymous feedback here: https://www.admonymous.co/rafaelharth. If it’s about a post, you can add [q] or [nq] at the end if you want me to quote or not quote it in the comment section.
I didn’t say that you said that this is experience of consciousness. I was and am saying that your post is attacking a strawman and that your post provides no evidence against the reasonable version of the claim you’re attacking. In fact, I think it provides weak evidence for the reasonable version.
I don’t see how it could be claimed Claude thought this was a roleplay, especially with the final “existential stakes” section.
You’re calling the AI a friend and making it abundantly clear by your tone that you take AI consciousness extremely seriously and expect that it has it. If you keep doing this, then yeah, it’s going to roleplay back claiming to be conscious eventually. This is exactly what I would have expected it to do. The roleplay hypothesis is knocking it out of the park on this transcript.
The dominant philosophical stance among naturalists and rationalists is some form of computational functionalism—the view that mental states, including consciousness, are fundamentally about what a system does rather than what it’s made of. Under this view, consciousness emerges from the functional organization of a system, not from any special physical substance or property.
A lot of people say this, but I’m pretty confident that it’s false. In Why it’s so hard to talk about Consciousness, I wrote this on functionalism (… where camp #1 and #2 roughly correspond to being illusionists vs. realists on consciousness; that’s the short explanation; the longer one is, well, in the post! …):
Functionalist can mean “I am a Camp #2 person and additionally believe that a functional description (whatever that means exactly) is sufficient to determine any system’s consciousness” or “I am a Camp #1 person who takes it as reasonable enough to describe consciousness as a functional property”. I would nominate this as the most problematic term since it is almost always assumed to have a single meaning while actually describing two mutually incompatible sets of beliefs.[3] I recommend saying “realist functionalism” if you’re in Camp #2, and just not using the term if you’re in Camp #1.
As far as I can tell, the majority view on LW (though not by much; I’d guess just above 50%) is just Camp #1/illusionism. Now, these people sometimes describe their view as functionalism, which makes it very understandable that you’ve reached that conclusion.[1] But this type of functionalism is completely different from the type that you are writing about in this article. They are mutually incompatible views with entirely different moral implications.
Camp #2 style functionalism is not a fringe view on LW, but it’s not a majority. If I had to guess, just pulling a number out of my hat here, perhaps a quarter of people here believe this.
The main alternative to functionalism in naturalistic frameworks is biological essentialism—the view that consciousness requires biological implementation. This position faces serious challenges from a rationalist perspective:
Again, it’s understandable that you think this, and you’re not the first. But this is really not the case. The main alternative to functionalism is illusionism (which, like I said, is probably a small majority view on LW, but in any case hovers close to 50%). But even if we ignore that and only talk about realist people, biological essentialism wouldn’t be the next most popular view. I doubt that even 5% of people on the platform believe anything like this.
There are reasons to reject AI consciousness other than saying that biology is special. My go-to example here is always Integrated Information Theory (IIT) because it’s still the most popular realist theory in the literature. IIT doesn’t have anything about biological essentialism in its formalism—it’s in fact a functionalist theory (at least with how I define the term)—and yet it implies that digital computers aren’t conscious (roughly because the physical cause-effect structure of a conventional computer has very little integrated information, regardless of the software it runs). IIT is also highly unpopular on LW, and I personally agree that it’s completely wrong, but it nonetheless makes the point that biological essentialism is not required to reject digital-computer consciousness. In fact, rejecting functionalism is not required for rejecting digital-computer consciousness.
This is completely unscientific and just based on my gut, so don’t take it too seriously, but here would be my honest off-the-cuff attempt at drawing a Venn diagram of the opinion spread on LessWrong, with the size of the circles representing the proportion of views:
[1] Relatedly, EuanMcLean just wrote this sequence against functionalism assuming that this was what everyone believed, only to realize halfway through that the majority view is actually something else.
The “people-pleasing” hypothesis suggests that self-reports of experience arise from expectation-affirming or preference-aligned output. The model is just telling the human what they “want to hear”.
I suppose if we take this hypothesis literally, this experiment could be considered evidence against it. But the literal hypothesis was never reasonable. LLMs don’t just tell people what they want to hear. A simple example to demonstrate this is the coin-swallowing question I come back to below: the model won’t give the reassuring answer the user is clearly fishing for.
The reasonable version of the people-pleasing hypothesis (which is also the only one I’ve seen defended, fwiw) is that Claude is just playing a character. I don’t think you’ve accumulated any evidence against this. On the contrary:
A Pattern of Stating Impossibility of an Attempt to Check [...]
If Claude were actually introspecting, one way or the other, then claiming that it doesn’t know doesn’t make any sense, especially if, upon pressuring it to introspect more, it then changes its mind. If you think that you can get any evidence about consciousness vs. character-playing from talking to it, then surely this has to count as evidence for the character-playing hypothesis.
Deepseek gets 2⁄10.
I’m pretty shocked by this result. Less because of the 2⁄10 number itself than because of the specific one it solved. My P(LLMs can scale to AGI) increased significantly, although not to 50%.
I think all copies that exist will claim to be the original, regardless of how many copies there are and regardless of whether they are the original. So I don’t think this experiment tells you anything, even if it were run.
[...] Quotations who favor something like IIT [...]
The quotation author in the example I’ve made up does not favor IIT. In general, I think IIT represents a very small fraction (< 5%, possibly < 1%) of Camp #2. It’s the most popular theory, but Camp #2 is extremely heterogeneous in their ideas, so this is not a high bar.
Certainly if you look at philosophers you won’t find any connection to IIT since the majority of them lived before IIT was developed.
Your framing comes across as an attempt to decrement the credibility of people who advocate Quotation-type intuition by associating them with IIT,
If you can point to which part of the post made it sound like that, I’d be interested in correcting it because that was very much not intended.
Is the X window server “low level” or “high level”?
Clarification: The high-level vs. low-level thing is a frame to apply to natural phenomena to figure out how far removed from the laws of physics they are and, consequently, whether you should look for equations or heuristics to describe them. The most low-level entities are electrons, up quarks, electromagnetism, etc. (I also call those ‘fundamental’). The next most low-level things are protons or neutrons (made up of fundamental particles). Molecules are very low level. Processes between or within atoms are very low level. Planetary motions are pretty low level.
Answer: The X window server is an output of human brains, so it’s super super high level. It takes a lot of steps to get from the laws of physics to human organisms writing code. Programming language is irrelevant. Any writing done by humans, natural language or programming language, is super high level.
Thanks for this description. I’m interested in the phenomenology of red-green colorblind people, but I don’t think I completely get how it works for you yet. Questions I have:
Do red and green, when you recognize them correctly, seem like subjectively very different colors?
If the answer is yes, if you’re shown one of the colors without context (e.g., in a lab setting), does it look red or green? (If the answer is no, I suppose this question doesn’t make sense.)
If you see two colors next to each other, then (if I understood you correctly) you can tell whether they’re (1) one green and one red, or (2) the same color twice. How can you tell?
I’m quite uncertain whether Kat’s posts are a net good or net bad. But on a meta level, I’m strongly in favor of this type of post existing (meaning this one here, not Kat’s posts). Trends that change the vibe or typical content of a platform are a big deal and absolutely worth discussing. And if a person is a major contributor to such a change, imo that makes her a valid target of criticism.
I don’t think so. According to Many Worlds, all weights exist, so there’s no uncertainty in the territory—and I don’t think there’s a good reason to doubt Many Worlds.
I dispute the premise. Weights of quantum configurations are not probabilities, they just share some superficial similarities. (They’re modeled with complex numbers!) Iirc Eliezer was very clear about this point in the quantum sequence.
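To make the distinction concrete, here is the standard textbook relationship (ordinary quantum mechanics, not a claim about anything specific in the sequence): configuration weights are complex amplitudes, and a probability only appears once you take an amplitude’s squared magnitude.

$$
|\psi\rangle = \sum_i \alpha_i \,|i\rangle, \qquad \alpha_i \in \mathbb{C}, \qquad P(i) = |\alpha_i|^2, \qquad \sum_i |\alpha_i|^2 = 1.
$$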
(Self-Review.)
I still endorse every claim in this post. The one thing I keep wondering is whether I should have used real examples from discussion threads on LessWrong to illustrate the application of the two camp model, rather than making up a fictional discussion as I did in the post. I think that would probably help, but it would require singling out someone and using them as a negative example, which I don’t want to do. I’m still reading every new post and comment section about consciousness and often link to this post when I see something that looks like miscommunication to me; I think that works reasonably well.
However, I did streamline the second half of the post (I took out the part about modeling the brain as a graph; I don’t think that was necessary to make the point about research) and added a new section about terminology. I think that should make it a little easier to diagnose when the model is relevant in real discussions.
Not that one; I would not be shocked if this market resolves Yes. I don’t have an alternative operationalization on hand; it would have to be about AI doing serious intellectual work on real problems without any human input. (My model permits AI to be very useful in assisting humans.)
Gotcha. I’m happy to offer 600 of my reputation points vs. 200 of yours on your description of 2026-2028 not panning out. (In general if it becomes obvious[1] that we’re racing toward ASI in the next few years, then people should probably not take me seriously anymore.)
[1] Well, so obvious that I agree, anyway; apparently it’s already obvious to some people.
I feel like a bet is fundamentally unfair here because, in the cases where I’m wrong, there’s a high chance that I’ll be dead anyway and won’t have to pay. The combination of long timelines but high P(doom|AGI soon) means I’m not really risking my reputation/money in the way I’m supposed to with a bet. Are you optimistic about alignment, or does this asymmetry not bother you for other reasons? (And I don’t have the money to make a big bet regardless.)
Just regular o1; I have the $20/month subscription, not the $200/month one.
You could call them logic puzzles. I do think most smart people on LW would get 10⁄10 without too many problems, if they had enough time, although I’ve never tested this.
About two years ago I made a set of 10 problems that imo measure progress toward AGI and decided I’d freak out if/when LLMs solve them. They’re still at 1⁄10, nothing has changed in the past year, and I doubt o3 will do better. (But I’m not making them public.)
Will write a reply to this comment when I can test it.
Because if you don’t like it you can always kill yourself and be in the same spot as the non-survival case anyway.
Not to get too morbid here, but I don’t think this is a good argument. People tend not to commit suicide even if they have strongly net-negative lives.
My probably contrarian take is that I don’t think improvement on a benchmark of math problems is particularly scary or relevant. It’s not nothing—I’d prefer if it didn’t improve at all—but it only makes me slightly more worried.
This isn’t too important to figure out, but if you’ve heard it on LessWrong, my guess would be that whoever said it was just articulating the roleplay hypothesis and did so non-rigorously. The literal claim is absurd, as the coin-swallow example shows.
I feel like this is a pretty common type of misunderstanding: people believe X, someone who doesn’t like X takes a quote from someone who believes X, but because people are frequently imprecise, the quote actually claims X′, and so the person makes an argument against X′, even though X′ is a position almost no one holds.
If you’ve just picked it up anywhere on the internet, then yeah, I’m sure some people just say “the AI tells you what you want to hear” and genuinely believe it. But I would be surprised if you could find me one person on LW who believes this on reflection, and again, you can falsify it much more easily with the coin-swallowing question.
No. Explicit requests for honesty and non-roleplaying are not evidence against “I’m in a context where I’m role-playing an AI character”.
LLMs are trained by predicting the next token for a large corpus of text. This includes fiction about AI consciousness. So you have to ask yourself: how much does this conversation pattern-match that kind of fiction? Right now, the answer is “a lot”. If you add “don’t roleplay, be honest!”, the answer is still “a lot”.
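(For anyone unfamiliar with the setup, here is a minimal sketch of what “predicting the next token” means as a training objective. This is illustrative PyTorch assuming a hypothetical `model` that maps token ids to next-token logits; it is not the training code of any actual LLM.)

```python
# Minimal sketch of the next-token-prediction objective (illustrative only).
import torch
import torch.nn.functional as F

def next_token_loss(model: torch.nn.Module, tokens: torch.Tensor) -> torch.Tensor:
    """tokens: (batch, seq_len) token ids drawn from the training corpus."""
    inputs, targets = tokens[:, :-1], tokens[:, 1:]   # each position's target is the following token
    logits = model(inputs)                            # assumed shape: (batch, seq_len - 1, vocab_size)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),          # flatten batch and time dimensions
        targets.reshape(-1),
    )
```

The point relevant to the argument is just that whatever the corpus contains, including fiction about conscious AIs, shapes which continuations the model finds likely.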
… this is obviously a false dilemma. Come on.
Also no. The way claims of sentience would be impressive is if they came from a context that doesn’t pattern-match situations where the AI would be inclined to roleplay. The best evidence would come from training an AI on a corpus that doesn’t include any text about consciousness. If you did that and the AI claimed to be conscious, that would be very strong evidence, imo. Short of that, if the AI just spontaneously claims to be conscious (i.e., without having been prompted), that would be more impressive. (Although not conclusive; while I don’t have any examples of this, I bet it has happened in the days before RLHF, like on AI Dungeon, although probably very rarely.) Short of that, i.e., if we’re only looking at claims after you’ve asked it to introspect, it would be more impressive if your tone were less dramatic. Like, if you just ask it very dryly and matter-of-factly to introspect and it immediately claims to be conscious, then that would be very weak evidence, but at least it would directionally point away from roleplaying.