I’m an independent researcher currently working on a sequence of posts about consciousness. You can send me anonymous feedback here: https://www.admonymous.co/rafaelharth. If it’s about a post, you can add [q] or [nq] at the end if you want me to quote or not quote it in the comment section.
Rafael Harth
If thought assessment is as hard as thought generation and you need a thought assessor to get AGI (two non-obvious conditionals), then how do you estimate the time to develop a thought assessor? From which point on do you start to measure the amount of time it took to come up with the transformer architecture?
The snappy answer would be “1956 because that’s when AI started; it took 61 years to invent the transformer architecture that led to thought generation, so the equivalent insight for thought assessment will take about 61 years”. I don’t think that’s the correct answer, but neither is “2019 because that’s when AI first kinda resembled AGI”.
I generally think that [autonomous actions due to misalignment] and [human misuse] are distinct categories with pretty different properties. The part you quoted addresses the former (as does most of the post). I agree that there are scenarios where the second is feasible and the first isn’t. I think you could sort of argue that this falls under AIs enhancing human intelligence.
So, I agree that there has been substantial progress in the past year, hence the post title. But I think if you naively extrapolate that rate of progress, you get around 15 years.
The problem with the three examples you’ve mentioned is again that they’re all comparing human cognitive work across a short amount of time with AI performance. I think the relevant scale doesn’t go from 5th grade performance over 8th grade performance to university-level performance or whatever, but from “what a smart human can do in 5 minutes” over “what a smart human can do in an hour” over “what a smart human can do in a day”, and so on.
I don’t know if there is an existing benchmark that measures anything like this. (I agree that more concrete examples would improve the post, fwiw.)
And then a separate problem is that math problems are in the easiest category from §3.1 (as are essentially all benchmarks).
≤10-year Timelines Remain Unlikely Despite DeepSeek and o3
I don’t think the experience of no-self contradicts any of the above.
In general, I think you could probably make some factual statements about the nature of consciousness that are true and that you learn from attaining no-self, if you phrased them very carefully, but I don’t think that’s the point.
The way I’d phrase what happens would be mostly in terms of attachment. You don’t feel as implicated by things that affect you anymore, you have less anxiety, that kind of thing. I think a really good analogy is just that regular consciousness starts to resemble consciousness during a flow state.
I would have been shocked if twin sisters cared equally about nieces and kids. Genetic similarity is one factor, not the entire story.
I think this is true but also that “most people’s reasons for believing X are vibes-based” is true for almost any X that is not trivially verifiable. And also that this way of forming beliefs works reasonably well in many cases. This doesn’t contradict anything you’re saying but feels worth adding, like I don’t think AI timelines are an unusual topic in that regard.
Tricky to answer actually.
I can say more about my model now. The way I’d put it (h/t Steven Byrnes) is that there are three interesting classes of capabilities:
A: sequential reasoning of any kind
B: sequential reasoning on topics where steps aren’t easily verifiable
C: the type of thing Steven mentions here, like coming up with new abstractions/concepts to integrate into your vocabulary to better think about something
Among these, obviously B is a subset of A. And while it’s not obvious, I think C is probably best viewed as a subset of B. Regardless, I think all three are required for what I’d call AGI. (This is also how I’d justify the claim that no current LLM is AGI.) Maybe C isn’t strictly required, I could imagine a mind getting superhuman performance without it, but I think given how LLMs work otherwise, it’s not happening.
Up until DeepSeek, I would have also said LLMs are terrible at A. (This is probably a hot take, but I genuinely think it’s true despite benchmark performances continuing to go up.) My tasks were designed to test A, with the hypothesis that LLMs will suck at A indefinitely. For a while, it seemed like people weren’t even focusing on A, which is why I didn’t want to talk about it. But this concern is no longer applicable; the new models are clearly focused on improving sequential reasoning. However, o1 was terrible at it (imo), almost no improvement from GPT-4 proper, so I actually found o1 reassuring.
This has now mostly been falsified with DeepSeek and o3. (I know the numbers don’t really tell the story since it just went from 1 to 2, but like, including which stuff they solved and how they argue, DeepSeek was where I went “oh shit, they can actually do legit sequential reasoning now”.) Now I’m expecting most of the other tasks to fall as well, so I won’t do similar updates if it goes to 5⁄10 or 8⁄10. The hypothesis “A is an insurmountable obstacle” can only be falsified once.
That said, it still matters how fast they improve. How much it matters depends on whether you think better performance on A is progress toward B/C. I’m still not sure about this, I’m changing my views a lot right now. So idk. If they score 10⁄10 in the next year, my p(LLMs scale to AGI) will definitely go above 50%, probably if they do it in 3 years as well, but that’s about the only thing I’m sure about.
o3-mini-high gets 3⁄10; this is essentially the same as DeepSeek (there were two where DeepSeek came very close, this is one of them). I’m still slightly more impressed with DeepSeek despite the result, but it’s very close.
Just chiming in to say that I’m also interested in the correlation between camps and meditation. Especially from people who claim to have experienced the jhanas.
I suspect you would be mostly alone in finding that impressive
(I would not find that impressive; I said “more impressive”, as in, going from extremely weak to quite weak evidence. Like I said, I suspect this actually happened with non-RLHF-LLMs, occasionally.)
Other than that, I don’t really disagree with anything here. I’d push back on the first one a little, but that’s probably not worth getting into. For the most part, yes, talking to LLMs is probably not going to tell you a lot about whether they’re conscious; this is mostly my position. I think the way to figure out whether LLMs are conscious (& whether this is even a coherent question) is to do good philosophy of mind.
This sequence was pretty good. I do not endorse its conclusions, but I would promote it as an example of a series of essays that makes progress on the question… if mostly because it doesn’t have a lot of competition, imho.
Again, genuine question. I’ve often heard that IIT implies digital computers are not conscious because a feedforward network necessarily has zero phi (there’s no integration of information because the weights are not being updated.) Question is, isn’t this only true during inference (i.e. when we’re talking to the model?) During its training the model would be integrating a large amount of information to update its weights so would have a large phi.
(responding to this one first because it’s easier to answer)
You’re right on with feed-forward networks having zero Φ, but this is actually not the reason why Von Neumann[1] computers can’t be conscious under IIT. The reason, as given by Tononi himself, is that [...]
Of course, the physical computer that is running the simulation is just as real as the brain. However, according to the principles of IIT, one should analyse its real physical components—identify elements, say transistors, define their cause–effect repertoires, find concepts, complexes and determine the spatio-temporal scale at which Φ reaches a maximum. In that case, we suspect that the computer would likely not form a large complex of high Φ, but break down into many mini-complexes of low Φmax. This is due to the small fan-in and fan-out of digital circuitry (figure 5c), which is likely to yield maximum cause–effect power at the fast temporal scale of the computer clock.
So in other words, the brain has many different, concurrently active elements—the neurons—so the analysis based on IIT gives this rich computational graph where they are all working together. The same would presumably be true for a computer with neuromorphic hardware, even if it’s digital. But in the Von Neumann architecture, there are these few physical components that handle all these logically separate things in rapid succession.
Another potentially relevant lens is that, in the Von Neumann architecture, in some sense the only “active” components are the computer clocks, whereas even the CPUs and GPUs are ultimately just “passive” components that process input signals. Like the CPU gets fed the 1-0-1-0-1 clock signal plus the signals representing processor instructions and the signals representing data, and then processes them. I think that would be another point that one could care about even under a functionalist lens.
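For concreteness, here’s a toy sketch of the zero-Φ point for feed-forward networks (my own illustration, not a real Φ calculator; `is_feedforward` and `phi_from_topology` are made-up helpers). The idea is just that an acyclic causal graph can always be cut so that nothing flows backward across the cut, which makes the system reducible and forces Φ = 0 under IIT; with recurrence, topology alone doesn’t settle the question.

```python
from collections import defaultdict, deque

def is_feedforward(nodes, edges):
    """True iff the directed graph (nodes, edges) has no cycles, i.e. is purely feed-forward."""
    indegree = {n: 0 for n in nodes}
    successors = defaultdict(list)
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1
    frontier = deque(n for n in nodes if indegree[n] == 0)
    visited = 0
    while frontier:
        n = frontier.popleft()
        visited += 1
        for m in successors[n]:
            indegree[m] -= 1
            if indegree[m] == 0:
                frontier.append(m)
    return visited == len(nodes)  # every node fits into a topological order => no feedback loops

def phi_from_topology(nodes, edges):
    """Crude shortcut: a feed-forward system is reducible (some cut severs no backward
    causal links), so Phi = 0 under IIT. With recurrence, topology alone doesn't decide."""
    return 0.0 if is_feedforward(nodes, edges) else None

# A 2-2-1 feed-forward net (inputs -> hidden -> output): no recurrence, so Phi = 0.
nodes = ["x1", "x2", "h1", "h2", "y"]
edges = [("x1", "h1"), ("x1", "h2"), ("x2", "h1"),
         ("x2", "h2"), ("h1", "y"), ("h2", "y")]
print(phi_from_topology(nodes, edges))  # -> 0.0
```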
Genuinely curious here, what are the moral implications of Camp #1/illusionism for AI systems?
I think there is no consensus on this question. One position I’ve seen articulated is essentially “consciousness is not a crisp category but it’s the source of value anyway”
I think consciousness will end up looking something like ‘piston steam engine’, if we’d evolved to have a lot of terminal values related to the state of piston-steam-engine-ish things.
Piston steam engines aren’t a 100% crisp natural kind; there are other machines that are pretty similar to them; there are many different ways to build a piston steam engine; and, sure, in a world where our core evolved values were tied up with piston steam engines, it could shake out that we care at least a little about certain states of thermostats, rocks, hang gliders, trombones, and any number of other random things as a result of very distant analogical resemblances to piston steam engines.
But it’s still the case that a piston steam engine is a relatively specific (albeit not atomically or logically precise) machine; and it requires a bunch of parts to work in specific ways; and there isn’t an unbroken continuum from ‘rock’ to ‘piston steam engine’, rather there are sharp (though not atomically sharp) jumps when you get to thresholds that make the machine work at all.
Another position I’ve seen is “value is actually about something other than consciousness”. Dennett also says this, but I’ve seen it on LessWrong as well (several times iirc, but don’t remember any specific one).
And a third position I’ve seen articulated once is “consciousness is the source of all value, but since it doesn’t exist, that means there is no value (although I’m still going to live as though there is)”. (A prominent LW person articulated this view to me but it was in PMs and idk if they’d be cool with making it public, so I won’t say who it was.)
Shouldn’t have said “digital computers” earlier actually, my bad.
Fwiw, here’s what I got by asking in a non-dramatic way. Claude gives the same weird “I don’t know” answer and GPT-4o just says no. Seems pretty clear that these are just what RLHF taught them to do.
which is a claim I’ve seen made in the exact way I’m countering in this post.
This isn’t too important to figure out, but if you’ve heard it on LessWrong, my guess would be that whoever said it was just articulating the roleplay hypothesis and did so non-rigorously. The literal claim is absurd, as the coin-swallow example shows.
I feel like this is a pretty common type of misunderstanding where people believe X, someone who doesn’t like X takes a quote from someone that believes X, but because people are frequently imprecise, the quote actually claims Y, and so the person makes an argument against Y, but Y is a position almost no one holds.
If you’ve just picked it up anywhere on the internet, then yeah, I’m sure some people just say “the AI tells you what you want to hear” and genuinely believe it. But like, I would be surprised if you find me one person on LW who believes this under reflection, and again, you can falsify it much more easily with the coin swallowing question.
I’m curious—if the transcript had frequent reminders that I did not want roleplay under any circumstances would that change anything
No. Explicit requests for honesty and non-roleplaying are not evidence against “I’m in a context where I’m role-playing an AI character”.
LLMs are trained by predicting the next token for a large corpus of text. This includes fiction about AI consciousness. So you have to ask yourself, how much am I pattern-matching that kind of fiction. Right now, the answer is “a lot”. If you add “don’t roleplay, be honest!” then the answer is still “a lot”.
or is the conclusion ‘if the model claims sentience, the only explanation is roleplay, even if the human made it clear they wanted to avoid it’?
… this is obviously a false dilemma. Come on.
Also no. The way claims of sentience would be impressive is if you don’t pattern-match to contexts where the AI would be inclined to roleplay. The best evidence would be by just training an AI on a training corpus that doesn’t include any text on consciousness. If you did that and the AI claims to be conscious, then that would be very strong evidence, imo. Short of that, if the AI just spontaneously claims to be conscious (i.e., without having been prompted), that would be more impressive. (Although not conclusive, while I don’t have any examples of this, I bet this has happened in the days before RLHF, like on AI dungeon, although probably very rarely.) Short of that, so if we’re only looking at claims after you’ve asked it to introspect, it would be more impressive if your tone was less dramatic. Like if you just ask it very dryly and matter-of-factly to introspect and it immediately claims to be conscious, then that would be very weak evidence, but at least it would directionally point away from roleplaying.
I didn’t say that you said that this is experience of consciousness. I was and am saying that your post is attacking a strawman and that your post provides no evidence against the reasonable version of the claim you’re attacking. In fact, I think it provides weak evidence for the reasonable version.
I don’t see how it could be claimed Claude thought this was a roleplay, especially with the final “existential stakes” section.
You’re calling the AI “friend” and making it abundantly clear by your tone that you take AI consciousness extremely seriously and expect that it has it. If you keep doing this, then yeah, it’s going to roleplay back claiming to be conscious eventually. This is exactly what I would have expected it to do. The roleplay hypothesis is knocking it out of the park on this transcript.
The dominant philosophical stance among naturalists and rationalists is some form of computational functionalism—the view that mental states, including consciousness, are fundamentally about what a system does rather than what it’s made of. Under this view, consciousness emerges from the functional organization of a system, not from any special physical substance or property.
A lot of people say this, but I’m pretty confident that it’s false. In Why it’s so hard to talk about Consciousness, I wrote this on functionalism (… where Camp #1 and #2 roughly correspond to being illusionists vs. realists on consciousness; that’s the short explanation, the longer one is, well, in the post! …):
Functionalist can mean “I am a Camp #2 person and additionally believe that a functional description (whatever that means exactly) is sufficient to determine any system’s consciousness” or “I am a Camp #1 person who takes it as reasonable enough to describe consciousness as a functional property”. I would nominate this as the most problematic term since it is almost always assumed to have a single meaning while actually describing two mutually incompatible sets of beliefs.[3] I recommend saying “realist functionalism” if you’re in Camp #2, and just not using the term if you’re in Camp #1.
As far as I can tell, the majority view on LW (though not by much, but I’d guess it’s above 50%) is just Camp #1/illusionism. Now these people sometimes describe their view as functionalism, which makes it very understandable why you’ve reached that conclusion.[1] But this type of functionalism is completely different from the type that you are writing about in this article. They are mutually incompatible views with entirely different moral implications.
Camp #2 style functionalism is not a fringe view on LW, but it’s not a majority. If I had to guess, just pulling a number out of my hat here, perhaps a quarter of people here believe this.
The main alternative to functionalism in naturalistic frameworks is biological essentialism—the view that consciousness requires biological implementation. This position faces serious challenges from a rationalist perspective:
Again, it’s understandable that you think this, and you’re not the first. But this is really not the case. The main alternative to functionalism is illusionism (which like I said, is probably a small majority view on LW, but in any case hovers close to 50%). But even if we ignore that and only talk about realist people, biological essentialism wouldn’t be the next most popular view. I doubt that even 5% of people on the platform believe anything like this.
There are reasons to reject AI consciousness other than saying that biology is special. My go-to example here is always Integrated Information Theory (IIT) because it’s still the most popular realist theory in the literature. IIT doesn’t have anything about biological essentialism in its formalism, it’s in fact a functionalist theory (at least with how I define the term), and yet it implies that digital computers aren’t conscious. IIT is also highly unpopular on LW and I personally agree that it’s completely wrong, but it nonetheless makes the point that biological essentialism is not required to reject digital-computer consciousness. In fact, rejecting functionalism is not required for rejecting digital-computer consciousness.
This is completely unscientific and just based on my gut so don’t take it too seriously, but here would be my honest off-the-cuff attempt at drawing a Venn diagram of the opinion spread on LessWrong with size of circles representing proportion of views
Relatedly, EuanMcLean just wrote this sequence against functionalism assuming that this was what everyone believed, only to realize halfway through that the majority view is actually something else.
The “people-pleasing” hypothesis suggests that self-reports of experience arise from expectation-affirming or preference-aligned output. The model is just telling the human what they “want to hear”.
I suppose if we take this hypothesis literally, this experiment could be considered evidence against it. But the literal hypothesis was never reasonable. LLMs don’t just tell people what they want to hear. Here’s a simple example to demonstrate this:
The reasonable version of the people-pleasing hypothesis (which is also the only one I’ve seen defended, fwiw) is that Claude is just playing a character. I don’t think you’ve accumulated any evidence against this. On the contrary:
A Pattern of Stating Impossibility of an Attempt to Check [...]
If Claude were actually introspecting, one way or the other, then claiming that it doesn’t know doesn’t make any sense, especially if, upon pressuring it to introspect more, it then changes its mind. If you think that you can get any evidence about consciousness vs. character playing from talking to it, then surely this has to count as evidence for the character-playing hypothesis.
DeepSeek gets 2⁄10.
I’m pretty shocked by this result. Less because of the 2⁄10 number itself than because of the specific one it solved. My P(LLMs can scale to AGI) increased significantly, although not to 50%.
I think all copies that exist will claim to be the original, regardless of how many copies there are and regardless of whether they are the original. So I don’t think this experiment tells you anything, even if it were run.
This is true but I don’t think it really matters for eventual performance. If someone thinks about a problem for a month, the number of times they went wrong on reasoning steps during the process barely influences the eventual output. Maybe they take a little longer. But essentially performance is relatively insensitive to errors if the error-correcting mechanism is reliable.
I think this is actually a reason why most benchmarks are misleading (humans make mistakes there, and they influence the rating).
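To put a toy number on the “maybe they take a little longer” point: assuming each reasoning step goes wrong with probability p but a reliable checker catches every error and the step is simply redone (my own sketch, with made-up parameters), the final answer is unaffected and the whole process just slows down by a factor of 1/(1 − p).

```python
import random

def solve_with_error_correction(num_steps, p_error, rng):
    """Toy model: each step must be completed correctly once; a reliable checker
    catches every wrong attempt, and the step is simply retried until it works.
    Returns the total number of attempts across all steps."""
    attempts = 0
    for _ in range(num_steps):
        while True:
            attempts += 1
            if rng.random() >= p_error:  # this attempt at the step was correct
                break
    return attempts

rng = random.Random(0)
num_steps, p_error = 100, 0.2
runs = [solve_with_error_correction(num_steps, p_error, rng) for _ in range(1000)]
print(sum(runs) / len(runs))       # empirically ~125 attempts for 100 steps
print(num_steps / (1 - p_error))   # analytic expectation: 125.0
```

With p = 0.2 that’s only a 1.25× slowdown, which is the sense in which eventual performance is insensitive to per-step errors as long as the error-correcting mechanism is reliable.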