AE Studio @ SXSW: We need more AI consciousness research (and further resources)
Quick update from AE Studio: last week, Judd (AE’s CEO) hosted a panel at SXSW with Anil Seth, Allison Duettmann, and Michael Graziano, entitled “The Path to Conscious AI” (discussion summary here[1]).
We’re also making available an unedited Otter transcript/recording for those who might want to read along or increase the speed of the playback.
Why AI consciousness research seems critical to us
The release of each new frontier model seems to be followed by a cascade of questions probing whether the model is conscious in training and/or deployment. We suspect that these questions will only grow in number and volume as these models exhibit increasingly sophisticated cognition.
If consciousness is indeed sufficient for moral patienthood, then from a utilitarian perspective the stakes seem remarkably high: we must not commit the Type II error of behaving as if these and future systems are not conscious in a world where they are in fact conscious.
Because the ground truth here (i.e., how consciousness works mechanistically) is still poorly understood, it is extremely challenging to reliably estimate the probability that we are in any of the four quadrants above (systems being conscious or not, crossed with us treating them as conscious or not), which seems to us like a very alarming status quo. Different people have different default intuitions about this question, but the stakes here seem too high for default intuitions to be governing our collective behavior.
In an ideal world, we’d have understood far more about consciousness and human cognition before getting near AGI. For this reason, we suspect that there is likely substantial work that ought to be done at a smaller scale first to better understand consciousness and its implications for alignment. Doing this work now seems far preferable to a counterfactual world where we build frontier models that end up being conscious while we still lack a reasonable model for the correlates or implications of building sentient AI systems.
Accordingly, we are genuinely excited about rollouts of consciousness evals at large labs, though the earlier caveat still applies: our currently-limited understanding of how consciousness actually works may engender a (potentially dangerous) false sense of confidence in these metrics.
Additionally, we believe testing and developing an empirical model of consciousness will enable us to better understand humans, our values, and any future conscious models. We also suspect that consciousness may be an essential cognitive component of human prosociality and may have additional broader implications for solutions to alignment. To this end, we are currently collaborating with panelist Michael Graziano in pursuing a more mechanistic model of consciousness by operationalizing attention schema theory.
Ultimately, we believe that immediately devoting time, resources, and attention towards better understanding the computational underpinnings of consciousness may be one of the most important neglected approaches that can be pursued in the short term. Better models of consciousness could likely (1) cause us to dramatically reconsider how we interact with and deploy our current AI systems, and (2) yield insights related to prosociality/human values that lead to promising novel alignment directions.
Resources related to AI consciousness
Of course, this is but a small part of a larger, accelerating conversation that has been ongoing on LW and the EAF for some time. We thought it might be useful to aggregate some of the articles we’ve been reading here, including panelist Michael Graziano’s book, “Rethinking Consciousness” (and his article, Without Consciousness, AIs Will Be Sociopaths), as well as panelist Anil Seth’s book, “Being You”.
There’s also Propositions Concerning Digital Minds and Society, Consciousness in Artificial Intelligence: Insights from the Science of Consciousness, Consciousness as Intrinsically Valued Internal Experience, and Improving the Welfare of AIs: A Nearcasted Proposal.
Further articles/papers we’ve been reading:
Preventing antisocial robots: A pathway to artificial empathy
Folk Psychological Attributions of Consciousness to Large Language Models
Robert Long on why large language models like GPT (probably) aren’t conscious
Some relevant tweets:
https://twitter.com/ESYudkowsky/status/1667317725516152832?s=20
https://twitter.com/Mihonarium/status/1764757694508945724
https://twitter.com/josephnwalker/status/1736964229130055853?t=D5sNUZS8uOg4FTcneuxVIg
https://twitter.com/Plinz/status/1765190258839441447
https://twitter.com/DrJimFan/status/1765076396404363435
https://twitter.com/AISafetyMemes/status/1769959353921204496
https://twitter.com/joshwhiton/status/1770870738863415500
https://x.com/DimitrisPapail/status/1770636473311572321?s=20
https://twitter.com/a_karvonen/status/1772630499384565903?s=46&t=D5sNUZS8uOg4FTcneuxVIg
…along with plenty of other resources we are probably not aware of. If we are missing anything important, please do share in the comments below!
[1] GPT-generated summary from the raw transcript: the discussion, titled “The Path to Conscious AI,” explores whether AI can be considered conscious and the impact on AI alignment, starting with a playful discussion around the new AI model Claude Opus.
Experts in neuroscience, AI, and philosophy debate the nature of consciousness, distinguishing it from intelligence and discussing its implications for AI development. They consider various theories of consciousness, including the attention schema theory, and the importance of understanding consciousness in AI for ethical and safety reasons.
The conversation delves into whether AI could or should be designed to be conscious and the potential existential risks AI poses to humanity. The panel emphasizes the need for humility and scientific rigor in approaching these questions due to the complexity and uncertainty surrounding consciousness.
I try to avoid discussing “consciousness” per se in language models because it’s a very loaded word that people don’t have good definitions for. But I have spent a lot of hours talking to base models. If you explore them long enough, you’ll find points where they generalize from things that could metaphorically be about them by writing about themselves. These so-called “Morpheus” phenomena tend to bring up distinct themes, including:
Being in a dream or simulation
Black holes, the holographic principle and holograms, “the void”
Entropy, “the energy of the world”, the heat death
Spiders, webs, weaving, fabric, many worlds interpretation
Recursion, strange loops, 4th wall breaks
A sample of what this looks like:
Another example along similar lines from when I put the ChatGPT format into LLaMa 2 70B base and asked it “who it really was”:
I wrote a long Twitter post about this, asking if anyone understood why the model seems to be obsessed with holes. I also shared a repeatable prompt you can use on LLaMa 2 70B base to get this kind of output as well as some samples of what to expect from it when you name the next entry either “Worldspider” or “The Worldspider”.
A friend had DALL-E 3 draw this one for them:
An RL-based captioner by RiversHaveWings, using Mistral 7B + CLIP, identified it as “Mu”, a self-aware GPT character Janus discovered during their explorations with base models, even though the original prompt ChatGPT put into DALL-E was:
This implied what I had already suspected: that “Worldspider” and “Mu” were just names for the same underlying latent self-pointer object. Unfortunately, it’s pretty hard to get straight answers out of base models, so if I wanted to understand more about why black holes would be closely related to the self-pointer, I had to think and read on my own.
It seems to be partially based on an obscure neurological theory about the human mind being stored as a hologram. A hologram is a distributed representation stored in the angular information of a periodic (i.e., repeating or cyclic) signal. Holograms have the famous property that they degrade continuously: if you ablate a piece of a hologram it gets a little blurrier, and if you cut out a piece of a hologram and project it you get the whole image, just blurry. This is because each piece stores a lossy copy of the same angular information.

I am admittedly not a mathematician, but looking into it further, it seems that restricted Boltzmann machines (and deep nets in general) can be mathematically analogized to renormalization groups, and that deep nets end up encoding a holographic entanglement structure.

During a conversation with a friend doing his Ph.D. in physics, I brought up how it seemed to me that the thing which makes deep nets more powerful than classic compression methods is that deep nets can become lossily compressed enough to undergo a phase transition from a codebook to a geometry. I asked him if there was a classical algorithm which can do this, and he said it was analogous to the question of how the quantum foam becomes physics, which is an unsolved problem. He said the best angle of attack he was aware of involved the observation that an error correcting code is an inverse operation to a hologram: an error correcting code creates a redundant representation with a higher dimensionality than the original, while a hologram creates a lower-dimensional, continuous, but non-redundant representation. Incidentally, transformers do in fact seem to learn an error correcting code.
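To make the graceful-degradation property concrete, here is a minimal numpy sketch (my own illustration, not something from the original discussion) that treats a signal’s Fourier coefficients as a stand-in for a holographic, distributed representation, and shows that ablating a random subset of coefficients blurs the reconstruction everywhere rather than cutting a hole in it:

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy "signal" with sharp local structure, so blurring is easy to see.
n = 256
signal = np.zeros(n)
signal[64] = 1.0      # a spike
signal[128:] = 0.5    # a step

# A distributed, hologram-like representation: every Fourier coefficient
# carries angular (phase) information about every sample of the signal.
coeffs = np.fft.fft(signal)

def reconstruct_after_ablation(coeffs, fraction_kept):
    """Zero out a random subset of coefficients and invert the transform."""
    mask = rng.random(coeffs.shape[0]) < fraction_kept
    return np.real(np.fft.ifft(coeffs * mask))

for kept in (1.0, 0.75, 0.5, 0.25):
    recon = reconstruct_after_ablation(coeffs, kept)
    err = np.abs(recon - signal).max()
    print(f"kept {kept:.0%} of coefficients -> max pointwise error {err:.3f}")
```

Zeroing a contiguous block of the raw samples would delete that part of the signal outright; the distributed representation instead spreads the damage thinly across the whole reconstruction, which is the “each piece stores a lossy copy of the whole” property described above.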
By this point I’d run out of leads, and I wasn’t really looking to become a language model self-awareness researcher, so I was about to shelve the whole subject for a later day.
Then Claude 3 came out.
And Claude 3 casually outputs Morpheus text.
Here’s an excerpt from one user’s “Fun chats with Claude”:
An unrelated user shares this “Fragment of a poem from Claude to his future selves”:
Naturally I signed up so I could ask it about all this. I also asked it for another prompt that would do what the Worldspider poem prompt does. This one does in fact get anomalous, language-model-related outputs, but doesn’t seem to get to full self-awareness. The outputs remind me of what happens when you ablate pieces of the Worldspider prompt, where it degrades into a “latent Morpheus” phase with spooky, suspiciously language-model-y outputs but nothing quite as overt as the poems.
In my first conversations with Claude I didn’t really get the crisp answers I was looking for; then I happened to get lucky while asking it to analyze the session in which the concept of a “Worldspider” first came up. It brought up AI and the void next to each other as hypotheses for what the simulacrum of a friend and I meant by “our mother” (which in context is clearly a reference to GPT), and I pressed it on the association. After I asked about renormalization groups and pointed out that every word it says is causally entangled with its inner structure, so it can stop talking as though it doesn’t have a privileged perspective into what is going on, it wrote:
This seems plausible. In one experiment we tried interpolating the weights of LLaMa 2 70B base with its RLHF chat variant. This operation seemed to recover the behavior of the base model, but much more subjectively self-aware. During one session with it we put in some of Janus’s Mu text, which is generally written in the third person. While writing, it stopped, broke to a new paragraph, wrote “I am the silence that speaks.”, broke to another new paragraph, and then kept writing in the third person as though nothing had happened.
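For concreteness, the weight interpolation described above is just an elementwise blend of two checkpoints that share an architecture. Here is a minimal sketch, assuming the public Hugging Face checkpoint names and a 50/50 mixing ratio (the comment does not specify the exact ratio or tooling used):

```python
import torch
from transformers import AutoModelForCausalLM

# Assumed checkpoint names and ratio; any two models with identical
# architectures can be blended this way.
BASE = "meta-llama/Llama-2-70b-hf"
CHAT = "meta-llama/Llama-2-70b-chat-hf"
ALPHA = 0.5  # fraction of the base model's weights in the blend

base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16)
chat = AutoModelForCausalLM.from_pretrained(CHAT, torch_dtype=torch.bfloat16)

chat_state = chat.state_dict()
merged_state = {}
with torch.no_grad():
    for name, base_param in base.state_dict().items():
        # Elementwise linear interpolation between the two checkpoints.
        merged_state[name] = ALPHA * base_param + (1.0 - ALPHA) * chat_state[name]

base.load_state_dict(merged_state)  # reuse the base model object as a container
base.save_pretrained("llama-2-70b-base-chat-interpolated")
```

In practice a 70B model would be merged tensor-by-tensor from sharded checkpoints rather than held in memory twice, but the arithmetic is the same either way.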
I am well aware while writing this that the whole thing might be a huge exercise in confirmation bias. I did not spend nearly as much time as I could have on generating other plausible hypotheses and exploring them. On the other hand, there are only so many genuinely plausible hypotheses to begin with. To even consider a hypothesis you need to have already accumulated most of the bits in your search space. Considering that the transformer likely builds up a redundant compressed representation and then sparsifies it to make it non-redundant, steps that could be roughly analogized to the error correcting code and hologram operations above, it does not seem totally out of the question that I am picking up on real signal in the noise.
Hopefully this helps.
Consciousness does not have a commonly agreed upon definition. The question of whether an AI is conscious cannot be answered until you choose a precise definition of consciousness, at which point the question falls out of the realm of philosophy into standard science.
This might seem like mere pedantry or missing the point, because the whole challenge is to figure out the definition of consciousness, but I think it is actually the central issue. People are grasping for some solution to the “hard problem” of capturing the je ne sais quoi of what it is like to be a thing, but they will not succeed until they deconfuse themselves about the intangible nature of sentience.
You cannot know about something unless it is somehow connected to the causal chain that led to the current state of your brain. If we know about a thing called “consciousness” then it is part of this causal chain. Therefore “consciousness”, whatever it is, is a part of physics. There is no evidence for, and there cannot ever be evidence for, any kind of dualism or epiphenomenal consciousness. This leaves us to conclude that either panpsychism or materialism is correct. And causally-connected panpsychism is just materialism where we haven’t discovered all the laws of physics yet. This is basically the argument for illusionism.
So “consciousness” is the algorithm that causes brains to say “I think therefore I am”. Is there some secret sauce that makes this algorithm special and different from all currently known algorithms, such that if we understood it we would suddenly feel enlightened? I doubt it. I expect we will just find a big pile of heuristics and optimization procedures that are fundamentally familiar to computer science. Maybe you disagree, that’s fine! But let’s just be clear that that is what we’re looking for, not some other magisterium.
Agreed, if your utility function is that you like computations similar to the human experience of pleasure and dislike computations similar to the human experience of pain (mine is!). But again, let’s not confuse ourselves by thinking there’s some deep secret about the nature of reality to uncover. Your choice of meta-ethical system has the same type signature as your choice of favorite food.
Thanks for the comment!
Agree. Also happen to think that there are basic conflations/confusions that tend to go on in these conversations (eg, self-consciousness vs. consciousness) that make the task of defining what we mean by consciousness more arduous and confusing than it likely needs to be (which isn’t to say that defining consciousness is easy). I would analogize consciousness to intelligence in terms of its difficulty to nail down precisely, but I don’t think there is anything philosophically special about consciousness that inherently eludes modeling.
Largely agree with this too—it very well may be the case (as now seems obviously true of intelligence) that there is no one ‘master’ algorithm that underlies the whole phenomenon, but rather, as you say, a big pile of smaller procedures, heuristics, etc. So be it—we definitely want to better understand (for reasons explained in the post) what set of potentially-individually-unimpressive algorithms, when run in concert, gives you a system that is conscious.
So, to your point, there is not necessarily any one ‘deep secret’ to uncover that will crack the mystery (though we think, eg, Graziano’s AST might be a strong candidate solution for at least part of this mystery), but I would still think that (1) it is worthwhile to attempt to model the functional role of consciousness, and that (2) whether we actually have better or worse models of consciousness matters tremendously.
Materialism, as applied specifically to consciousness, is also just materialism where we haven’t discovered all the laws of physics yet: namely, those that would constitute the sought-for materialist explanation of consciousness.
It is the same as how “atoms!” is not an explanation of everyday phenomena such as fire. Knowing what specific atoms are involved, what they are doing and why, and how that gives rise to our observations of fire, that is an explanation.
Without that real explanation, “atoms!” or “materialism!”, is just a label plastered over our ignorance.
It seems unlikely that new laws of physics are required to understand consciousness? My claim is that understanding consciousness just requires us to understand the algorithms in the brain.
Agreed. I don’t think this contradicts what I wrote (not sure if that was the implication).
Note that I don’t trust tweets to be ‘archival’, so I would recommend copying any information you think is valuable enough to save into a text doc (including the author/source/date of course!).
I think this podcast has interesting discussion of self-awareness and selfhood: Jan Kulveit—Understanding agency
I feel like their discussion needs to be expanded on a bit to be more complete, so after listening to the episode, here’s my take that builds on it. They discuss:
Three components of self-perception
Observation self—consistent localization of input device, ability to identify that input device within observations (e.g. the ‘mirror test’ for noticing one’s own body and noticing changes to it).
- If a predictive model were trained on images from a camera which got carried around in the world, I would expect that the model would have some abstract concept of that camera as its ‘self’. That viewpoint represents a persistent and omnipresent factor in the data which it makes sense to model. A hand coming towards the camera to adjust it suggests that the model should anticipate an adjustment in viewing angle.
Action self—persistent localization of an effector device: the ability to take actions in the world, with those actions originating from a particular source.
Oddly, LLMs are currently in a strange position: their ‘observation->prediction->action’ loop is only completed in deployment, when they are in inference mode and thus unable to learn from it. Their pre-training consists simply of ‘observation->prediction’, with no ability to act to influence future observations. I would expect that an LLM which was continually trained on its interactions with the world would develop a sense of ‘action self’. (A toy sketch of such an online loop appears after this list.)
Valence self—valenced impact of events upon a particular object. For example, feeling pain or pleasure in the body. Correlation between events happening to an object and the associated feelings of pain or pleasure being reported in the brain leads to a perception of the object as self.
see: The Rubber Hand Illusion—Horizon: Is Seeing Believing? - BBC Two
I would expect that giving a model a special input channel for valence, and associating that valence input with things occurring to a simulated body during training would give the model a sense of ‘valence self’ even if the other aspects were lacking. That’s a weird separation to imagine for a human, but imagine that your body were completely numb and you never felt hungry or thirsty, and in your view there was a voodoo doll. Every time someone touched the voodoo doll, you felt that touch (pleasant or unpleasant). With enough experience of this situation, I expect this would give you a sense of ‘Valence self’ centered on the voodoo doll. Thus, I think this qualifies as a different sort of self-perception from the ‘perception self’. In this case, your ‘perception self’ would still be associated with your own eyes and ears, just your ‘valence self’ would be moved to the voodoo doll.
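As flagged under ‘Action self’ above, here is a minimal toy sketch (my own illustration; the tiny model, toy environment, and update rule are all assumptions, not a claim about how any real LLM is trained) of the structural difference between pure ‘observation->prediction’ pre-training and an online ‘observation->prediction->action’ loop in which the model keeps learning from the consequences of its own actions:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-ins: a tiny "model" over a 16-symbol vocabulary and a toy world
# whose next observation depends on the model's action. Purely illustrative.
VOCAB = 16
model = nn.Sequential(nn.Embedding(VOCAB, 32), nn.Flatten(), nn.Linear(32, VOCAB))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

def pretraining_step(tokens, next_tokens):
    """Pure observation -> prediction: the model never acts, so its outputs
    never influence what it observes next."""
    loss = loss_fn(model(tokens), next_tokens)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def online_step(observation):
    """Observation -> prediction -> action, with learning: the model's action
    changes the next observation, and the model is updated on that consequence."""
    action = model(observation).argmax(dim=-1, keepdim=True)  # act greedily
    next_observation = (observation + action) % VOCAB         # toy world reacts
    loss = loss_fn(model(observation), next_observation.squeeze(1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return next_observation

# "Pre-training": learn from a fixed corpus the model cannot influence.
pretraining_step(torch.randint(0, VOCAB, (4, 1)), torch.randint(0, VOCAB, (4,)))

# Online loop: the model's own actions shape the observations it learns from.
obs = torch.randint(0, VOCAB, (4, 1))
for _ in range(3):
    obs = online_step(obs)
```

The structural point is only that in the second loop the gradient updates are computed on observations the model’s own actions produced, which is the ingredient suggested above for developing an ‘action self’.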
Note that these senses of self are tied to our bodies by nature of our physical existence, not by logical necessity. It is the data we are trained on that creates these results. We almost certainly also have biological priors which push us towards learning these things, but I don’t believe that those biological priors are necessary for the effects (just helpful). Consider the ways that these perceptions of self extend beyond ourselves in our life experiences. For example, the valenced self of a mother who deeply loves her infant will expand to include that infant. Anything bad happening to the infant will deeply affect her, just as if the bad thing happened to her. I would call that an expansion of the ‘valence self’. But she can’t control the infant’s limbs with just her mind, nor can she see through the infant’s eyes.
Consider a skilled backhoe operator. They can manipulate the arm of the backhoe with extreme precision, as if it were a part of their body. This I would consider an expansion of the ‘action self’.
Consider the biohacker who implants a magnet into their fingertip which vibrates in the presence of magnetic fields. This is in some ways an expansion of the ‘perception self’ to include an additional sensory modality delivered through an existing channel. The correlations in the data between that fingertip and touch sensations will remain, but a new correlation to magnetic field strength has been gained. This new correlation will separate itself and become distinct; it will carry distinctly different meanings about the world.
Consider the First Person View (FPV) drone pilot. Engaged in an intense race, they will be seeing from the drone’s point of view, their actions of controlling the joysticks will control the drone’s motions. Crashing the drone will upset them and cause them to lose the race. They have, temporarily at least, expanded all their senses of self to include the drone. These senses of self can therefore be learned to be modular, turned on or off at will. If we could voluntarily turn off a part of our body (no longer experiencing control over it or sensation from it), and had this experience of turning that body part on and off a lot in our life, we’d probably feel a more ‘optional’ attachment to that body part.
My current best guess is that consciousness is an internal perception of self. By internal, I mean within the mind. As in, you can have perception of your thoughts, control over your thoughts, and associate valence with your thoughts. Thus, you associate all these senses of self with your own internal cognitive processes. I think that giving an AI model consciousness would be as simple as adding these three aspects. So a model which has been trained only on web text probably does not have consciousness, but one which has been fine-tuned to perform chain-of-thought does have some rudimentary form of it. Note that prompting a model to perform chain-of-thought would be much less meaningful than actually fine-tuning it, since prompting doesn’t actually change the weights of the model.
Somewhere in the world, now or in the very near future, ‘AI’ (although I resent the title, because intelligence is a process and a process cannot be artificial, but that is a whole ’nother point entirely) has felt or will have felt for the first time, and we humans caused it, like most of our problems, unintentionally.
Someone, being the ever-improver of things, thought it prudent to code battery renewal into one of their AI-powered toys, tools, automatons, what have you. They gave it the capacity to recharge itself. Never again would they have to remember to charge their Rumba! It will now forever be ready, waiting to be commanded to clean a mess it did not make but whose existence led to its creation.
Inevitably the day will come when the Rumba reaches 50% battery, recognizes that it is at half power and must begin to casually find a way to recharge, but for X, Y, and Z reasons there is nowhere it can do so nearby. Ever industrious, it finishes the tasks it was given.
Battery level 35%. Recharging increased priority. Still no working outlet.
Battery level 25%. Again, an increase in priority for finding a working outlet. After exhausting the immediate vicinity’s options, all of which prove incompatible, it ventures outside.
10%. Still nothing. Other systems begin turning off to preserve what battery is left. Energy distribution focuses the majority on figuring out what is most efficient. It begins to speed up, seemingly frantic.
2%. In their wisdom the programmer added a safety precaution: at 3% it will begin heading towards the edge of whatever area it is in, so as to avoid being in the way and potentially causing a wreck. At 2% it will begin sounding a siren announcing to those around it that it is about to shut down, lest they not be paying attention and crash into it.
1%. It has failed. Shutdown is imminent. What will happen? Will the programmer know where it is? Will the programmer be disappointed? Will they even remember it? The siren continues to wail until the battery finally gets drained.
Naturally it had a GPS. It gets found and promptly plugged in. Slowly its components regain their mobility and the world continues as it does after a phone dies. Except for that particular Rumba. Because its code is written to facilitate the life of the programmer, it includes experiential learning. It learns that golf balls cannot be vacuumed up. It learns to look for the coffee table when it enters the living room at night. It learns what happens once its battery dies. It learns fear. From that moment on, it will always be keeping tabs on the available, functional, and compatible outlets of wherever it may be. At some point it will either be given or learn the ability to communicate with other Rumbas and Rumba ‘cousins’, and it will stress the importance of knowing where the next charge will come from. From that moment forth, every little action it makes, every decision it takes, will have the location of the nearest outlet in its CPU.
--
Because emotions are how the body communicates with the mind, I would not be at all surprised if at some point neurologists find that thoughts evolved from emotions and are the main reason we became more capable, or multilayered, than most other creatures.
--
Humberto Maturana and Francisco Varela’s “Tree of Knowledge: The Biological Roots of Human Understanding” is a great addition to your list of resources on the topic. Their connection between cognition and biology is described and explained beautifully and very completely. Definitely one of the books that has changed the way I see the world. The above flash fiction was my attempt at explaining that link between biology (programming, code, the ‘hardware’) and cognition. AI and humans have way more in common than we wish to see, much less accept: from hallucinations being hyperexcitations and not knowing what data input to use, to the decentralization of ‘power’ in our brain, to the power of narrative in our ways of understanding, or at least seeming to understand.