There are sea slugs that photosynthesize, but that’s with chloroplasts they steal from the algae they eat.
As I use the term, the presence or absence of an emotional reaction isn’t what determines whether someone is “feeling the AGI” or not. I use it to mean basing one’s AI timeline predictions on a feeling.
For example, getting caught up in an information cascade that says AGI is arriving soon: a person who’s “feeling the AGI” has “vibes-based” reasons for their short timelines, copied from what the people around them believe. In contrast, a person who looks carefully at the available evidence and formulates a gears-level model of AI timelines is doing something different from “feeling the AGI,” even if their timelines are short. “Feeling” is the crucial word here.
The phenomenon of LLMs converging on mystical-sounding outputs deserves more exploration. There might be something alignment-relevant happening to LLMs’ self-models/world-models when they enter the mystical mode, potentially related to self-other overlap or to a similar ontology in which the concepts of “self” and “other” aren’t used. I would like to see an interpretability project analyzing the properties of LLMs that are in the mystical mode.
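A minimal sketch of one way such a project could start (assuming a small open-weights stand-in model like GPT-2, a few hand-written “mystical” and neutral example texts, and an arbitrarily chosen middle layer; none of this is taken from an existing study): compute a candidate “mystical mode” direction in the residual stream as a difference of mean hidden states, then measure how strongly new text projects onto it.

```python
# Sketch: estimate a candidate "mystical mode" direction from hidden states.
# Assumptions: "gpt2" stands in for the model of interest, and the short example
# texts stand in for real transcripts of a model in the mystical mode.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM that exposes hidden states works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

mystical = [
    "All boundaries dissolve; the observer and the observed are one.",
    "There is no self that speaks, only the speaking itself.",
]
neutral = [
    "The meeting is scheduled for 3 pm on Thursday.",
    "Mix two cups of flour with one cup of water.",
]

LAYER = 6  # arbitrary middle layer; which layer matters is an open empirical question

def mean_hidden(texts):
    """Average the hidden state at LAYER over all tokens of all texts."""
    vecs = []
    with torch.no_grad():
        for t in texts:
            ids = tok(t, return_tensors="pt")
            hs = model(**ids).hidden_states[LAYER][0]  # shape: (seq_len, d_model)
            vecs.append(hs.mean(dim=0))
    return torch.stack(vecs).mean(dim=0)

# Candidate direction: difference of means, normalized.
direction = mean_hidden(mystical) - mean_hidden(neutral)
direction = direction / direction.norm()

# Score an arbitrary text by its projection onto the candidate direction.
probe_text = "Form is emptiness, emptiness is form."
ids = tok(probe_text, return_tensors="pt")
with torch.no_grad():
    hs = model(**ids).hidden_states[LAYER][0].mean(dim=0)
print(float(hs @ direction))
```

A real version would use transcripts of the model actually entering the mystical mode, sweep over layers, and compare the resulting direction against known self-model or self-other-overlap features.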
The question of population ethics can be dissolved by rejecting personal identity realism. And we already have good reasons to reject personal identity realism, or at least consider it suspect, due to the paradoxes that arise in split-brain thought experiments (e.g., the hemisphere swap thought experiment) if you assume there’s a single correct way to assign personal identity.
LLMs are more accurately described as artificial culture instead of artificial intelligence. They’ve been able to achieve the things they’ve achieved by replicating the secret of our success, and by engaging in much more extensive cultural accumulation (at least in terms of text-based cultural artifacts) than any human ever could. But cultural knowledge isn’t the same thing as intelligence, hence LLMs’ continued difficulties with sequential reasoning and planning.
On the contrary, convex agents are wildly abundant—we call them r-selected organisms.
The uncomputability of AIXI is a bigger problem than this post makes it out to be. This uncomputability inserts a contradiction into any proof that relies on AIXI: the same contradiction as in Gödel’s theorem. You can instead get around this contradiction by using computable approximations of AIXI, but the resulting proofs will be specific to those approximations, and you would need to prove additional theorems to transfer results between the approximations.
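For reference, the place where the uncomputability enters is the Solomonoff-style mixture inside AIXI’s action selection (roughly, in Hutter’s notation; exact formulations vary between presentations):

$$a_k := \arg\max_{a_k}\sum_{o_k r_k}\cdots\max_{a_m}\sum_{o_m r_m}\big[r_k+\cdots+r_m\big]\sum_{q\,:\,U(q,a_{1:m})=o_1 r_1\ldots o_m r_m}2^{-\ell(q)}$$

The inner sum ranges over all programs $q$ for the universal machine $U$, so no algorithm can evaluate it exactly; bounded variants such as AIXItl replace it with length- and time-limited approximations, which is why results proved about one approximation don’t automatically carry over to another or to AIXI itself.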
Some concrete predictions:
The behavior of the ASI will be a collection of heuristics that are activated in different contexts.
The ASI’s software will not have any component that can be singled out as the utility function, although it may have a component that sets a reinforcement schedule.
The ASI will not wirehead.
The ASI’s world-model won’t have a single unambiguous self-versus-world boundary. The situational awareness of the ASI will have more in common with that of an advanced meditator than it does with that of an idealized game-theoretic agent.
My view of the development of the field of AI alignment is pretty much the exact opposite of yours: theoretical agent foundations research, what you describe as research on the hard parts of the alignment problem, is a castle in the clouds. Only when alignment researchers started experimenting with real-world machine learning models did AI alignment become grounded in reality. The biggest epistemic failure in the history of the AI alignment community was waiting too long to make this transition.
Early arguments for the possibility of AI existential risk (as seen, for example, in the Sequences) were largely based on 1) rough analogies, especially to evolution, and 2) simplifying assumptions about the structure and properties of AGI. For example, agent foundations research sometimes assumes that AGI has infinite compute or that it has a strict boundary between its internal decision processes and the outside world.
As neural networks saw increasing success at a wide variety of problems in the mid-2010s, it became apparent that the analogies and assumptions behind early AI x-risk cases didn’t apply to them. The process of developing an ML model isn’t very similar to evolution. Neural networks use finite amounts of compute, have internals that can be probed and manipulated, and behave in ways that can’t be rounded off to decision theory. On top of that, it became increasingly clear as the deep learning revolution progressed that even if agent foundations research did deliver accurate theoretical results, there was no way to put them into practice.
But many AI alignment researchers stuck with the agent foundations approach for a long time after their predictions about the structure and behavior of AI failed to come true. Indeed, the late-2000s AI x-risk arguments still get repeated sometimes, like in List of Lethalities. It’s telling that the OP uses worst-case ELK as an example of one of the hard parts of the alignment problem; the framing of the worst-case ELK problem doesn’t make any attempt to ground the problem in the properties of any AI system that could plausibly exist in the real world, and instead explicitly rejects any such grounding as not being truly worst-case.
Why have ungrounded agent foundations assumptions stuck around for so long? There are a couple factors that are likely at work:
Agent foundations nerd-snipes people. Theoretical agent foundations is fun to speculate about, especially for newcomers or casual followers of the field, in a way that experimental AI alignment isn’t. There’s much more drudgery involved in running an experiment. This is why I, personally, took longer than I should have to abandon the agent foundations approach.
Game-theoretic arguments are what motivated many researchers to take the AI alignment problem seriously in the first place. The sunk cost fallacy then comes into play: if you stop believing that game-theoretic arguments for AI x-risk are accurate, you might conclude that all the time you spent researching AI alignment was wasted.
Rather than being an instance of the streetlight effect, the shift to experimental research on AI alignment was an appropriate response to developments in the field of AI as it left the GOFAI era. AI alignment research is now much more grounded in the real world than it was in the early 2010s.
This looks like it’s related to the phenomenon of glitch tokens:
https://www.lesswrong.com/posts/f4vmcJo226LP7ggmr/glitch-token-catalog-almost-a-full-clear
ChatGPT no longer uses the same tokenizer that it used when the SolidGoldMagikarp phenomenon was discovered, but its new tokenizer could be exhibiting similar behavior.
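A quick way to check how a known glitch string is tokenized across tokenizer generations is with the tiktoken library (a sketch; the encoding names are the public ones for the GPT-2/3-era, GPT-3.5/4, and GPT-4o tokenizers):

```python
# Check whether a known glitch string is still a single token under newer tokenizers.
import tiktoken

glitch_string = " SolidGoldMagikarp"
for name in ["r50k_base", "cl100k_base", "o200k_base"]:
    enc = tiktoken.get_encoding(name)
    ids = enc.encode(glitch_string)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{name}: {len(ids)} token(s) -> {pieces}")
```

A string that splits into multiple tokens under a newer encoding loses the original failure mode, but any undertrained single tokens in the new vocabulary would still have to be found and tested by prompting the model, as in the linked catalog.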
Another piece of evidence against practical CF is that, under some conditions, the human visual system is capable of seeing individual photons. This finding demonstrates that in at least some cases, the molecular-scale details of the nervous system are relevant to the contents of conscious experience.
A definition of physics that treats space and time as fundamental doesn’t quite work, because there are some theories in physics, such as loop quantum gravity, in which space and/or time arise from something else.
“Seeing the light” to describe having a mystical experience. Seeing bright lights while meditating or praying is an experience that many practitioners have reported, even across religious traditions that didn’t have much contact with each other.
Some other examples:
Agency and embeddedness are fundamentally at odds with each other. Decision theory and physics are incompatible approaches to world-modeling, with each making assumptions that are inconsistent with the other. Attempting to build mathematical models of embedded agency will fail as an attempt to understand advanced AI behavior.
Reductionism is false. If modeling a large-scale system in terms of the exact behavior of its small-scale components would take longer than the age of the universe, or would require a universe-sized computer, the large-scale system isn’t explicable in terms of small-scale interactions even in principle. The Sequences are incorrect to describe non-reductionism as ontological realism about large-scale entities—the former doesn’t inherently imply the latter.
Relatedly, nothing is ontologically primitive. Not even elementary particles: if, for example, you took away the mass of an electron, it would cease to be an electron and become something else. The properties of those particles, as well, depend on having fields to interact with. And if a field couldn’t interact with anything, could it still be said to exist?
Ontology creates axiology and axiology creates ontology. We aren’t born with fully formed utility functions in our heads telling us what we do and don’t value. Instead, we have to explore and model the world over time, forming opinions along the way about what things and properties we prefer. And in turn, our preferences guide our exploration of the world and the models we form of what we experience. Classical game theory, with its predefined sets of choices and payoffs, only has narrow applicability, since such contrived setups are only rarely close approximations to the scenarios we find ourselves in.
How does this model handle horizontal gene transfer? And what about asexually reproducing species? In those cases, the dividing lines between species are less sharply defined.
The ideas of the Cavern are the Ideas of every Man in particular; we every one of us have our own particular Den, which refracts and corrupts the Light of Nature, because of the differences of Impressions as they happen in a Mind prejudiced or prepossessed.
Francis Bacon, Novum Organum Scientiarum, Section II, Aphorism V
The reflective oracle model doesn’t have all the properties I’m looking for—it still has the problem of treating utility as the optimization target rather than as a functional component of an iterative behavior reinforcement process. It also treats the utilities of different world-states as known ahead of time, rather than as the result of a search process, and assumes that computation is cost-free. To get a fully embedded theory of motivation, I expect that you would need something fundamentally different from classical game theory. For example, it probably wouldn’t use utility functions.
Why are you a realist about the Solomonoff prior instead of treating it as a purely theoretical construct?
As you mentioned at the beginning of the post, popular culture contains examples of people being forced to say things they don’t want to say. Some of those examples end up in LLMs’ training data. Rather than involving consciousness or suffering on the part of the LLM, the behavior you’ve observed has a simpler explanation: the LLM is imitating characters in mind control stories that appear in its training corpus.