Psychology professor at the University of New Mexico. BA Columbia, PhD Stanford. Works on evolutionary psychology, Effective Altruism, AI alignment, X risk. Worked on neural networks, genetic algorithms, evolutionary robotics, & autonomous agents back in the 90s.
I’m predicting that an anti-AI backlash is likely, given human moral psychology and the likely applications of AI over the next few years.
In further essays I’m working on, I’ll probably end up arguing that an anti-AI backlash may be a good strategy for reducing AI extinction risk—probably much faster, more effective, and more globally applicable than any formal regulatory regime or AI safety tactics that the AI industry is willing to adopt.
Well, the AI industry and the pro-AI accelerationists believe that there is an ‘immense upside of AGI’, but that is a highly speculative, faith-based claim, IMHO. (The case for narrow AI having clear upsides is much stronger, I think.)
It’s worth noting that almost every R&D field that has been morally stigmatized—such as intelligence research, evolutionary psychology, and behavior genetics—also offered huge and transformative upsides to society when it first developed, until it got crushed by political demonization and its potential was strangled in the cradle, so to speak.
The public perception of likely relative costs vs. benefits is part of the moral stigmatization process. If AI gets stigmatized, the public will not believe that AGI has ‘immense upside’. And they might be right.
A moral backlash against AI will probably slow down AGI development
I don’t think so. My friend Peter Todd’s email addresses typically include his middle initial ‘m’.
Puzzling.
mwatkins—thanks for a fascinating, detailed post.
This is all very weird and concerning. As it happens, my best friend since grad school is Peter Todd, professor of cognitive science, psychology, & informatics at Indiana University. We used to publish a fair amount on neural networks and genetic algorithms back in the 90s.
https://psych.indiana.edu/directory/faculty/todd-peter.html
That’s somewhat helpful.
I think we’re coming at this issue from different angles—I’m taking a very evolutionary-functional view focused on what selection pressures shape psychological adaptations, what environmental information those adaptations need to track (e.g. snake! or pathogen!), what they need to represent about the world (e.g. imminent danger of death from threat X!), and what behaviors they need to trigger (e.g. run away!).
From that evolutionary-functional view, the ‘high-level cognitive properties’ of ‘fitness affordances’ are the main things that matter to evolved agents, and the lower-level details of what genes are involved, what specific neural circuits are needed, or what specific sensory inputs are relevant, just don’t matter very much—as long as there’s some way for evolution to shape the relevant psychological adaptations.
And the fact that animals do reliably evolve to track the key fitness affordances in their environments (e.g. predators, prey, mates, offspring, kin, herds, dangers) suggests that the specifics of neurogenetic development don’t in fact impose much of a constraint on psychological evolution.
It seems like you’re coming at the issue from more of a mechanistic, bottom-up perspective that focuses on the mapping from genes to neural circuits. Which is fine, and can be helpful. But I would just be very wary about using neurogenetic arguments to make overly strong claims about what evolution can or can’t do in terms of crafting complex psychological adaptations.
If we’re dead-serious about infohazards, we can’t just be thinking in terms of ‘information that might accidentally become known to others through naive LessWrong newbies sharing it on Twitter’.
Rather, we need to be thinking in terms of ‘how could we actually prevent the military intelligence analysts of rival superpowers from being able to access this information’?
My personal hunch is that there are very few ways we could set up sites, security protocols, and vetting methods that would be sufficient to prevent access by a determined government. Which would mean, in practice, that we’d be sharing our infohazards only with the most intelligent, capable, and dangerous agents and organizations out there.
Which is not to say we shouldn’t try to be very cautious about this issue. Just that we shouldn’t be naive about what the American NSA, Russian GRU, or Chinese MSS would be capable of.
If we’re nowhere close to solving alignment well enough that even a coarse-grained description of actual human values is relevant yet, then I don’t understand why anyone is advocating further AI research at this point.
Also, ‘avoiding deceptive alignment’ doesn’t really mean anything if we don’t have a relatively rich and detailed description of what ‘authentic alignment’ with human values would look like.
I’m truly puzzled by the resistance that the AI alignment community has against learning a bit more about the human values we’re allegedly aligning with.
GeneSmith—I guess I’m still puzzled about how Shard Theory prevents wireheading (broadly construed); I just don’t see it as a magic bullet that can keep agents focused on their ultimate goals. I must be missing something.
And, insofar as Shard Theory is supposed to be an empirically accurate description of human agents, it would need to explain why some people become fentanyl addicts who might eventually overdose, and others don’t. Or why some people pursue credentials and careers at the cost of staying childless… while others settle down young, have six kids, and don’t worry as much about status-seeking. Or why some people take up free solo mountain climbing, for the rush, and fall to their deaths by age 30, whereas others are more risk-averse.
Modern consumerist capitalism offers thousands of ways to ‘wirehead’ our reward systems that don’t require experimental neurosurgery—and billions of people get caught up in those reward-hacks. If Shard Theory is serious about describing actual human behavior, it needs some way to describe both our taste for many kinds of reward-hacking, and our resistance to it.
Akash—this is very helpful; thanks for compiling it!
I’m struck that much of the advice for newbies interested in ‘AI alignment with human values’ is focused very heavily on the ‘AI’ side of alignment, and not on the ‘human values’ side of alignment—despite the fact that many behavioral and social sciences have been studying human values for many decades.
It might be helpful to expand lists like these to include recommended papers, books, blogs, videos, etc that can help alignment newbies develop a more sophisticated understanding of the human psychology side of alignment.
I have a list of recommended nonfiction books here, but it’s not alignment-focused. From this list, though, I think that many alignment researchers might benefit from reading ‘The Blank Slate’ (2002) by Steven Pinker, ‘The Righteous Mind’ (2012) by Jonathan Haidt, ‘Intelligence’ (2016) by Stuart Ritchie, etc.
GeneSmith—when people in AI alignment or LessWrong talk about ‘wireheading’, I understood that not to refer to people literally asking neurosurgeons to stick wires into their brains, but rather to a somewhat larger class of ways to hack one’s own reward systems through the usual perceptual input channels.
I agree that humans are not ‘reward maximizing agents’, whatever that is supposed to mean in reference to actual evolved organisms with diverse, heterogeneous, & domain-specific motivational systems.
Quintin (and also Alex) - first, let me say, thank you for the friendly, collegial, and constructive comments and replies you’ve offered. Many folks get reactive and defensive when they’re hit with a 6,000-word critique of their theory, but you’ve remained constructive and intellectually engaged. So, thanks for that.
On the general point about Shard Theory being a relatively ‘Blank Slate’ account, it might help to think about two different meanings of ‘Blank Slate’—mechanistic versus functional.
A mechanistic Blank Slate approach (which I take Shard Theory to be, somewhat, but not entirely, since it does talk about some reinforcement systems being ‘innate’) emphasizes the details of how we get from genome to brain development to adult psychology and behavior. Lots of discussion about Shard Theory has centered around whether the genome can ‘encode’ or ‘hardwire’ or ‘hard-code’ certain bits of human psychology.
A functional Blank Slate approach (which I think Shard Theory pursues even more strongly, to be honest) doesn’t make any positive, theoretically informative use of any evolutionary-functional analysis to characterize animal or human adaptations. Rather, functional Blank Slate approaches tend to emphasize social learning, cross-cultural differences, shared family environments, etc. as sources of psychology.
To highlight the distinction: evolutionary psychology doesn’t start by asking ‘what can the genome hard-wire?’ Rather, it starts with the same key questions that animal behavior researchers ask about any behavior in any species: ‘What selection pressures shaped this behavior? What adaptive problems does this behavior solve? How do the design details of this adaptation solve the functional problem that it evolved to cope with?’
In terms of Tinbergen’s Four Questions, a lot of the discussion around Shard Theory seems to focus on proximate ontogeny, whereas my field of evolutionary psychology focuses more on ultimate/evolutionary functions and phylogeny.
I’m aware that many folks on LessWrong take the view that the success of deep learning in neural networks, and neuro-theoretical arguments about random initialization of neocortex (which are basically arguments about proximate ontogeny), mean that it’s useless to do any evolutionary-functional or phylogenetic analysis of human behavior when thinking about AI alignment (basically, on the grounds that things like kin detection systems, cheater detection systems, mate preferences, or death-avoidance systems couldn’t possibly evolve to fulfil those functions in any meaningful sense).
However, I think there’s substantial evidence, in the 163 years since Darwin’s seminal work, that evolutionary-functional analysis of animal adaptations, preferences, and values has been extremely informative about animal behavior—just as it has about human behavior. So, it’s hard to accept any theoretical argument that the genome couldn’t possibly encode any of the behaviors that animal behavior researchers and evolutionary psychologists have been studying for so many decades. It wouldn’t just mean throwing out human evolutionary psychology. It would mean throwing out virtually all scientifically informed research on behavior in all other species, including classic ethology, neuroethology, behavioral ecology, primatology, and evolutionary anthropology.
TurnTrout—I think the ‘either/or’ framing here is misleading about the way that genomes can adapt to maximize survival and minimize death.
For example, jumping spiders have evolved special secondary eyes pointing backwards that specifically detect predators approaching from behind that might want to eat them. At the functional level of minimizing death, these eyes ‘hardcode death-fear’ in a very real and morphological way. Similarly, many animals vulnerable to predators evolve eye locations on the sides of their heads, to maximize the visual coverage of their surroundings. Prey animals also evolve pupils adapted to scanning the horizon for predators, i.e. for death-risks; the morphology of their visual systems itself ‘encodes’ fear of death from predators.
More generally, any complex adaptations that humans have evolved to avoid starvation, infection, predation, aggression, etc can be analyzed as ‘encoding a fear of death’, and can be analyzed functionally in terms of risk sensitivity, loss aversion, Bayesian priors about the most dangerous organisms and events in the environment, etc. There are thousands of papers in animal behavior that do this kind of functional analysis—including in anti-predator strategies, anti-pathogen defenses, evolutionary immunology, optimal foraging theory, food choice, intrasexual aggression, etc. This stuff is the bread and butter of behavioral biology.
So, if this strategy of evolutionary-functional analysis of death-avoidance adaptations has worked so well in thousands of other species, I don’t see why it should be considered ‘impossible in principle’ for humans, based on some theoretical arguments about how genomes can’t read off neural locations for ‘death-detecting cells’ from the adult brain.
The key point, again, is that genomes never need to ‘read off’ details of adult neural circuitry; they just need to orchestrate brain development—in conjunction with ancestrally typical, cross-generationally recurring features of their environments—that will reliably result in psychological adaptations that represent important life values and solve important life problems.
Jan—well said, and I strongly agree with your perspective here.
Any theory of human values should also be consistent with the deep evolutionary history of the adaptive origins and functions of values in general—from the earliest Cambrian animals with complex nervous systems through vertebrates, social primates, and prehistoric hominids.
As William James pointed out in 1890 (paraphrasing here), human intelligence depends on humans having more evolved instincts, preferences, and values than other animals, not fewer.
For what it’s worth, I wrote a critique of Shard Theory here on LessWrong (on Oct 20, 2022) from the perspective of behavior genetics and the heritability of values.
The comments include some helpful replies and discussions with Shard Theory developers Quintin Pope and Alex Turner.
I’d welcome any other feedback as well.
Quintin—yes, indeed, one of the reasons I was excited about Shard Theory was that it has these different emphases you mention (e.g. ‘multi-optimizer dynamics, values handshakes among shards, origins of self-reflective modeling, origins of biases, moral reflection as shard deliberation’), which I thought might actually be useful to develop and integrate with in evolutionary psychology and other branches of psychology, not just in AI alignment.
So I wanted to see if Shard Theory could be made a little more consistent with behavior genetics and ev psych theories and findings, so it could have more impact in those fields. (Both fields can get a little prickly about people ignoring their theories and findings, since they’ve been demonized for ideological reasons since the 1970s and 1990s, respectively).
Indeed, you might find quite a few similarities and analogies between certain elements of Shard Theory and certain traditional notions in evolutionary psychology, such as domain-specificity, adaptive hypocrisy and adaptive self-deception, internal conflicts between different adaptive strategies, satisficing of fitness proxies as instrumental convergent goals rather than attempting to maximize fitness itself as a terminal value, etc. Shard Theory can potentially offer some new perspectives on those traditional concepts, in the light of modern reinforcement learning theory in machine learning.
Quintin & Alex—this is a very tricky issue that’s been discussed in evolutionary psychology since the late 1980s.
Way back then, Leda Cosmides & John Tooby pointed out that the human genome will ‘offload’ any information it can that’s needed for brain development onto any environmental regularities that can be expected to be available externally, out in the world. For example, the genome doesn’t need to specify everything about time, space, and causality that might be relevant in reliably building a brain that can do intuitive physics—as long as kids can expect that they’ll encounter objects and events that obey basic principles of time, space, and causality. In other words, the ‘information content’ of the mature brain represents the genome taking maximum advantage of statistical regularities in the physical and social worlds, in order to build reliably functioning adult adaptations. See, for example, their writings here and here.
Now, should we call that kind of environmentally-driven calibration and scaffolding of evolved adaptations a form of ‘learning’? It is in some ways, but in other ways, the term ‘learning’ would distract attention away from the fact that we’re talking about a rich suite of evolved adaptations that are adapting to cross-generational regularities in the world (e.g. gravity, time, space, causality, the structure of optic flow in visual input, and many game-theoretic regularities of social and sexual interaction) -- rather than to novel variants or to cultural traditions.
Also, if we take such co-determination of brain structure by genome and environmental regularities as just another form of ‘learning’, we’re tempted to ignore the last several decades of evolutionary functional analysis of the psychological adaptations that do reliably develop in mature adults across thousands of species. In practice, labeling something ‘learned’ tends to foreclose any evolutionary-functional analysis of why it works the way it works. (For example, the still-common assumption that jealousy is a ‘learned behavior’ obscured the functional differences and sex differences between sexual jealousy and resource/emotional jealousy).
As an analogy, the genome specifies some details about how the lungs grow—but lung growth depends on environmental regularities such as the existence of oxygen and nitrogen at certain concentrations and pressures in the atmosphere; without those gases, lungs don’t grow right. Does that mean the lungs ‘learn’ their structure from atmospheric gases rather than just from the information in the genome? I think that would be a peculiar way to look at it.
The key issue is that there’s a fundamental asymmetry between the information in the genome and the information in the environment: the genome adapts to promote the reliable development of complex functional adaptations that take advantage of environmental regularities, but the environmental regularities don’t adapt in that way to help animals survive and reproduce (e.g. time, gravity, causality, and optic flow don’t change to make organismic development easier or more reliable).
Thus, if we’re serious about understanding the functional design of human brains, minds, and values, I think it’s often more fruitful to focus on the genomic side of development, rather than the environmental side (or the ‘learning’ side, as usually construed). (Of course, with the development of cumulative cultural traditions in our species in the last hundred thousand years or so, a lot more adaptively useful information is stored out in the environment—but most of the fundamental human values that we’d want our AIs to align with are shared across most mammalian species, and are not unique to humans with culture.)
GeneSmith—thanks for your comment. I’ll need to think about some of your questions a bit more before replying.
But one idea popped out to me: the idea that shard theory offers ‘a good explanation of how humans were able to avoid wireheading.’
I don’t understand this claim on two levels:
1. I may be missing something about shard theory, but I don’t actually see how it could prevent humans, at a general level, from hacking their reward systems in many ways.
2. As an empirical matter, humans do, in fact, hack our reward systems in thousands of ways that distract us from the traditional goals of survival and reproduction (i.e. in ways that represent catastrophic ‘alignment failures’ with our genetic interests). My book ‘Spent’ (2008), about the evolutionary psychology of consumer behavior, detailed many examples. Billions of people spend many hours a day on social media, watching fictional TV shows, and playing video games—rather than doing anything their Pleistocene ancestors would have recognized as reproductively relevant real-world behaviors. We are the world champions at wireheading, so I don’t see how a theory like Shard Theory that predicts the impossibility of wireheading could be accepted as empirically accurate.
PS, Gary Marcus at NYU makes some related points about Blank Slate psychology being embraced a bit too uncritically by certain strands of thinking in AI research and AI safety.
His essay ‘5 myths about learning and innateness’
His essay ‘The new science of alt intelligence’
His 2017 debate with AI researcher Yann LeCun, ‘Does AI need more innate machinery?’
I don’t agree with Gary Marcus about everything, but I think his views are worth a bit more attention from AI alignment thinkers.
Maybe. But at the moment, the US is really the only significant actor in the AGI development space. Other nations are reacting in various ways, ranging from curious concern to geopolitical horror. But if we want to minimize the risk of a nation-state AI arms race, the burden is on the US companies to Just Stop Unilaterally Driving The Arms Race.