Let’s talk about things like: sense-of-fairness, sense-of-justice, status-seeking, pride, defensiveness, guilt, revenge, schadenfreude, affection, generosity, in-group signaling, etc.
I claim that all those things stem from a suite of innate reactions hardcoded by the genome, and I call those things “social instincts”.
Status-seeking likely emerges from empowerment and social dynamics, guilt is likewise just emergent regret from altruism/empathy, and affection/generosity are just manifestations of altruism/empathy. Fairness/justice/revenge/anger are all likely just manifestations of the same core emotion interacting with theory of mind (injustice triggers anger, and revenge is the consequentialist endpoint of anger). In other words, I’m not claiming there aren’t any innate hardcoded emotional circuits (obviously there are), just that there are fewer truly innate ones than you posit, and that most instead emerge from learning with a smaller, simpler set of innate primal drives/emotions.
People don’t seek revenge because they figured out earlier in life that revenge would be instrumentally useful; they seek revenge because they feel a burning desire for revenge. Right?
Revenge is simply planning under the influence of anger/wrath. The anger/injustice emotional circuitry is innate, and so is planning, so humans don’t need to learn to plan while predominantly angry, but they do need to learn to map those emergent mental behaviors to the word ‘revenge’.
So I think these things come from specific genetically-hardwired circuitry setting up specific drives.
Sure we agree there, I just don’t think there are as many or as complex innate sub components as you are positing.
If we want an AI to succeed at inventing solar cells or whatever, and we don’t care whether the AI can thrive in small highly-social groups of hunter-gatherer humans, then it seems to me that the AI does not require (or even benefit from) things like revenge drives and sense-of-fairness drives.
Sure we don’t need the justice/anger emotional subsystem, or the mating specific components, but we still want the equivalent of empathy/altruism.
I think humans have a bunch of partially-redundant innate mechanisms in the brainstem to flag other humans in the vicinity. There’s good evidence for an innate face-detector in the brainstem. I strongly suspect that the brainstem also has a human-speech-sound detector
It’s debatable how important (vs vestigial) many of these innate detectors are in humans, but they certainly don’t seem to be very important/necessary for AGI. They were likely far more important for smaller brained and shorter lived mammalian ancestors.
Then if we keep the same reward function and train an AGI with humans around, will it treat humans as in-group AGIs, or out-group AGIs, or things-to-which-social-instincts-do-not-apply, or what?
If the AGI grows up in simulations that descend from modern game-tech with realistic humans, it would be pretty weird if that somehow didn’t transfer to recognizing humans as sapients (especially given how humans have no problem recognizing agents in the shape of animals or inanimate objects as sapients). This is relevant because simulations will likely continue to be the dominant, most effective means of testing/evaluating/developing AI/AGI.
The main thing I had originally wanted to push back on was your earlier claim “This idea that humans have these highly specific values that are weirdly different than the values of practical generic learning agents is actually mostly false”.
But later IIUC you said “The anger/injustice emotional circuitry is innate” and that a practical generic learning agent does not need that circuitry. (If so, I agree.)
If I’m understanding you correctly, you also think that altruism/empathy also involves purpose-built innate circuitry, and that we can make a practical generic learning agent without that altruism/empathy circuitry, and it would still be competent (e.g. able to invent a better solar cell), but the people who make AGIs will in fact choose to put that altruism/empathy circuitry in. (If so, I agree that people will want to put that circuitry in, but I’m concerned that they will not know how to put it in, and I’m also concerned that people will do dangerous experiments where they omit that circuitry just to see what happens etc.)
I find it hard to reconcile the claims “This idea that humans have these highly specific values that are weirdly different than the values of practical generic learning agents is actually mostly false” versus “It is perfectly possible to build a practical, powerful learning agent with neither anger/injustice emotional circuitry nor altruism/empathy emotional circuitry.” Those seem contradictory, right? Am I misunderstanding something?
The other side of my mildly-anti-anthropomorphism argument: I think it’s possible that we will make an AGI with things inside its within-lifetime reward function that give it one or more “innate drives” that are radically different from any of the innate drives in humans, e.g. an “innate drive” for making paperclips analogous to the human innate drive for not being in pain. My impression is that you think this won’t happen, but I’m not sure if that’s because you think it’s impossible / nonsensical, or because you think that the people who make AGIs will successfully avoid putting in drives like that.
(My belief is that “people will make AGIs with innate drives that are different from any of the innate drives in humans” is both possible and likely to actually happen, unless we put great effort into developing best practices for safe AGI design, and future AGI designers actually follow those best practices.)
I’m not claiming there aren’t any innate hardcoded emotional circuits (obviously there are), just that there are fewer truly innate ones than you posit, and that most instead emerge from learning with a smaller, simpler set of innate primal drives/emotions.
I don’t have a strong opinion about how complex the “innate primal drives / emotions” that underlie human social instincts are. In particular, I’m open-minded to the possibility that there’s one innate reaction circuit that underlies (what we think of as) schadenfreude and revenge and pride etc., or whatever.
Well, hmm, maybe “open-minded but leaning skeptical”. For example, I think humans have an innate eye-contact detector in the brainstem that triggers some set of corresponding reactions. I think that’s a thing with dedicated innate circuitry. I also think “disgust” is its own dedicated thing in the brainstem—actually, I heard there are two slightly-different innate disgust reactions, associated with slightly-different facial expressions—and disgust reactions wind up playing a role in social emotions too. Anyway, various things like that make me skeptical that there’s a simple “grand unified theory of human social emotions”.
Well, maybe it depends on how accurate we’re talking about. Maybe we can list all the human innate reaction circuits, in descending order of importance for human social emotions, and maybe the top one or two or three things would be sufficient to reproduce all the most salient and important phenomena in human social instincts, and maybe the other 500 things further down the list are all kinda subtle details that don’t add up to much. I’m very open-minded to that possibility.
debatable how important (vs vestigial) many of these innate detectors are in humans
My current belief (see the blog post draft #3 that I shared with you a couple weeks ago) is that the simplest within-lifetime reward function for a powerful AGI consists of (1) some kind of curiosity drive, (2) some kind of drive to pay attention to humans, including human language.
Your list half-overlaps with mine. IIUC, you have (1) some kind of curiosity drive, (2*) “empowerment” drive. Did I get that right?
Why do I think (2) is important? Because a curious agent can be curious about anything—it can construct better and better models of trees, or clouds, or the shape and distribution of pebbles, etc. Granted, human language is an endlessly-complex pattern that might evoke curiosity … but the agent could also run Rule 110 in its head forever and also find endlessly-complex patterns that might evoke curiosity. So I think (2) is necessary to point the curiosity drive in the right general direction. This is why I put a lot of emphasis on those innate face detectors, human-speech-sound detectors, etc.
Why do I think (2*) is not important? (With the caveat that maybe I’m misunderstanding what you mean by empowerment.) Because we can get empowerment from curiosity, through means-end instrumental reasoning within a lifetime.
I also think that the dynamic in humans is “drive for status → drive for empowerment”, rather than the other way around. (You can also get drive for empowerment from almost any other drive.) I think “drive for status” is a beautiful explanation of tons of things, not all of which are explainable via drive for empowerment, and that analogous status hierarchies / status drives exist in other animals too like the Arabian babbler (see Elephant in the Brain).
The main thing I had originally wanted to push back on was your earlier claim “This idea that humans have these highly specific values that are weirdly different than the values of practical generic learning agents is actually mostly false”.
I take values to mean longer term goals like “save the world”, or “become super successful/rich/powerful” or “do God’s work” or whatever, not low level drives or emotions.
If I’m understanding you correctly, you also think that altruism/empathy also involves purpose-built innate circuitry, and that we can make a practical generic learning agent without that altruism/empathy circuitry, and it would still be competent (e.g. able to invent a better solar cell), but the people who make AGIs will in fact choose to put that altruism/empathy circuitry in. (If so, I agree that people will want to put that circuitry in, but I’m concerned that they will not know how to put it in, and I’m also concerned that people will do dangerous experiments where they omit that circuitry just to see what happens etc.)
Yeah. The ‘altruism/empathy’ circuit obviously has some innateness, but it is closely connected to and reliant on learned theory of mind. Sadly not all researchers seem to even care to put something like that in, although they should try. How that subsystem interacts with learned theory of mind is also complex, and is probably more inherently fragile than the generic unsupervised empowerment/curiosity learning system. It may be difficult to scale correctly, even for those who are bothering to try.
I find it hard to reconcile the claims “This idea that humans have these highly specific values that are weirdly different than the values of practical generic learning agents is actually mostly false” versus “It is perfectly possible to build a practical, powerful learning agent with neither anger/injustice emotional circuitry nor altruism/empathy emotional circuitry.” Those seem contradictory, right? Am I misunderstanding something?
What I’m considering values here are almost exclusively learned (and typically social/cultural) concepts. Truly alien values would require creating a de novo alien cultural history (possible in sims but unlikely until later).
I think it’s possible that we will make an AGI with things inside its within-lifetime reward function that give it one or more “innate drives” that are radically different from any of the innate drives in humans, e.g. an “innate drive” for making paperclips analogous to the human innate drive for not being in pain
I think this is unlikely, as these simply aren’t useful for AGI in complex environments. Simple innate drives (score reward) barely work in Atari (and not even for all games). Moving to more complex environments requires some form of intrinsic motivation (empowerment/curiosity/etc.), which is necessary, sufficient, and strictly dominant/superior.
Your list half-overlaps with mine. IIUC, you have (1) some kind of curiosity drive, (2*) “empowerment” drive. Did I get that right? Why do I think (2) is important?
I lump empowerment/curiosity together, as they are both candidates for intrinsic-motivation learning, and I’m currently unsure what is the best model for human learning (some data from Atari and Minecraft indicates info-gain is a better fit than empowerment, and ‘input entropy’ is somewhat better than info-gain[1], although this may be specific to their approximation of empowerment). Regardless either is universal because of instrumental convergence, so empowerment can lead to curiosity and vice versa, but last I checked artificial curiosity still had some edge case issues.
I think “drive for status” is a beautiful explanation of tons of things, not all of which are explainable via drive for empowerment,
That seems pretty unlikely. Empowerment (or intrinsic-motivated learning) is fully universal/generic, simple, and is fully sufficient to explain drive for status, but the converse is not true. Empowerment explains play and early learning in children, how humans play novel games, why we very rapidly learn the value of money, drive for status, etc.
Yeah. The ‘altruism/empathy’ circuit obviously has some innateness, but it is closely connected to and reliant on learned theory of mind. Sadly not all researchers seem to even care to put something like that in, although they should try. How that subsystem interacts with learned theory of mind is also complex, and is probably more inherently fragile than the generic unsupervised empowerment/curiosity learning system. It may be difficult to scale correctly, even for those who are bothering to try.
Strong agree
I think this is unlikely, as these simply aren’t useful for AGI in complex environments. Simple innate drives (score reward) barely work in Atari (and not even for all games). Moving to more complex environments requires some form of intrinsic motivation (empowerment/curiosity/etc.), which is necessary, sufficient, and strictly dominant/superior.
I’m confused. Suppose AGI developer Alice wants to build an AGI that makes her as much money as possible. I would propose that maybe Alice would try a within-lifetime reward function which is a linear (or nonlinear) combination of (1) curiosity / intrinsic motivation, and (2) reward when Alice’s bank account balance goes up. The resulting AGI would have both an “innate” curiosity drive and an “innate” “make Alice’s-bank-account-balance-go-up” drive. The latter (unlike the former) is very unlike any of the innate drives in humans.
In other words, I’m open to the possibility that some kind of intrinsic motivation is sufficient to make a powerful agent, but the AGI designers don’t just want any powerful agent, they want a powerful agent trying to do something in particular that the AGI designer has in mind. And one obvious way to do so is to put that something-in-particular into the reward function in addition to curiosity / whatever.
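To make Alice's proposal concrete, here is a minimal sketch of what such a combined within-lifetime reward could look like. Everything in it is an assumption for illustration: the names `RewardWeights` and `within_lifetime_reward` are hypothetical, the weights are arbitrary, and "curiosity" is stood in for by a drop in prediction error (learning progress), which is only one of several possible formulations of intrinsic motivation.

```python
from dataclasses import dataclass


@dataclass
class RewardWeights:
    # Hypothetical mixing weights; nothing in the discussion fixes their values.
    curiosity: float = 1.0
    balance: float = 0.1


def within_lifetime_reward(pred_error_before, pred_error_after,
                           balance_before, balance_after, weights=None):
    """Linear combination of an intrinsic term and an extrinsic term."""
    if weights is None:
        weights = RewardWeights()
    # (1) Curiosity stand-in: reward the reduction in prediction error.
    curiosity_term = pred_error_before - pred_error_after
    # (2) Extrinsic term: reward only when Alice's balance goes up.
    balance_term = max(0.0, balance_after - balance_before)
    return weights.curiosity * curiosity_term + weights.balance * balance_term


# One step where the agent's model improved and the balance rose by 10.
step_reward = within_lifetime_reward(0.5, 0.25, 100.0, 110.0)
```

The point of the sketch is only that term (2) sits directly in the reward function, next to the intrinsic term, rather than being learned as a means to some other end.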
I take values to mean longer term goals like “save the world”, or “become super successful/rich/powerful” or “do God’s work” or whatever, not low level drives or emotions.
Oh sure, I normally call that “explicit goals”. I guess maybe your point is that among the 7 billion humans you’ll find such an incredibly diverse collection of explicit goals that it’s hard to imagine an AGI with a goal far outside that span? If so, I guess that’s true, to a point. But I suspect that “maximize paperclips in our future light-cone” would still be an example of something that (to my knowledge) no human in history has ever adopted as an explicit long-term goal. Whereas I think we could make an AGI with that goal.
I’m confused. Suppose AGI developer Alice wants to build an AGI that makes her as much money as possible. I would propose that maybe Alice would try a within-lifetime reward function which is a linear (or nonlinear) combination of (1) curiosity / intrinsic motivation, and (2) reward when Alice’s bank account balance goes up.
Should have clarified, but when I said intrinsic motivation was necessary and sufficient, I meant only for creating powerful (but unaligned) AGI. Clearly intrinsic motivation by itself is undesirable, as it’s not aligned, so any reasonable use of intrinsic motivation should always use that as an instrumental ‘bootstrap’ motivator, not the sole or final terminal utility.
You could of course use the specific combination of 1.) intrinsic motivation and 2.) account balance reward, but that also sounds pretty obviously disastrous: when the agent surpasses human capability its best route to maximizing 2 and 1 tends to involve taking control of the account, at which point the human becomes irrelevant at best.
Although I agree this agent would be unlike humans in terms of low level innate drives, most of the variance in human actions is explained purely by intrinsic motivation - which would also be true for this agent.
And one obvious way to do so is to put that something-in-particular into the reward function in addition to curiosity / whatever.
Yeah of course—the intrinsic motivation should never be the only/sole component.
But I suspect that “maximize paperclips in our future light-cone” would still be an example of something that (to my knowledge) no human in history has ever adopted as an explicit long-term goal. Whereas I think we could make an AGI with that goal.
So actually I think if you attempt to work out how to implement that (in a powerful AGI), it’s probably as difficult as making approximately aligned AGI. The bank account example is somewhat easier (especially if it’s a cryptocurrency account) as it has a direct external signal.
For paperclipping or intra-agent alignment, the key hard problem is actually the same: balancing intrinsic motivation and some learned model utility criteria under scaling. So I suspect most attempts therein either fail to create powerful AGI, or create powerful AGI that fails to paperclip (or align), and instead just falls into the extremely strong generic power-seeking attractor.
Creating any kind of AGI that is actually powerful is hard, and creating AGI that is both powerful and reliably optimizes long term for any world model concept X other than just power-seeking is especially hard, regardless of what X is.
Learning the world model concepts itself is not the hard part, as powerful AGI already necessarily gives you that. (And in the specific case of human alignment any powerful agent already must learn models of human utility functions as part of learning a powerful world model)
Thanks, this is great, I really feel like we’re converging here. Here’s where I think we stand.
Intrinsic motivation / curiosity:
We both agree that humans have an “intrinsic motivation” drive and that AGI will likewise have an “intrinsic motivation” drive, at least for the early part of training (perhaps it can “fade out” when the AGI is sufficiently smart and self-aware, such that instrumental convergence can substitute for intrinsic motivation?). I’m calling the intrinsic motivation “curiosity”, and I’m punting on the details of how it works. You’re calling it “curiosity / empowerment”, and apparently have something very specific in mind.
I think that intrinsic motivation in both humans & AGIs needs to be supplemented by a “drive to pay attention to humans”, which in humans is based on superficial things like an innate brainstem circuit that disproportionately fires when hearing human speech. Without that drive, I think the curiosity would be completely undirected, and you could wind up with an AGI that ignores the world and spends forever running Rule 110 in its head and finding its increasingly-complicated patterns, or studying the coloration of pebbles, etc. Whereas I think you disagree, and you think that “intrinsic motivation”, properly implemented, will automatically point itself at the world and technology and humans etc., and not at patterns-in-rule-110.
We also disagree about “drive for having high social status / impressing my friends”: You think it’s purely a special case of “intrinsic motivation” and thus requires no further explanation, I think it comes at least in part from “social instincts”, i.e. low-level drives that evolved in humans specifically because we are social animals.
I’m not immediately sure how to move forward in resolving either of those. I think you said you were going to have a post explaining more about how you think intrinsic motivation works, so maybe I’ll just wait for that.
Other low-level drives:
I think we agree that humans have some “social” low-level drives like “altruism / empathy” and “justice/anger” (which I’d call a subset of “social instincts”). We might be disagreeing about how complicated social instincts are (e.g. “how many low-level drives”), with me saying they’re probably pretty complicated and you saying they’re simple. But it’s also possible that we’re not disagreeing at all, but rather answering different questions, i.e. “the main aspects of human social instincts” versus “human social instincts in exact detail including subtle mood-shifts based on how somebody smells” or whatever.
I think we agree that AGI can have some or all of those human social instincts, but only if the AGI designers put them in, which would require (1) more research to nail down exactly how they’re implemented, (2) advocacy etc. to convince AGI designers to actually put in whatever social instincts we think they ought to put in.
I think we also agree that AGI can have low-level drives very different from any of the low-level drives in humans, like a low-level drive to get a high score in PacMan—not as a means to an end, but rather because the PacMan score is directly baked into the innate within-lifetime reward function. I think you’re inclined to emphasize that most of these possible low-level drives would be terribly dangerous, and I’m inclined to emphasize that future AGI designers might put them in anyway.
Explicit goals:
I think we agree that humans, combining their modestly-heterogeneous innate drives (e.g. psychopaths, people with autism, etc.) with modestly-heterogeneous training data (a.k.a. life history), can wind up pursuing an insane variety of explicit goals, like the guy trying to set a world record for longest time spent bathing in ice-water, etc. etc. So the claim “the AGI may wind up pursuing goals radically unlike humans” is less clear-cut than it sounds. OTOH, “the AGI may wind up pursuing explicit goals unlike typical humans in my culture” is a weaker statement, and I think definitely true. I would even say the stronger thing—that it is in fact possible for a future AGI to wind up pursuing an explicit goal that none of the 100 billion humans in history have ever pursued, e.g. maximizing the quantity of solar cells in the future light-cone, particularly if the AGI is programmed to have a low-level innate drive that no human has ever had, and if AGI designers don’t really know what they’re doing.
Where does that leave anthropomorphism?
When I think of anthropomorphism I have a negative association because I’m thinking of things like my comment here, where somebody was claiming that AGI isn’t dangerous because if an AGI just thought hard enough about it, it would conclude that acting honorably is inherently good and hurting people is inherently bad, because after all, that’s just the way it is. From my perspective, this is problematic anthropomorphism because the process of moral reasoning involves (among other things) queries to low-level “social instincts” drives (especially related to altruism and justice), and whoever builds the AGI won’t necessarily put in the same “social instincts” drives that humans have.
(I could have also pointed out that high-functioning sociopaths often have a very good understanding of honor etc. but do not find those things motivating at all. Maybe that’s a general rule: if we see an “anthropomorphism” argument that really only applies to neurotypical people, and not to psychopaths and people with autism etc., then that’s a giant red flag.)
Anyway, when you think of anthropomorphism, it seems that your mind immediately goes to “humans can sometimes be single-mindedly in pursuit of power, and AGIs also can sometimes be single-mindedly in pursuit of power”, which happens to be a statement I agree with. So you wind up with a positive association.
Couple other things:
You could of course use the specific combination of 1.) intrinsic motivation and 2.) [bank] account balance reward, but that also sounds pretty obviously disastrous
Agree, but only if we define “obviously” as “obviously to me and you”. I still think there’s a good chance that somebody would try.
So actually I think if you attempt to work out how to implement that (in a powerful AGI), it’s probably as difficult as making approximately aligned AGI.
Oh, sorry for bad communication, when I said “I think we could make an AGI with that goal [of maximizing paperclips]”, I should have added “in principle”. Obviously right now we can’t make any AGI whatsoever, and additionally we don’t know how to reliably make the AGI that is trying to do some particular thing that we had in mind. I doubt the problem of making a paperclip maximizer is fundamentally impossible, and I’d be pretty confident that we could eventually figure it out if we wanted to (which we don’t), if only we could survive long enough to do arbitrarily much trial-and-error. :-P
Thanks for the organized reply, I’ll try to keep the same format.
Intrinsic motivation / curiosity:
You are familiar with the serotonergic and dopaminergic pathways and associated learning systems, typically simplified to an unsupervised learning component and a reward learning component.
My main point is that this picture is incomplete/incorrect, and the brain’s main learning system involves some form of empowerment. Curiosity is typically formulated as improvement in prediction capability, so it’s like a derivative of more standard unsupervised learning (and thus probably a component of that system). But that alone isn’t so great at learning for the roughly half of the brain involved in action/motor/decision/planning. Some form of ‘empowerment’ criterion, specifically maximization of the mutual information between actions and future world state (or observations, but the former is probably better), is a more robust general learning signal for action learning, and seems immune to the problems that plague pure curiosity approaches, like the Rule 110 type issues you mention.
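As a rough illustration of that mutual-information criterion, the sketch below computes I(A; S') between an action and the next world state for a tiny discrete toy model. The function name and the two toy transition tables are assumptions made up for the example. Note also that empowerment as usually defined takes the maximum of this quantity over action distributions (a channel capacity); the sketch fixes a uniform action distribution for simplicity, so it gives a lower bound in general.

```python
import math


def mutual_information(action_probs, transition):
    """I(A; S') in bits, given p(a) and rows p(s' | a) as nested lists."""
    n_states = len(transition[0])
    # Marginal over next states: p(s') = sum_a p(a) * p(s' | a).
    marginal = [sum(pa * row[s] for pa, row in zip(action_probs, transition))
                for s in range(n_states)]
    mi = 0.0
    for pa, row in zip(action_probs, transition):
        for s, p in enumerate(row):
            if pa > 0 and p > 0:
                mi += pa * p * math.log2(p / marginal[s])
    return mi


uniform = [0.5, 0.5]
# Toy case 1: each action deterministically selects a different next state.
controllable = [[1.0, 0.0], [0.0, 1.0]]
# Toy case 2: the next state is a coin flip no matter which action is taken.
uncontrollable = [[0.5, 0.5], [0.5, 0.5]]

mi_controllable = mutual_information(uniform, controllable)      # 1 bit
mi_uncontrollable = mutual_information(uniform, uncontrollable)  # 0 bits
```

In the second toy case the future state may be arbitrarily complex or surprising, but the agent's actions carry no information about it, so this signal is zero. That is the sense in which an action-to-world criterion would not reward merely watching an endlessly complex pattern unfold.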
For example: dopamine release on winning a bet has nothing to do with innate drives, it’s purely an empowerment type learning signal. This is actually just the normal learning system at work.
The brain is mostly explained by this core learning system (which perhaps has just two or three main components). The innate drives (hunger, thirst, comfort/pain, sex, etc.) are completely insufficient as signals for training the brain. They are instead satisficing drives that quickly saturate. They are secondary learning signals, but moreover they can also directly control/influence behavior in key situations, like the emotional subsystems. (Naturally there are exceptions to typical saturation, e.g. humans with a mutation that causes perpetual, unsatisfiable deep hunger, who thus think about food all day long.)
Empowerment that operates over learned world state also could support easy modulation—for example by up-weighting the importance of modeling humans/agents.
The altruism/empathic component isn’t really like those innate drives (it’s not really satisficing/saturating), and so instead is more core, part of the primary utility function and learning systems. (And it also probably involves its own neuromodulator component through oxytocin.)
I think that intrinsic motivation in both humans & AGIs needs to be supplemented by a “drive to pay attention to humans”, which in humans is based on superficial things like an innate brainstem circuit that disproportionately fires when hearing human speech.
Human infants grow up around humans who spend a large amount of time talking near the child. It’s actually a dominant component of the audio landscape human infants grow up in. Any reasonably competent UL system will learn a model of human speech just from this training data (and ML systems prove this). Any innate human-speech brainstem circuit is of secondary importance. Perhaps it speeds up learning a bit (like the simple brainstem face detector that helps prime the cortex), but it simply cannot be necessary, as that would be incompatible with everything we know about the powerful universal learning capability of the brain.
Then once the brain has learned a recognition model of human speech, empowerment based learning is completely sufficient to learn speech production motor skills, simply by learning to maximize the mutual information between larynx motor actions and future predicted human speech audio world state. Again the brain may use some tricks to speed up learning, but the universal learning system is doing all the heavy lifting.
We also disagree about “drive for having high social status / impressing my friends”: You think it’s purely a special case of “intrinsic motivation” and thus requires no further explanation,
Once a child has learned a model of other humans (parents, friends, general models of other ‘kids’, etc.), the empowerment system naturally then tries to learn ways to control these agents. This is so difficult that it basically drives a huge chunk of subsequent learning for most people, and becomes social theory of mind and innate ‘game theory’. Social status is simply a proxy measure for influence, so it’s closely correlated with, or even just the same as, maximization of mutual info between actions and future agent beliefs (i.e. empowerment). If you think of what the word influence means, it’s actually just a definition of a specific form of empowerment.
Other low-level drives:
The ancient innate satisficing drives are what I think of as the low-level drive category (hunger, thirst, pain, sex, etc.).
And finally the core emotions (happiness, sadness, fear, anger) are a third category. They are ancient subsystems that are both behavioral triggers and learning modulators. Happiness/sadness are just manifestations of predicted utility, whereas fear and anger are innate high-stress behavior modes (flight and fight responses). Humans then inherit more complex triggers—such as the injustice/righteousness triggers for anger, and more complex derived emotions.
I would put altruism/empathy in its own category, although it’s also obviously closely connected to the emotion of love. Implementation wise it results in mixing of the learned utility functions of external agents into the agent’s own root utility function. It is essentially evolved alignment. There are good reasons for this to evolve—basically shared genes and disposable somas, and we’ll want something similar in AGI. It’s a social component in the sense that it needs to connect the learned models of external agents to the core utility function.
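A toy sketch of that "mixing" idea, purely for illustration: the function name `mixed_utility`, the single scalar `empathy_weight`, and the simple averaging over other agents are all assumptions of the sketch, not a claim about how the brain (or any AGI design) actually implements it.

```python
def mixed_utility(own_utility, modeled_other_utilities, empathy_weight):
    """Blend the agent's own utility with its learned estimates of others' utilities.

    empathy_weight = 0 gives a purely selfish agent; 1 gives a purely altruistic one.
    """
    if not 0.0 <= empathy_weight <= 1.0:
        raise ValueError("empathy_weight must be between 0 and 1")
    if not modeled_other_utilities:
        # No other agents are being modeled, so only the agent's own utility counts.
        return own_utility
    mean_other = sum(modeled_other_utilities) / len(modeled_other_utilities)
    return (1.0 - empathy_weight) * own_utility + empathy_weight * mean_other


# The agent's own utility is 1.0; its models of three other agents predict 0, 2 and 4.
blended = mixed_utility(1.0, [0.0, 2.0, 4.0], empathy_weight=0.5)
```

The second argument comes from the agent's learned models of other agents, which is why this component depends on theory of mind and inherits any errors in those models.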
I think we agree that AGI can have some or all of those human social instincts, but only if the AGI designers put them in, which would require (1) more research to nail down exactly how they’re implemented, (2) advocacy etc.
We want to align AGI, and the brain’s empathic/altruistic system could show us a practical way to achieve that. I don’t see much role for the other emotional circuitry or innate drives. So we mostly agree here except you seem more interested in various ‘social instincts’ beyond just empathy/altruism (alignment).
Where does that leave anthropomorphism?
I believe humans (and more specifically high-impact humans) are mostly explained by a universal/generic learning system optimizing for a few things: mainly some mix of empowerment, curiosity, and altruism/empathy. There are many other brain systems (innate drives, emotions, etc), but they aren’t so relevant.
I also believe brains are efficient, and thus AGI will end up being brain-like: specifically, it will also be mostly understandable as a universal neural learning system optimizing for some mix of empowerment, curiosity, and altruism/empathy or equivalents. There may be some other components, but they aren’t as important.
Goals and values are complex learned concepts. Initial AGI will not reinvent all of human cultural history; instead it will just absorb human values, since those emerge from a universal learning system training on human world experience data, and AGI will have a similar universal learning system and similar experience training data. This doesn’t imply AGI will have exactly the same values as some typical mix of humans, only that its values will be mostly sampled from within the wide human set.
From the original comment I was replying to (from Jon Garcia, not you):
There is no reason to think that the first AGIs will have goal/value structures any less alien to humans than would a superintelligent spider
There are deep reasons to believe AGI will be more anthropomorphic than not—mostly created in the image of humans. AGI will be much closer to a human mind than some hypothetical superintelligent spider.
It’s debatable how important (vs vestigial) many of these innate detectors are in humans, but they certainly don’t seem to be very important/necessary for AGI. They were likely far more important for smaller-brained and shorter-lived mammalian ancestors.
If the AGI grows up in simulations that descend from modern game-tech with realistic humans, it would be pretty weird if that somehow didn’t transfer to recognizing humans as sapients (especially given how humans have no problem recognizing agents in the shape of animals or inanimate objects as sapients). This is relevant because simulations will likely continue to be the dominant, most effective means of testing/evaluating/developing AI/AGI.
Thanks!
The main thing I had originally wanted to push back on was your earlier claim “This idea that humans have these highly specific values that are weirdly different than the values of practical generic learning agents is actually mostly false”.
But later IIUC you said “The anger/injustice emotional circuity is innate” and that a practical generic learning agent does not need that circuitry. (If so, I agree.)
If I’m understanding you correctly, you also think that altruism/empathy also involves purpose-built innate circuitry, and that we can make a practical generic learning agent without that altruism/empathy circuitry, and it would still be competent (e.g. able to invent a better solar cell), but the people who make AGIs will in fact choose to put that altruism/empathy circuitry in. (If so, I agree that people will want to put that circuitry in, but I’m concerned that they will not know how to put it in, and I’m also concerned that people will do dangerous experiments where they omit that circuitry just to see what happens etc.)
I find it hard to reconcile the claims “This idea that humans have these highly specific values that are weirdly different than the values of practical generic learning agents is actually mostly false” versus “It is perfectly possible to build a practical, powerful learning agent with neither anger/injustice emotional circuitry nor altruism/empathy emotional circuitry.” Those seem contradictory, right? Am I misunderstanding something?
The other side of my mildly-anti-anthropomorphism argument: I think it’s possible that we will make an AGI with things inside its within-lifetime reward function that give it one or more “innate drives” that are radically different from any of the innate drives in humans, e.g. an “innate drive” for making paperclips analogous to the human innate drive for not being in pain. My impression is that you think this won’t happen, but I’m not sure if that’s because you think it’s impossible / nonsensical, or because you think that the people who make AGIs will successfully avoid putting in drives like that.
(My belief is that “people will make AGIs with innate drives that are different from any of the innate drives in humans” is both possible and likely to actually happen, unless we put great effort into developing best practices for safe AGI design, and future AGI designers actually follow those best practices.)
I don’t have a strong opinion about how complex the “innate primal drives / emotions” that underlie human social instincts are. In particular, I’m open-minded to the possibility that there’s one innate reaction circuit that underlies (what we think of as) schadenfreude and revenge and pride etc., or whatever.
Well, hmm, maybe “open-minded but leaning skeptical”. For example, I think humans have an innate eye-contact detector in the brainstem that triggers some set of corresponding reactions. I think that’s a thing with dedicated innate circuitry. I also think “disgust” is its own dedicated thing in the brainstem—actually, I heard there are two slightly-different innate disgust reactions, associated with slightly-different facial expressions—and disgust reactions wind up playing a role in social emotions too. Anyway, various things like that make me skeptical that there’s a simple “grand unified theory of human social emotions”.
Well, maybe it depends on how accurate we’re talking about. Maybe we can list all the human innate reaction circuits, in descending order of importance for human social emotions, and maybe the top one or two or three things would be sufficient to reproduce all the most salient and important phenomena in human social instincts, and maybe the other 500 things further down the list are all kinda subtle details that don’t add up to much. I’m very open-minded to that possibility.
My current belief (see the blog post draft #3 that I shared with you a couple weeks ago) is that the simplest within-lifetime reward function for a powerful AGI consists of (1) some kind of curiosity drive, (2) some kind of drive to pay attention to humans, including human language.
Your list half-overlaps with mine. IIUC, you have (1) some kind of curiosity drive, (2*) “empowerment” drive. Did I get that right?
Why do I think (2) is important? Because a curious agent can be curious about anything—it can construct better and better models of trees, or clouds, or the shape and distribution of pebbles, etc. Granted, human language is an endlessly-complex pattern that might evoke curiosity … but the agent could also run Rule 110 in its head forever and also find endlessly-complex patterns that might evoke curiosity. So I think (2) is necessary to point the curiosity drive in the right general direction. This is why I put a lot of emphasis on those innate face detectors, human-speech-sound detectors, etc.
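For readers who have not met it, Rule 110 is an elementary cellular automaton: each cell in a row of 0s and 1s is updated from itself and its two neighbors using a fixed 8-entry lookup table whose bits spell the number 110 in binary. A minimal sketch follows; the function name `rule110_step` and the choice to treat cells beyond the edges as 0 are assumptions made for illustration.

```python
RULE = 110  # binary 01101110: one output bit per 3-cell neighborhood


def rule110_step(cells):
    """Apply one Rule 110 update to a list of 0/1 cells.

    Cells beyond either edge are treated as 0 (an assumption; a
    wrap-around boundary is another common choice).
    """
    padded = [0] + list(cells) + [0]
    new_cells = []
    for i in range(1, len(padded) - 1):
        left, center, right = padded[i - 1], padded[i], padded[i + 1]
        neighborhood = left * 4 + center * 2 + right  # value in 0..7
        new_cells.append((RULE >> neighborhood) & 1)
    return new_cells


# Start from a single live cell at the right edge and print a few rows.
row = [0] * 15 + [1]
for _ in range(8):
    print("".join("#" if c else "." for c in row))
    row = rule110_step(row)
```

The update touches nothing outside the row itself, which is the sense in which an agent could keep generating such patterns "in its head" without ever interacting with the world.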
Why do I think (2*) is not important? (With the caveat that maybe I’m misunderstanding what you mean by empowerment.) Because we can get empowerment from curiosity, through means-end instrumental reasoning within a lifetime.
I also think that the dynamic in humans is “drive for status → drive for empowerment”, rather than the other way around. (You can also get drive for empowerment from almost any other drive.) I think “drive for status” is a beautiful explanation of tons of things, not all of which are explainable via drive for empowerment, and that analogous status hierarchies / status drives exist in other animals too like the Arabian babbler (see Elephant in the Brain).
I take values to mean longer term goals like “save the world”, or “become super successful/rich/powerful” or “do God’s work” or whatever, not low level drives or emotions.
Yeah. The ‘altruism/empathy’ circuit obviously has some innateness, but it is closely connected to and reliant on learned theory of mind. Sadly not all researchers seem to even care to put something like that in, although they should try. How that subsystem interacts with learned theory of mind is also complex, and is probably more inherently fragile than the generic unsupervised empowerment/curiosity learning system. It may be difficult to scale correctly, even for those who are bothering to try.
What I’m considering values here are almost exclusively learned (and typically social/cultural) concepts. Truly alien values would require creating a de novo alien cultural history (possible in sims but unlikely until later).
I think this is unlikely, as these simply aren’t useful for AGI in complex environments. Simple innate drives (score reward) barely work in Atari (and not even for all games). Moving to more complex environments requires some form of intrinsic motivation (empowerment/curiosity/etc.), which is necessary, sufficient, and strictly dominant/superior.
I lump empowerment/curiosity together, as they are both candidates for intrinsic-motivation learning, and I’m currently unsure what is the best model for human learning (some data from Atari and Minecraft indicates info-gain is a better fit than empowerment, and ‘input entropy’ is somewhat better than info-gain[1], although this may be specific to their approximation of empowerment). Regardless either is universal because of instrumental convergence, so empowerment can lead to curiosity and vice versa, but last I checked artificial curiosity still had some edge case issues.
That seems pretty unlikely. Empowerment (or intrinsic-motivated learning) is fully universal/generic, simple, and is fully sufficient to explain drive for status, but the converse is not true. Empowerment explains play and early learning in children, how humans play novel games, why we very rapidly learn the value of money, drive for status, etc.
[1] Matusch, Brendon, Jimmy Ba, and Danijar Hafner. “Evaluating Agents without Rewards.” arXiv preprint arXiv:2012.11538 (2020).
Strong agree
I’m confused. Suppose AGI developer Alice wants to build an AGI that makes her as much money as possible. I would propose that maybe Alice would try a within-lifetime reward function which is a linear (or nonlinear) combination of (1) curiosity / intrinsic motivation, and (2) reward when Alice’s bank account balance goes up. The resulting AGI would have both an “innate” curiosity drive and an “innate” “make Alice’s-bank-account-balance-go-up” drive. The latter (unlike the former) is very unlike any of the innate drives in humans.
In other words, I’m open to the possibility that some kind of intrinsic motivation is sufficient to make a powerful agent, but the AGI designers don’t just want any powerful agent, they want a powerful agent trying to do something in particular that the AGI designer has in mind. And one obvious way to do so is to put that something-in-particular into the reward function in addition to curiosity / whatever.
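The kind of combined reward function Alice might try can be sketched as follows. This is only an illustration of the structure being described: the function name `within_lifetime_reward`, the weights, and the choice to reward only increases in the balance are assumptions, and the curiosity bonus is taken as a given number rather than computed.

```python
def within_lifetime_reward(curiosity_bonus, balance_before, balance_after,
                           w_curiosity=1.0, w_balance=1.0):
    """Linear combination of an intrinsic-motivation term and a
    designer-specified term.

    curiosity_bonus -- whatever intrinsic-motivation signal is in use
                       (treated here as an opaque number)
    balance_before / balance_after -- Alice's account balance around
                       the current time step
    """
    # Reward only when the balance goes up (an assumption; a signed
    # change would also penalize losses).
    balance_gain = max(0.0, balance_after - balance_before)
    return w_curiosity * curiosity_bonus + w_balance * balance_gain


# A step with a small curiosity bonus and a 10-unit gain.
reward = within_lifetime_reward(0.5, 100.0, 110.0)
```

The second term is the part with no counterpart among human innate drives; the first term is shared with the human-like case.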
Oh sure, I normally call that “explicit goals”. I guess maybe your point is that among the 7 billion humans you’ll find such an incredibly diverse collection of explicit goals that it’s hard to imagine an AGI with a goal far outside that span? If so, I guess that’s true, to a point. But I suspect that “maximize paperclips in our future light-cone” would still be an example of something that (to my knowledge) no human in history has ever adopted as an explicit long-term goal. Whereas I think we could make an AGI with that goal.
Should have clarified, but when I said intrinsic motivation was necessary and sufficient, I meant only for creating powerful (but unaligned) AGI. Clearly intrinsic motivation by itself is undesirable, as it’s not aligned, so any reasonable use of intrinsic motivation should always use it as an instrumental ‘bootstrap’ motivator, not the sole or final terminal utility.
You could of course use the specific combination of 1.) intrinsic motivation and 2.) account balance reward, but that also sounds pretty obviously disastrous: when the agent surpasses human capability its best route to maximizing 2 and 1 tends to involve taking control of the account, at which point the human becomes irrelevant at best.
Although I agree this agent would be unlike humans in terms of low-level innate drives, most of the variance in human actions is explained purely by intrinsic motivation, which would also be true for this agent.
Yeah of course—the intrinsic motivation should never be the only/sole component.
So actually I think if you attempt to work out how to implement that (in a powerful AGI), it’s probably as difficult as making approximately aligned AGI. The bank account example is somewhat easier (especially if it’s a cryptocurrency account) as it has a direct external signal.
For paperclipping or intra-agent alignment, the key hard problem is actually the same: balancing intrinsic motivation and some learned model utility criteria under scaling. So I suspect most attempts therein either fail to create powerful AGI, or create powerful AGI that fails to paperclip (or align), and instead just falls into the extremely strong generic power-seeking attractor.
Creating any kind of AGI that is actually powerful is hard, and creating AGI that is both powerful and reliably optimizes long term for any world model concept X other than just power-seeking is especially hard, regardless of what X is.
Learning the world model concepts itself is not the hard part, as powerful AGI already necessarily gives you that. (And in the specific case of human alignment any powerful agent already must learn models of human utility functions as part of learning a powerful world model)
Thanks, this is great, I really feel like we’re converging here. Here’s where I think we stand.
Intrinsic motivation / curiosity:
We both agree that humans have an “intrinsic motivation” drive and that AGI will likewise have an “intrinsic motivation” drive, at least for the early part of training (perhaps it can “fade out” when the AGI is sufficiently smart and self-aware, such that instrumental convergence can substitute for intrinsic motivation?). I’m calling the intrinsic motivation “curiosity”, and I’m punting on the details of how it works. You’re calling it “curiosity / empowerment”, and apparently have something very specific in mind.
I think that intrinsic motivation in both humans & AGIs needs to be supplemented by a “drive to pay attention to humans”, which in humans is based on superficial things like an innate brainstem circuit that disproportionately fires when hearing human speech. Without that drive, I think the curiosity would be completely undirected, and you could wind up with an AGI that ignores the world and spends forever running Rule 110 in its head and finding its increasingly-complicated patterns, or studying the coloration of pebbles, etc. Whereas I think you disagree, and you think that “intrinsic motivation”, properly implemented, will automatically point itself at the world and technology and humans etc., and not at patterns-in-rule-110.
We also disagree about “drive for having high social status / impressing my friends”: You think it’s purely a special case of “intrinsic motivation” and thus requires no further explanation, I think it comes at least in part from “social instincts”, i.e. low-level drives that evolved in humans specifically because we are social animals.
I’m not immediately sure how to move forward in resolving either of those. I think you said you were going to have a post explaining more about how you think intrinsic motivation works, so maybe I’ll just wait for that.
Other low-level drives:
I think we agree that humans have some “social” low-level drives like “altruism / empathy” and “justice/anger” (which I’d call a subset of “social instincts”). We might be disagreeing about how complicated social instincts are (e.g. “how many low-level drives”), with me saying they’re probably pretty complicated and you saying they’re simple. But it’s also possible that we’re not disagreeing at all, but rather answering different questions, i.e. “the main aspects of human social instincts” versus “human social instincts in exact detail including subtle mood-shifts based on how somebody smells” or whatever.
I think we agree that AGI can have some or all of those human social instincts, but only if the AGI designers put them in, which would require (1) more research to nail down exactly how they’re implemented, (2) advocacy etc. to convince AGI designers to actually put in whatever social instincts we think they ought to put in.
I think we also agree that AGI can have low-level drives very different from any of the low-level drives in humans, like a low-level drive to get a high score in PacMan—not as a means to an end, but rather because the PacMan score is directly baked into the innate within-lifetime reward function. I think you’re inclined to emphasize that most of these possible low-level drives would be terribly dangerous, and I’m inclined to emphasize that future AGI designers might put them in anyway.
Explicit goals:
I think we agree that humans, combining their modestly-heterogeneous innate drives (e.g. psychopaths, people with autism, etc.) with modestly-heterogeneous training data (a.k.a. life history), can wind up pursuing an insane variety of explicit goals, like the guy trying to set a world record for longest time spent bathing in ice-water, etc. etc. So the claim “the AGI may wind up pursuing goals radically unlike humans” is less clear-cut than it sounds. OTOH, “the AGI may wind up pursuing explicit goals unlike typical humans in my culture” is a weaker statement, and I think definitely true. I would even say the stronger thing—that it is in fact possible for a future AGI to wind up pursuing an explicit goal that none of the 100 billion humans in history have ever pursued, e.g. maximizing the quantity of solar cells in the future light-cone, particularly if the AGI is programmed to have a low-level innate drive that no human has ever had, and if AGI designers don’t really know what they’re doing.
Where does that leave anthropomorphism?
When I think of anthropomorphism I have a negative association because I’m thinking of things like my comment here, where somebody was claiming that AGI isn’t dangerous because if an AGI just thought hard enough about it, it would conclude that acting honorably is inherently good and hurting people is inherently bad, because after all, that’s just the way it is. From my perspective, this is problematic anthropomorphism because the process of moral reasoning involves (among other things) queries to low-level “social instincts” drives (especially related to altruism and justice), and whoever builds the AGI won’t necessarily put in the same “social instincts” drives that humans have.
(I could have also pointed out that high-functioning sociopaths often have a very good understanding of honor etc. but do not find those things motivating at all. Maybe that’s a general rule: if we see an “anthropomorphism” argument that really only applies to neurotypical people, and not to psychopaths and people with autism etc., then that’s a giant red flag.)
Anyway, when you think of anthropomorphism, it seems that your mind immediately goes to “humans can sometimes be single-mindedly in pursuit of power, and AGIs also can sometimes be single-mindedly in pursuit of power”, which happens to be a statement I agree with. So you wind up with a positive association.
Couple other things:
Agree, but only if we define “obviously” as “obviously to me and you”. I still think there’s a good chance that somebody would try.
Oh, sorry for bad communication, when I said “I think we could make an AGI with that goal [of maximizing paperclips]”, I should have added “in principle”. Obviously right now we can’t make any AGI whatsoever, and additionally we don’t know how to reliably make the AGI that is trying to do some particular thing that we had in mind. I doubt the problem of making a paperclip maximizer is fundamentally impossible, and I’d be pretty confident that we could eventually figure it out if we wanted to (which we don’t), if only we could survive long enough to do arbitrarily much trial-and-error. :-P
Thanks for the organized reply, I’ll try to keep the same format.
You are familiar with the serotonergic and dopaminergic pathways and associated learning systems, typically simplified to an unsupervised learning component and a reward learning component.
My main point is that this picture is incomplete/incorrect, and the brain’s main learning system involves some form of empowerment. Curiosity is typically formulated as improvement in prediction capability, so it’s like a derivative of more standard unsupervised learning (and thus probably a component of that system). But that alone isn’t so great at learning for the roughly half of the brain involved in action/motor/decision/planning. Some form of ‘empowerment’ criterion, specifically maximization of mutual information between actions and future world state (or observations, but the former is probably better), is a more robust general learning signal for action learning, and seems immune to the problems that plague pure curiosity approaches, like the Rule 110 type issues you mention.
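For concreteness, here is a minimal sketch of the quantity being referred to: the mutual information between an agent's action and the next world state, in a tiny discrete world. It is an illustration under assumptions, not a model of the brain. The function name `action_state_information`, the toy transition tables, and the default uniform action distribution are all hypothetical. (Empowerment is usually defined with a maximization over action distributions; the uniform case computed here is a lower bound on that.)

```python
import math


def action_state_information(p_next_given_action, p_action=None):
    """Mutual information I(A; S') in bits.

    p_next_given_action[a][s] is the probability of landing in next
    state s after taking action a. p_action is the distribution over
    actions (uniform if omitted).
    """
    n_actions = len(p_next_given_action)
    n_states = len(p_next_given_action[0])
    if p_action is None:
        p_action = [1.0 / n_actions] * n_actions

    # Marginal distribution over next states, p(s').
    p_next = [
        sum(p_action[a] * p_next_given_action[a][s] for a in range(n_actions))
        for s in range(n_states)
    ]

    info = 0.0
    for a in range(n_actions):
        for s in range(n_states):
            joint = p_action[a] * p_next_given_action[a][s]
            if joint > 0.0:
                info += joint * math.log2(p_next_given_action[a][s] / p_next[s])
    return info


# Two actions that each reliably pick a different next state.
controllable = [[1.0, 0.0], [0.0, 1.0]]
# Two actions that make no difference to what happens next.
uncontrollable = [[0.5, 0.5], [0.5, 0.5]]

print(action_state_information(controllable))
print(action_state_information(uncontrollable))
```

In the second table the next state is unpredictable but does not depend on the action at all, so the measure is zero. That is the toy version of the contrast with pure curiosity: a signal based on prediction improvement can be attracted to any complex pattern, whereas this signal only rewards states where the agent's actions make a difference.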
For example: dopamine release on winning a bet has nothing to do with innate drives, it’s purely an empowerment type learning signal. This is actually just the normal learning system at work.
The brain is mostly explained by this core learning system (which perhaps has just two or three main components). The innate drives (hunger, thirst, comfort/pain, sex, etc.) are completely insufficient as signals for training the brain. They are instead satisficing drives that quickly saturate. They are secondary learning signals, but moreover they also can directly control/influence behavior in key situations, like the emotional subsystems. (Naturally there are exceptions to typical saturation, such as humans with a mutation that causes perpetual, unsatisfiable deep hunger, who thus think about food all day long.)
Empowerment that operates over learned world state also could support easy modulation—for example by up-weighting the importance of modeling humans/agents.
The altruism/empathic component isn’t really like those innate drives (it’s not really satisficing/saturating), and so instead is more core, part of the primary utility function and learning systems. (It also probably involves its own neuromodulator component through oxytocin.)
Human infants grow up around humans who spend a large amount of time talking near the child. It’s actually a dominant component of the audio landscape human infants grow up in. Any reasonably competent unsupervised learning (UL) system will learn a model of human speech just from this training data (and ML systems prove this). Any innate human-speech brainstem circuit is of secondary importance. Perhaps it speeds up learning a bit (like the simple brainstem face detector that helps prime the cortex), but it simply cannot be necessary, as that would be incompatible with everything we know about the powerful universal learning capability of the brain.
Then once the brain has learned a recognition model of human speech, empowerment based learning is completely sufficient to learn speech production motor skills, simply by learning to maximize the mutual information between larynx motor actions and future predicted human speech audio world state. Again the brain may use some tricks to speed up learning, but the universal learning system is doing all the heavy lifting.
Once a child has learned a model of other humans (parents, friends, general models of other ‘kids’, etc.), the empowerment system naturally then tries to learn ways to control these agents. This is so difficult that it basically drives a huge chunk of subsequent learning for most people, and becomes social theory of mind and innate ‘game theory’. Social status is simply a proxy measure for influence, so it’s closely correlated with, or even just the same as, maximization of mutual info between actions and future agent beliefs (i.e. empowerment). If you think of what the word influence means, it’s actually just a definition of a specific form of empowerment.