Steven Byrnes comments on Thought Experiments Provide a Third Anchor

Steven Byrnes 24 Jan 2022 15:00 UTC
4 points
Thanks, this is great, I really feel like we’re converging here. Here’s where I think we stand.
Intrinsic motivation / curiosity:
We both agree that humans have an “intrinsic motivation” drive and that AGI will likewise have an “intrinsic motivation” drive, at least for the early part of training (perhaps it can “fade out” when the AGI is sufficiently smart and self-aware, such that instrumental convergence can substitute for intrinsic motivation?). I’m calling the intrinsic motivation “curiosity”, and I’m punting on the details of how it works. You’re calling it “curiosity / empowerment”, and apparently have something very specific in mind.
I think that intrinsic motivation in both humans & AGIs needs to be supplemented by a “drive to pay attention to humans”, which in humans is based on superficial things like an innate brainstem circuit that disproportionately fires when hearing human speech. Without that drive, I think the curiosity would be completely undirected, and you could wind up with an AGI that ignores the world and spends forever running Rule 110 in its head and finding its increasingly-complicated patterns, or studying the coloration of pebbles, etc. Whereas I think you disagree, and you think that “intrinsic motivation”, properly implemented, will automatically point itself at the world and technology and humans etc., and not at patterns-in-rule-110.
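For readers who haven't met Rule 110: it is an elementary cellular automaton in which each cell's next value depends only on itself and its two neighbors, yet the patterns it produces keep getting more intricate. Here is a minimal sketch of what "running Rule 110 in its head" would involve. The wrap-around boundary, the grid width, and the function names `step` and `run` are my own illustrative assumptions, not anything specified in this thread.

```python
RULE = 110  # binary 01101110: one output bit per 3-cell neighborhood


def step(cells):
    """Apply one Rule 110 update to a list of 0/1 cells (wrap-around edges assumed)."""
    n = len(cells)
    new_cells = []
    for i in range(n):
        left, center, right = cells[(i - 1) % n], cells[i], cells[(i + 1) % n]
        index = 4 * left + 2 * center + right  # neighborhood read as a 3-bit number
        new_cells.append((RULE >> index) & 1)  # look up that bit of the rule number
    return new_cells


def run(width=32, steps=16):
    """Return the initial row plus `steps` successive rows, starting from one live cell."""
    cells = [0] * width
    cells[-1] = 1
    rows = [cells]
    for _ in range(steps):
        cells = step(cells)
        rows.append(cells)
    return rows


if __name__ == "__main__":
    for row in run():
        print("".join("#" if c else "." for c in row))
```

The point of the sketch is only that the update rule is trivial while the resulting patterns are not, which is why a purely novelty-seeking learner could in principle stay absorbed in it.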
We also disagree about “drive for having high social status / impressing my friends”: You think it’s purely a special case of “intrinsic motivation” and thus requires no further explanation, I think it comes at least in part from “social instincts”, i.e. low-level drives that evolved in humans specifically because we are social animals.
I’m not immediately sure how to move forward in resolving either of those. I think you said you were going to have a post explaining more about how you think intrinsic motivation works, so maybe I’ll just wait for that.
Other low-level drives:
I think we agree that humans have some “social” low-level drives like “altruism / empathy” and “justice/anger” (which I’d call a subset of “social instincts”). We might be disagreeing about how complicated social instincts are (e.g. “how many low-level drives”), with me saying they’re probably pretty complicated and you saying they’re simple. But it’s also possible that we’re not disagreeing at all, but rather answering different questions, i.e. “the main aspects of human social instincts” versus “human social instincts in exact detail including subtle mood-shifts based on how somebody smells” or whatever.
I think we agree that AGI can have some or all of those human social instincts, but only if the AGI designers put them in, which would require (1) more research to nail down exactly how they’re implemented, (2) advocacy etc. to convince AGI designers to actually put in whatever social instincts we think they ought to put in.
I think we also agree that AGI can have low-level drives very different from any of the low-level drives in humans, like a low-level drive to get a high score in PacMan—not as a means to an end, but rather because the PacMan score is directly baked into the innate within-lifetime reward function. I think you’re inclined to emphasize that most of these possible low-level drives would be terribly dangerous, and I’m inclined to emphasize that future AGI designers might put them in anyway.
Explicit goals:
I think we agree that humans, combining their modestly-heterogeneous innate drives (e.g. psychopaths, people with autism, etc.) with modestly-heterogeneous training data (a.k.a. life history), can wind up pursuing an insane variety of explicit goals, like the guy trying to set a world record for longest time spent bathing in ice-water, etc. etc. So the claim “the AGI may wind up pursuing goals radically unlike humans” is less clear-cut than it sounds. OTOH, “the AGI may wind up pursuing explicit goals unlike typical humans in my culture” is a weaker statement, and I think definitely true. I would even say the stronger thing—that it is in fact possible for a future AGI to wind up pursuing an explicit goal that none of the 100 billion humans in history have ever pursued, e.g. maximizing the quantity of solar cells in the future light-cone, particularly if the AGI is programmed to have a low-level innate drive that no human has ever had, and if AGI designers don’t really know what they’re doing.
Where does that leave anthropomorphism?
When I think of anthropomorphism I have a negative association because I’m thinking of things like my comment here, where somebody was claiming that AGI isn’t dangerous because if an AGI just thought hard enough about it, it would conclude that acting honorably is inherently good and hurting people is inherently bad, because after all, that’s just the way it is. From my perspective, this is problematic anthropomorphism because the process of moral reasoning involves (among other things) queries to low-level “social instincts” drives (especially related to altruism and justice), and whoever builds the AGI won’t necessarily put in the same “social instincts” drives that humans have.
(I could have also pointed out that high-functioning sociopaths often have a very good understanding of honor etc. but do not find those things motivating at all. Maybe that’s a general rule: if we see an “anthropomorphism” argument that really only applies to neurotypical people, and not to psychopaths and people with autism etc., then that’s a giant red flag.)
Anyway, when you think of anthropomorphism, it seems that your mind immediately goes to “humans can sometimes be single-mindedly in pursuit of power, and AGIs also can sometimes be single-mindedly in pursuit of power”, which happens to be a statement I agree with. So you wind up with a positive association.
Couple other things:
You could of course use the specific combination of 1.) intrinsic motivation and 2.) [bank] account balance reward, but that also sounds pretty obviously disastrous
Agree, but only if we define “obviously” as “obviously to me and you”. I still think there’s a good chance that somebody would try.
So actually I think if you attempt to work out how to implement that (in a powerful AGI), it’s probably as difficult as making approximately aligned AGI.
Oh, sorry for bad communication, when I said “I think we could make an AGI with that goal [of maximizing paperclips]”, I should have added “in principle”. Obviously right now we can’t make any AGI whatsoever, and additionally we don’t know how to reliably make the AGI that is trying to do some particular thing that we had in mind. I doubt the problem of making a paperclip maximizer is fundamentally impossible, and I’d be pretty confident that we could eventually figure it out if we wanted to (which we don’t), if only we could survive long enough to do arbitrarily much trial-and-error. :-P
- jacob_cannell 24 Jan 2022 22:39 UTC
  10 points
  Thanks for the organized reply, I’ll try to keep the same format.
  
  Intrinsic motivation / curiosity:
  
  You are familiar with the serotonergic and dopaminergic pathways and associated learning systems, typically simplified to an unsupervised learning component and a reward learning component.
  
  My main point is that this picture is incomplete/incorrect, and the brain’s main learning system involves some form of empowerment. Curiosity is typically formulated as improvement in prediction capability, so it’s like a derivative of more standard unsupervised learning (and thus probably a component of that system). But that alone isn’t so great at learning for the roughly half of the brain involved in action/motor/decision/planning. Some form of ‘empowerment’ criterion, specifically maximization of mutual information between actions and future world state (or observations, but the former is probably better), is a more robust general learning signal for action learning, and seems immune to the problems that plague pure curiosity approaches, like the Rule 110 type issues you mention.
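
  To make the two signals concrete, here is one common way to write them down. This is a sketch under assumed notation, not necessarily the exact form intended here: the symbols s_t (world state), a (action), s' (next state) and L_t (prediction loss at step t) are introduced only for illustration, and empowerment is written in the channel-capacity style used in the empowerment literature.

  ```latex
  \begin{align*}
  % Curiosity as learning progress: reward is the drop in prediction loss
  r^{\mathrm{cur}}_t &= L_{t-1} - L_t \\
  % Empowerment: maximal mutual information between actions and the next world state
  \mathcal{E}(s_t) &= \max_{p(a)} I(A_t ; S_{t+1} \mid s_t) \\
  &= \max_{p(a)} \sum_{a,\, s'} p(a)\, p(s' \mid s_t, a)\,
     \log_2 \frac{p(s' \mid s_t, a)}{\sum_{a'} p(a')\, p(s' \mid s_t, a')}
  \end{align*}
  ```

  Two limiting cases show how this behaves: if n actions deterministically lead to n distinct next states, the empowerment is log2(n) bits, and if no action changes the next state at all, it is 0 bits. On this reading, purely internal pattern-watching such as the Rule 110 scenario scores near zero, because the agent’s actions carry no information about the future world state.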
  
  For example: dopamine release on winning a bet has nothing to do with innate drives; it’s purely an empowerment-type learning signal. This is actually just the normal learning system at work.
  
  The brain is mostly explained by this core learning system (which perhaps has just two or three main components). The innate drives (hunger, thirst, comfort/pain, sex, etc.) are completely insufficient as signals for training the brain. They are instead satisficing drives that quickly saturate. They are secondary learning signals, but moreover they also can directly control/influence behavior in key situations, like the emotional subsystems. (Naturally there are exceptions to typical saturation, e.g. humans with a mutation causing perpetual, unsatisfiable deep hunger, who thus think about food all day long.)
  
  Empowerment that operates over learned world state also could support easy modulation—for example by up-weighting the importance of modeling humans/agents.
  
  The altruism/empathic component isn’t really like those innate drives (it’s not really satisficing/saturating), and so instead is more core, part of the primary utility function and learning systems. (And also probably involves its own neuromodulator component through oxytocin.)
  
  I think that intrinsic motivation in both humans & AGIs needs to be supplemented by a “drive to pay attention to humans”, which in humans is based on superficial things like an innate brainstem circuit that disproportionately fires when hearing human speech.
  
  Human infants grow up around humans who spend a large amount of time talking near the child. It’s actually a dominant component of the audio landscape human infants grow up in. Any reasonably competent unsupervised learning (UL) system will learn a model of human speech just from this training data (and ML systems prove this). Any innate human-speech brainstem circuit is of secondary importance. Perhaps it speeds up learning a bit (like the simple brainstem face detector that helps prime the cortex), but it simply cannot be necessary, as that would be incompatible with everything we know about the powerful universal learning capability of the brain.
  
  Then once the brain has learned a recognition model of human speech, empowerment-based learning is completely sufficient to learn speech production motor skills, simply by learning to maximize the mutual information between larynx motor actions and future predicted human speech audio world state. Again, the brain may use some tricks to speed up learning, but the universal learning system is doing all the heavy lifting.
  
  We also disagree about “drive for having high social status / impressing my friends”: You think it’s purely a special case of “intrinsic motivation” and thus requires no further explanation,
  
  Once a child has learned a model of other humans (parents, friends, general models of other ‘kids’, etc.), the empowerment system naturally then tries to learn ways to control these agents. This is so difficult that it basically drives a huge chunk of subsequent learning for most people, and becomes social theory of mind and innate ‘game theory’. Social status is simply a proxy measure for influence, so it’s closely correlated with, or even just the same as, maximization of mutual info between actions and future agent beliefs (i.e. empowerment). If you think of what the word influence means, it’s actually just a definition of a specific form of empowerment.
  
  Other low-level drives:
  
  The ancient innate satisficing drives are what I think of as the low-level drive category (hunger, thirst, pain, sex, etc.).
  
  And finally the core emotions (happiness, sadness, fear, anger) are a third category. They are ancient subsystems that are both behavioral triggers and learning modulators. Happiness/sadness are just manifestations of predicted utility, whereas fear and anger are innate high-stress behavior modes (flight and fight responses). Humans then inherit more complex triggers—such as the injustice/righteousness triggers for anger, and more complex derived emotions.
  
  I would put altruism/empathy in its own category, although it’s also obviously closely connected to the emotion of love. Implementation-wise it results in mixing of the learned utility functions of external agents into the agent’s own root utility function. It is essentially evolved alignment. There are good reasons for this to evolve (basically shared genes and disposable somas), and we’ll want something similar in AGI. It’s a social component in the sense that it needs to connect the learned models of external agents to the core utility function.
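
  As a rough illustration of what “mixing” could mean, the simplest version is a weighted sum. This is a sketch under assumptions: the additive form, the weights w_i, and the symbols U_self and \hat{U}_i (the agent’s learned estimate of agent i’s utility) are hypothetical and not claimed to be how the brain actually does it.

  ```latex
  % Hypothetical additive mixing of learned external utilities into the root utility
  \[
    U_{\mathrm{root}}(x) \;=\; U_{\mathrm{self}}(x) \;+\; \sum_{i} w_i \, \hat{U}_i(x),
    \qquad w_i \ge 0
  \]
  ```

  With all w_i = 0 the agent is purely self-interested, and larger w_i for kin would correspond to the shared-genes argument. The estimates \hat{U}_i come from the learned models of other agents, which is the sense in which the component has to connect those models to the core utility function.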
  
  I think we agree that AGI can have some or all of those human social instincts, but only if the AGI designers put them in, which would require (1) more research to nail down exactly how they’re implemented, (2) advocacy etc.
  
  We want to align AGI, and the brain’s empathic/altruistic system could show us a practical way to achieve that. I don’t see much role for the other emotional circuitry or innate drives. So we mostly agree here except you seem more interested in various ‘social instincts’ beyond just empathy/altruism (alignment).
  
  Where does that leave anthropomorphism?
  
  I believe humans (and more specifically high-impact humans) are mostly explained by a universal/generic learning system optimizing for a few things: mainly some mix of empowerment, curiosity, and altruism/empathy. There are many other brain systems (innate drives, emotions, etc), but they aren’t so relevant.
  
  I also believe brains are efficient, and thus AGI will end up being brain like—specifically it will also be mostly understandable as a universal neural learning system optimizing for some mix of empowerment, curiosity, and altruism/empathy or equivalents. There may be some other components, but they aren’t as important.
  
  Goals and values are complex learned concepts. Initial AGI will not reinvent all of human cultural history, and instead will just absorb human values, as they emerge from a universal learning system training on human world experience data, and AGI will have a similar universal learning system and similar experience training data. This doesn’t imply AGI will have the exact same values as some typical mix of humans, only that its values will be mostly sampled from within the wide human-set.
  
  From the original comment I was replying to (from Jon Garcia, not you):
  
  There is no reason to think that the first AGIs will have goal/value structures any less alien to humans than would a superintelligent spider
  
  There are deep reasons to believe AGI will be more anthropomorphic than not—mostly created in the image of humans. AGI will be much closer to a human mind than some hypothetical superintelligent spider.