jacob_cannell comments on My take on Jacob Cannell’s take on AGI safety

jacob_cannell 28 Nov 2022 23:18 UTC
LW: 4 AF: 3
I’ll start with a basic model of intelligence which is hopefully general enough to cover animals, humans, AGI, etc. You have a model-based agent with a predictive world model W learned primarily through self-supervised predictive learning (i.e. learning to predict the next ‘token’ for a variety of tokens), a planning/navigation subsystem P which uses W to approximately predict/sample important trajectories according to some utility function U, a value function V which computes the immediate net expected discounted future utility of actions from the current state (including internal actions), and then some action function A which just samples high-value actions based on V. The function of the planning subsystem P is then to train/update V.
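
As a rough illustration of how these five pieces could fit together, here is a minimal sketch. Everything concrete in it is an assumption made for the example: the five-state toy world, the exact deterministic W, the U that only rewards the rightmost state, the exhaustive short-horizon search used for P, and the greedy (rather than sampling) A.

```python
from dataclasses import dataclass, field

N_STATES = 5          # toy 1-D world with states 0..4 (hypothetical)
ACTIONS = (-1, +1)    # step left or step right


def W(state, action):
    """World model: predicts the next state (exact and deterministic here)."""
    return min(max(state + action, 0), N_STATES - 1)


def U(state):
    """Utility: a stand-in innate signal that only rewards the rightmost state."""
    return 1.0 if state == N_STATES - 1 else 0.0


@dataclass
class Agent:
    gamma: float = 0.9
    V: dict = field(default_factory=dict)  # value of each (state, action) pair

    def P(self, state, action, horizon=4):
        """Planner: roll W forward and return the best discounted utility found."""
        nxt = W(state, action)
        if horizon == 1:
            return U(nxt)
        best = max(self.P(nxt, a, horizon - 1) for a in ACTIONS)
        return U(nxt) + self.gamma * best

    def train(self):
        """P trains V: planning results are cached as fast value estimates."""
        for s in range(N_STATES):
            for a in ACTIONS:
                self.V[(s, a)] = self.P(s, a)

    def A(self, state):
        """Action function: greedy over V (a simplification of sampling)."""
        return max(ACTIONS, key=lambda a: self.V[(state, a)])


agent = Agent()
agent.train()
print([agent.A(s) for s in range(N_STATES)])
```

After training, the agent no longer needs to call P to act: A reads the cached V directly, which is the sense in which planning results get subsumed into the fast value/action path.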

The utility function U obviously needs some innate bootstrapping, but brains can also import varying degrees of prior knowledge into other components, most obviously into V, the value function. Many animals need key functionality ‘out of the box’, which you can get by starting with a useful prior on V/A. The benefit of innate prior knowledge in V/A diminishes as brains scale up in net training compute (size * training time), so that humans (with net training compute ~1e25 ops vs ~1e21 ops for a cat) rely far more on learned knowledge for V/A than on prior/innate knowledge.

So now to translate into your 3 levels:

A.): Innate drives: Innate prior knowledge in U and in V/A.

B.): Learned from experience and subsumed into system 1: using W/P to train V/A.

C.): System 2 style reasoning: zero shot reasoning from W/P.

(1) Evidence from cases where we can rule out (C), e.g. sufficiently simple and/or young humans/animals

So your A.) - innate drives—corresponds to U or the initial state of V/A at birth. I agree the example of newborn rodents avoiding birdlike shadows is probably mostly innate V/A—value/action function prior knowledge.

(2) Evidence from sufficiently distant consequences that we can rule out (B) Example: Many animals will play-fight as children. This has a benefit (presumably) of eventually making the animals better at actual fighting as adults. But the animal can’t learn about that benefit via trial-and-error—the benefit won’t happen until perhaps years in the future.

Sufficiently distant consequences is exactly what empowerment is for, as the universal approximator of long term consequences. Indeed the animals can’t learn about that long term benefit through trial-and-error, but that isn’t how most learning operates. Learning is mostly driven by the planning system 1 - M/P—which drives updates to V/A based on both current learned V and U—and U by default is primarily estimating empowerment and value of information as universal proxies.

The animals play-fighting is something I have witnessed and studied recently. We have a young dog and a young cat who have organically learned to play several ‘games’. The main game is a simple chase where the larger dog tries to tackle the cat. The cat tries to run/jump to safety. If the dog succeeds in catching the cat, the dog will tackle and constrain it on the ground, teasing it for a while. We, the human parents, often will interrupt the game at this point and occasionally punish the dog if it plays too rough and the cat complains. In the earliest phases the cat was about as likely to chase and attack the dog as the other way around, but over time it learned it would nearly always lose wrestling matches and end up in a disempowered state.

There is another type of ambush game the cat will play in situations where it can ‘attack’ the dog from safety or in range to escape to safety, and then other types of less rough play fighting they do close to us.

So I suspect that some amount of play fighting skill knowledge is prior instinctual, but much of it is also learned. The dog and cat both separately enjoy catching/chasing balls or small objects, the cat play fights and ‘attacks’ other toys, etc. So early on in their interactions they had these skills available, but those alone are not sufficient to explain the game(s) they play together.

The chase game is well explained by empowerment drive: the cat has learned that allowing the dog to chase it down leads to an intrinsically undesirable disempowered state. A general empowerment drive is a much better fit for the data, and has much lower intrinsic complexity, than a bunch of innate drives for every specific disempowered situation. It’s also empowering for the dog to control and disempower the cat to some extent. So many innate hunting-skill drives seem like just variations on, and/or mild tweaks to, empowerment.
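
To make ‘disempowered state’ concrete, here is a minimal sketch under strong simplifying assumptions: a deterministic 5x5 grid (hypothetical), where n-step empowerment reduces to the log of the number of distinct states reachable in n steps. A pinned cat is modelled crudely as one whose moves have no effect.

```python
import math
from itertools import product

SIZE = 5  # hypothetical 5x5 room
MOVES = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]  # stay, or step in 4 directions


def step(pos, move, pinned):
    """Deterministic dynamics: a pinned animal cannot change its position."""
    if pinned:
        return pos
    x, y = pos[0] + move[0], pos[1] + move[1]
    if 0 <= x < SIZE and 0 <= y < SIZE:
        return (x, y)
    return pos  # walls block movement


def empowerment(pos, pinned, horizon=2):
    """log2 of the number of distinct states reachable within `horizon` steps."""
    reachable = set()
    for seq in product(MOVES, repeat=horizon):
        p = pos
        for move in seq:
            p = step(p, move, pinned)
        reachable.add(p)
    return math.log2(len(reachable))


free_center = empowerment((2, 2), pinned=False)
free_corner = empowerment((0, 0), pinned=False)
pinned_center = empowerment((2, 2), pinned=True)
print(free_center, free_corner, pinned_center)
```

On this toy measure a single drive ranks being pinned below being cornered, and being cornered below being in the open, without a separate innate rule for each situation.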

The only part of this that requires a more specific explanation is perhaps the safety aspect of play fighting: each animal is always pulling punches to varying degrees, the cat isn’t using fully extended claws, neither is biting with full force, etc. That is probably the animal equivalent of empathy/altruism.

Status—I’m not sure whether Jacob is suggesting that human social status related behaviors are explained by (B) or (C) or both. But anyway I think 1,2,3,4 all push towards an (A)-type explanation for human social status behaviors. I think I would especially start with 3 (heritability)—if having high social status is generally useful for achieving a wide variety of goals, and that were the entire explanation for why people care about it, then it wouldn’t really make sense that some people care much more about status than others do, particularly in a way that (I’m pretty sure) statistically depends on their genes

Status is almost all learned B: system 2 W/P planning driving system 1 V/A updates.

Earlier I said - and I don’t see your reply yet, so I’ll repeat it here:

Infants don’t even know how to control their own limbs, but they automatically learn through a powerful general empowerment learning mechanism. That same general learning signal absolutely does not—and can not—discriminate between hidden variables representing limb poses (which it seeks to control) and hidden variables representing beliefs in other humans minds (which determine constraints on the child’s behavior). It simply seeks to control all such important hidden variables.

Social status drive emerges naturally from empowerment, which children acquire by learning cultural theory of mind and folk game theory through learning to communicate with and through their parents. Children quickly learn that hidden variables in their parents have a huge effect on their environment and thus try to learn how to control those variables.

It’s important to emphasize that this is all subconscious and subsumed into the value function, it’s not something you are consciously aware of.

I don’t see how heritability tells us much about how innate social status is. Genes can control many hyperparameters which can directly or indirectly influence the later learned social status drive. One obvious example is just the relative weightings of value-of-information (curiosity) vs optionality-empowerment and other innate components of U at different points in time (development periods). I think this is part of the explanation for children who are highly curious about the world and less concerned about social status vs the converse.
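
A minimal sketch of that weighting idea, with all names and numbers invented for illustration: two children share the same learned estimates for two activities but differ in the innate weights inside U, and so end up preferring different activities.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InnateWeights:
    voi: float          # weight on value-of-information (curiosity)
    optionality: float  # weight on optionality-empowerment


# Hypothetical learned estimates: (value of information, optionality gain)
ACTIVITIES = {
    "take apart a radio": (0.9, 0.2),
    "win over the peer group": (0.2, 0.9),
}


def utility(weights, voi_estimate, optionality_estimate):
    """U as a weighted mix of the two generic proxies."""
    return weights.voi * voi_estimate + weights.optionality * optionality_estimate


def preferred(weights):
    """Activity with the highest weighted utility for this set of innate weights."""
    return max(ACTIVITIES, key=lambda name: utility(weights, *ACTIVITIES[name]))


curious_child = InnateWeights(voi=0.8, optionality=0.2)
social_child = InnateWeights(voi=0.2, optionality=0.8)
print(preferred(curious_child), "|", preferred(social_child))
```

Nothing about status is hard-coded here; the heritable part is only the pair of weights, which is the kind of indirect genetic influence described above.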

Fun—Jacob writes “Fun is also probably an emergent consequence of value-of-information and optionality” which I take to be a claim that “fun” is (B) or (C), not (A). But I think it’s (A).

Fun is complex and general/vague—it can be used to describe almost anything we derive pleasure from in your A.) or B.) categories.
- Steven Byrnes 30 Nov 2022 16:49 UTC
  LW: 4 AF: 3
  Thanks!
  One of my disagreements with your U,V,P,W,A model is that I think V & W are randomly-initialized in animals. Or maybe I’m misunderstanding what you mean by “brains also can import varying degrees of prior knowledge into other components”.
  I also (relatedly?) am pretty against trying to lump the brainstem / hypothalamus and the cortex / BG / etc. into a single learning-algorithm-ish framework.
  I’m not sure if this is exactly your take, but I often see a perspective (e.g. here) where someone says “We should think of the brain as a learning algorithm. Oh wait, we need to explain innate behaviors. Hmm OK, we should think of the brain as a pretrained learning algorithm.”
  But I think that last step is wrong. Instead of “pretrained learning algorithm”, we can alternatively think of the brain as a learning algorithm plus other things that are not learning algorithms. For example, I think most newborn behaviors are purely driven by the brainstem, which is doing things of its own accord without any learning and without any cortex involvement.
  To illustrate the difference between “pretrained learning algorithm” and “learning algorithm + other things that are not learning algorithms”:
  Suppose I’m making a robot. I put in a model-based RL system. I also put in a firmware module that detects when the battery is almost empty and when it is, it shuts down the RL system, takes control, and drives the robot back to the charging station.
  Leaving aside whether this is a good design for a robot, or a good model for the brain (it’s not), let’s just talk about this system. Would we describe the firmware module as “importing prior knowledge into components of the RL algorithm”? No way, right? Instead we would describe the firmware module as “a separate component from the RL algorithm”.
  By the same token, I think there are a lot of things happening in the brainstem / hypothalamus which we should describe as “a separate component from the RL algorithm”.
  Sufficiently distant consequences is exactly what empowerment is for, as the universal approximator of long term consequences. Indeed the animals can’t learn about that long term benefit through trial-and-error, but that isn’t how most learning operates. Learning is mostly driven by the planning system 1 - M/P—which drives updates to V/A based on both current learned V and U—and U by default is primarily estimating empowerment and value of information as universal proxies.
  [M/P is a typo for W/P right?]
  Let’s say I wake up in the morning and am deciding whether or not to put a lock pick set in my pocket. There are reasons to think that this might increase my empowerment—if I find myself locked out of something, I can maybe pick the lock. There are also reasons to think that this might decrease my empowerment—let’s say, if I get frisked by a cop, I look more suspicious and have a higher chance of spurious arrest, and also I’m carrying around more weight and have less room in my pockets for other things.
  So, all things considered, is it empowering or disempowering to put the lock pick set into my pocket for the day? It depends. In a city, it’s maybe empowering. On a remote mountain, it’s probably disempowering. In between, hard to say.
  The moral is: I claim that figuring out what’s empowering is not a “local” / “generic” / “universal” calculation. If I do X in the morning, it is unknowable whether that was an empowering or disempowering action, in the absence of information about where I’m likely to find myself in the afternoon. And maybe I can make an intelligent guess at those, but I’m not omniscient. If I were a newborn, I wouldn’t even be able to guess.
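  The context-dependence can be put in a few lines. This is a minimal sketch with made-up probabilities and made-up empowerment gains; the only point is that the same action changes sign depending on the assumed distribution over afternoon situations.

```python
# Hypothetical change in empowerment from carrying the lock pick set,
# conditional on what happens later in the day.
GAIN_IF = {"locked_out": 2.0, "frisked": -3.0, "nothing_happens": -0.1}

# Hypothetical beliefs about the afternoon in two contexts.
CITY = {"locked_out": 0.30, "frisked": 0.05, "nothing_happens": 0.65}
MOUNTAIN = {"locked_out": 0.0, "frisked": 0.0, "nothing_happens": 1.0}


def expected_gain(situation_probs):
    """Expected empowerment change, given beliefs about later situations."""
    return sum(p * GAIN_IF[s] for s, p in situation_probs.items())


print(expected_gain(CITY), expected_gain(MOUNTAIN))
```

  An agent with no beliefs about which row applies (e.g. a newborn) has nothing to plug into `situation_probs`, so the estimate cannot be formed from the action alone.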
  So anyway, if an animal could practice skill X versus skill Y as a baby, it is (in general) unknowable which one is a more empowering course of action, in the absence of information about what kinds of situations the animal is likely to find itself in when it’s older. And the animal itself doesn’t know that—it’s just a baby.
  Since I’m a smart adult human, I happen to know that:
  - it’s empowering for baby cats to practice pouncing,
  - it’s empowering for baby bats to practice arm-flapping,
  - it’s empowering for baby humans to practice grasping,
  - it’s not empowering for baby humans to practice arm-flapping,
  - it’s not empowering for baby bats to practice pouncing,
  - etc.
  But I don’t know how the baby cats, bats, and humans are supposed to figure that out, via some “generic” empowerment calculation. Arm-flapping is equally immediately useless for both newborn bats and newborn humans, but newborn humans never flap their arms and newborn bats do constantly.
  So yeah, it would be simple and elegant to say “the baby brain is presented with a bunch of knobs and levers and gradually discovers all the affordances of a human body”. But I don’t think that fits the data, e.g. the lack of human newborn arm-flapping experiments in comparison to newborn bats.
  Instead, I think baby humans have an innate drive to stand up, an innate drive to walk, an innate drive to grasp, and probably a few other things like that. I think they already want to do those things even before they have evidence (or other rational basis to believe) that doing so is empowering.
  I claim that this also fits better into a theory where (1) the layout of motor cortex is relatively consistent between different people (in the absence of brain damage), (2) decorticate rats can move around in more-or-less species-typical ways, (3) there’s strong evolutionary pressure to learn motor control fast and we know that reward-shaping is helpful for that, (4) and that there’s stuff in the brainstem that can do this kind of reward-shaping, (5) lots of animals can get around reasonably well within a remarkably short time after birth, (6) stimulating a certain part of the brain can create “an urge to move your arm” etc. which is independent from executing the actual motion, (7) things like palmar grasp reflex, Moro reflex, stepping reflex, etc. (8) the sheer delight on the face of a baby standing up for the first time, (9) there are certain dopamine signals (from lateral SNc & SNl) that correlate with motor actions specifically, independent of general reward etc. (There’s kinda a long story, that I think connects all these dots, that I’m not getting into.)
  (If you put a novel and useful motor affordance on a baby human—some funny grasper on their hand or something—I’m not denying that they would eventually figure out how to start using it, thanks to more generic things like curiosity, stumbling upon useful things, maybe learning-from-observation, etc. I just don’t think those kinds of things are the whole story for early acquisition of species-typical movements like grasping and standing. For example, I figure decorticate rats would probably fail to learn to use a weird novel motor affordance, but decorticate rats do move around in more-or-less species-typical ways.)
  some amount of play fighting skill knowledge is prior instinctual, but much of it is also learned
  Sure, I agree.
  The only part of this that requires a more specific explanation is perhaps the safety aspect of play fighting: each animal is always pulling punches to varying degrees, the cat isn’t using fully extended claws, neither is biting with full force, etc. That is probably the animal equivalent of empathy/altruism.
  Yeah pulling punches is one thing. Another thing is that animals have universal species-specific somewhat-arbitrary signals that they’re playing, including certain sounds (laughing in humans) and gestures (“play bow” in dogs).
  My more basic argument is that the desire to play-fight in the first place, as opposed to just relaxing or whatever, is an innate drive. I think we’re giving baby animals too much credit if we expect them to be thinking to themselves “gee when I grow up I might need to be good at fighting so I should practice right now instead of sitting on the comfy couch”. I claim that there isn’t any learning signal or local generic empowerment calculation that would form the basis for that.
  Fun is complex and general/vague—it can be used to describe almost anything we derive pleasure from in your A.) or B.) categories.
  Fair enough.
  - jacob_cannell 30 Nov 2022 19:25 UTC
    LW: 4 AF: 3
    
    One of my disagreements with your U,V,P,W,A model is that I think V & W are randomly-initialized in animals. Or maybe I’m misunderstanding what you mean by “brains also can import varying degrees of prior knowledge into other components”.
    
    I think we agree the cortex/cerebellum are randomly initialized, along with probably most of the hippocampus, BG, perhaps amygdala? and a few others. But those don’t map cleanly to U, W/P, and V/A.
    
    For example, I think most newborn behaviors are purely driven by the brainstem, which is doing things of its own accord without any learning and without any cortex involvement.
    
    Of course—and that is just innate unlearned knowledge in V/A. V/A (value and action) generally go together, because any motor/action skills need pairing with value estimates so the BG can arbitrate (de-conflict) action selection.
    
    The moral is: I claim that figuring out what’s empowering is not a “local” / “generic” / “universal” calculation. If I do X in the morning, it is unknowable whether that was an empowering or disempowering action, in the absence of information about where I’m likely to find myself in the afternoon. And maybe I can make an intelligent guess at those, but I’m not omniscient. If I were a newborn, I wouldn’t even be able to guess.
    
    Empowerment and value-of-information (curiosity) estimates are always relative to current knowledge (contextual to the current wiring and state of W/P and V/A). Doing X in the morning generally will have variable optionality value depending on the contextual state, goals/plans, location, etc. I’m not sure why you seem to think that I think of optionality-empowerment estimates as requiring anything resembling omniscience.
    
    The newborn’s VoI and optionality value estimates will be completely different and focused on things like controlling flailing limbs and making sounds, moving the head, etc.
    
    But I don’t know how the baby cats, bats, and humans are supposed to figure that out, via some “generic” empowerment calculation. Arm-flapping is equally immediately useless for both newborn bats and newborn humans, but newborn humans never flap their arms and newborn bats do constantly.
    
    There’s nothing to ‘figure out’ - it just works. If you’re familiar with the approximate optionality-empowerment literature, it should be fairly obvious that a generic agent maximizing optionality will end up flapping its wing-arms when controlling a bat body, flailing limbs around in a newborn human body, balancing pendulums, learning to walk, etc. I’ve already linked all this, but maximizing optionality automatically learns all motor skills, even up to bipedal walking.
    
    So yeah, it would be simple and elegant to say “the baby brain is presented with a bunch of knobs and levers and gradually discovers all the affordances of a human body”. But I don’t think that fits the data, e.g. the lack of human newborn arm-flapping experiments in comparison to newborn bats.
    
    Human babies absolutely do the equivalent experiments—most of the difference is simply due to large differences in the arm structure. The bat’s long extensible arms are built to flap, the human infants’ short stubby arms are built to flail.
    
    Also keep in mind that efficient optionality is approximated/estimated from a sampling of likely actions in the current V/A set, so it naturally and automatically takes advantage of any prior knowledge there. Perhaps the bat does have prior wiring in V/A that proposes and generates simple flapping that can be improved.
    
    Instead, I think baby humans have an innate drive to stand up, an innate drive to walk, an innate drive to grasp, and probably a few other things like that. I think they already want to do those things even before they have evidence (or other rational basis to believe) that doing so is empowering.
    
    This just doesn’t fit the data at all. Humans clearly learn to stand and walk. They may have some innate bias in V/U which makes that subgoal more attractive, but that is an intrinsically more complex addition to the basic generic underlying optionality control drive.
    
    I claim that this also fits better into a theory where (1) the layout of motor cortex is relatively consistent between different people (in the absence of brain damage),
    
    We’ve already been over that—consistent layout is not strong evidence of innate wiring. A generic learning system will learn similar solutions given similar inputs & objectives.
    
    (2) decorticate rats can move around in more-or-less species-typical ways,
    
    The general lesson from the decortication experiments is that smaller-brained mammals rely on (their relatively smaller) cortex less. Rats/rabbits can do much without the cortex and have many motor skills available at birth. Cats/dogs need to learn a bit more, and then primates, especially larger ones, need to learn much more and rely on the cortex heavily. This is extreme in humans, to the point where there is very little innate motor ability left, and the cortex does almost everything.
    
    (3) there’s strong evolutionary pressure to learn motor control fast and we know that reward-shaping is helpful for that,
    
    It takes humans longer than an entire rat lifespan just to learn to walk. Hardly fast.
    
    (4) and that there’s stuff in the brainstem that can do this kind of reward-shaping,
    
    Sure, but there is hardly room in the brainstem to reward-shape for the 1e100 different things humans can learn to do.
    
    Universal capability requires universal learning.
    
    (5) lots of animals can get around reasonably well within a remarkably short time after birth,
    
    Not humans.
    
    (6) stimulating a certain part of the brain can create “an urge to move your arm” etc. which is independent from executing the actual motion,
    
    Unless that is true for infants, it’s just learned V components. I doubt infants have an urge to move the arm in a coordinated way, vs lower level muscle ‘urges’, but even if they did that’s just some prior knowledge in V.
    
    (If you put a novel and useful motor affordance on a baby human—some funny grasper on their hand or something—I’m not denying that they would eventually figure out how to start using it, thanks to more generic things like curiosity,
    
    We know that humans can learn to see through their tongue—and this does not take much longer than an infant learning to see through its eyes.
    
    I think we both agree that sensory cortex uses a pretty generic universal learning algorithm (driven by self supervised predictive learning). I just also happen to believe the same applies to motor and higher cortex (driven by some mix of VoI, optionality control, etc).
    
    I think we’re giving baby animals too much credit if we expect them to be thinking to themselves “gee when I grow up I might need to be good at fighting so I should practice right now instead of sitting on the comfy couch”. I claim that there isn’t any learning signal or local generic empowerment calculation that would form the basis for that
    
    Comments like these suggest you don’t have the same model of optionality-empowerment as I do. When the cat was pinned down by the dog in the past, its planning subsystem computed low value for that state (mostly based on lack of optionality) and subsequently the V system internalizes this as low value for that state and states leading towards it. Afterwards when entering a room and seeing the dog on the other side, the W/P planning system quickly evaluates a few options like: (run into the center and jump up onto the table), (run into the center and jump onto the couch), (run to the right and hide behind the couch), etc., and subplan/action (run into the center ..) gets selected in part because of higher optionality. It’s just an intrinsic component of how the planning system chooses options on even short timescales, and chains recursively through training V/A.
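    The option-scoring step can be sketched as follows. This is a toy illustration only: the base values, the follow-up counts standing in for optionality, and the weight are all invented, and the additive scoring rule is an assumption rather than a claim about how the planner actually combines terms.

```python
import math

# option -> (hypothetical base value, hypothetical number of follow-up escape routes)
OPTIONS = {
    "run into the center and jump up onto the table": (0.5, 4),
    "run into the center and jump onto the couch": (0.5, 3),
    "run to the right and hide behind the couch": (0.6, 1),
}
OPTIONALITY_WEIGHT = 0.3  # assumed weight of the optionality term


def score(base_value, n_followups):
    """Base value plus a bonus that grows with the log of available follow-ups."""
    return base_value + OPTIONALITY_WEIGHT * math.log2(n_followups)


def choose(options):
    """Pick the option with the highest combined score."""
    return max(options, key=lambda name: score(*options[name]))


print(choose(OPTIONS))
```

    Hiding behind the couch has the highest base value here but loses, because it leaves only one follow-up; that is the sense in which optionality enters even short-timescale choices.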
    - Steven Byrnes 1 Dec 2022 3:38 UTC
      LW: 2 AF: 2
      Thanks!
      I’m not sure why you seem to think that I think of optionality-empowerment estimates as requiring anything resembling omniscience.
      If we assume omniscience, it allows a very convenient type of argument:
      Argument I [invalid]: Suppose an animal has a generic empowerment drive. We want to know whether it will do X. We should ask: Is X actually empowering?
      However, if we don’t assume omniscience, then we can’t make arguments of that form. Instead we need to argue:
      Argument II [valid]: Suppose an animal has a generic empowerment drive. We want to know whether it will do X. We should ask: Has the animal come to believe (implicitly or explicitly) that doing X is empowering?
      I have the (possibly false!) impression that you’ve been implicitly using Argument I sometimes. That’s how omniscience came up.
      For example, has a newborn bat come to believe (implicitly or explicitly) that flapping its arm-wings is empowering? If so, how did it come to believe that? The flapping doesn’t accomplish anything, right? They’re too young and weak to fly, and don’t necessarily know that flying is an eventual option to shoot for. (I’m assuming that baby bats will practice flapping their wings even if raised away from other bats, but I didn’t check, I can look it up if it’s a crux.) We can explain a sporadic flap or two as random exploration / curiosity, but I think bats practice flapping way too much for that to be the whole explanation.
      Back to play-fighting. A baby animal is sitting next to its sibling. It can either play-fight, or hang out doing nothing. (Or cuddle, or whatever else.) So why play-fight?
      Here’s the answer I prefer. I note that play-fighting as a kid presumably makes you a better real-fighter as an adult. And I don’t think that’s a coincidence; I think it’s the main point. In fact, I thought that was so obvious that it went without saying. But I shouldn’t assume that—maybe you disagree!
      If you agree that “child play-fighting helps train for adult real-fighting” not just coincidentally but by design, then I don’t see the “Argument II” logic going through. For example, animals will play-fight even if they’ve never seen a real fight in their life.
      So again: Why don’t your dog & cat just ignore each other entirely? Sure, when they’re already play-fighting, there are immediately-obvious reasons that they don’t want to be pinned. But if they’re relaxing, and not in competition over any resources, why go out of their way to play-fight? How did they come to believe that doing so is empowering? Or if they are in competition over resources, why not real-fight, like undomesticated adult animals do?
      maximizing optionality automatically learns all motor skills—even up to bipedal walking
      I agree, but I don’t think that’s strong evidence that nothing else is going on in humans. For example, there’s a “newborn stepping reflex”—newborn humans have a tendency to do parts of walking, without learning, even long before their muscles and brains are ready for the whole walking behavior. So if you say “a simple generic mechanism is sufficient to explain walking”, my response is “Well it’s not sufficient to explain everything about how walking is actually implemented in humans, because when we look closely we can see non-generic things going on”.
      Here’s a more theoretical perspective. Suppose I have two side-by-side RL algorithms, learning to control identical bodies. One has some kind of “generic” empowerment reward. The other has that same reward, plus also a reward-shaping system directly incentivizing learning to use some small number of key affordances that are known to work well for that particular body (e.g. standing).
      I think the latter would do all the same things as the former, but it would learn faster and more reliably, particularly very early on. Agree or disagree? If you agree, then we should expect to find that in the brain, right?
      (When I say “more reliably”, I’m referring to the trope that programming RL agents is really finicky, more so than other types of ML. I don’t really know if that trope is correct though.)
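      The two reward signals in this thought experiment can be written down directly. This is a minimal sketch with a hypothetical bonus table and hypothetical behavior labels; it only shows how the signals differ and does not by itself demonstrate faster or more reliable learning.

```python
# Hypothetical body-specific bonuses for a small number of key affordances.
KEY_AFFORDANCE_BONUS = {"standing": 0.5, "grasping": 0.3}


def generic_reward(optionality_gain):
    """Agent 1: reward is just the estimated gain in optionality/empowerment."""
    return optionality_gain


def shaped_reward(optionality_gain, behavior):
    """Agent 2: same generic reward, plus a bonus for known-useful affordances."""
    return generic_reward(optionality_gain) + KEY_AFFORDANCE_BONUS.get(behavior, 0.0)


# Early on, a wobbly attempt at standing may yield no measurable optionality gain.
print(generic_reward(0.0), shaped_reward(0.0, "standing"))
# For behaviors outside the table, both agents receive the same signal.
print(generic_reward(0.2), shaped_reward(0.2, "touch-typing"))
```

      The second agent gets a nonzero signal for early standing attempts even when the generic term is still zero, and is otherwise identical, which is the asymmetry the question above turns on.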
      Sure, but there is hardly room in the brainstem to reward-shape for the [ $10^{100}$ ] different things humans can learn to do.
      I hope we’re not having one of those silly arguments where we both agree that empowerment explains more than 0% and less than 100% of whatever, and then we’re going back and forth saying “It’s more than 0%!” “No way, it’s less than 100%!” “No way, it’s more than 0%!” … :)
      Anyway, I think the brainstem “knows about” some limited number of species-typical behaviors, and can probably execute those behaviors directly without learning, and also probably reward-shapes the cortex into learning those behaviors faster. Obviously I agree that the cortex can also learn pretty much arbitrary other behaviors, like ballet and touch-typing, which are not specifically encoded in the brainstem.