Just curious—how much time have you invested in the DL literature vs LW/sequences/safety?
One thing that consistently infuriates me is the extent to which the AI-safety community has invented its own terminology/ontology that is largely at odds with DL/ML. For example, I had to dig deep to discover that ‘inner alignment’ mostly maps to ‘generalization error’, and that ‘consequentialist agent’ mostly maps to model-based RL agent.
It’s like the AI safety community is trying to prevent an unaligned Rome from taking over the world, but they are doing all their work in Hebrew instead of Latin.
> Unrestricted, superintelligent, and capable AIs which act like long-term, expected utility maximizers with purely outcome based goals (aka consequentialists) would cause an existential catastrophe if created (mostly by humans) with approaches similar to current ML.
Powerful AGIs will be model-based RL (consequentialist) agents in one form or another—not much uncertainty there. The AI safety community doesn’t get to decide the shape of successful agents, so giving up on aligning model-based agents is just . . . giving up.
> This assumption is due to an inability to construct a human values utility function, an inability to perfectly inner align an agent’s utility function, Goodhart’s law, and instrumental convergence.
The utility functions of human children aren’t ‘perfectly inner aligned’ with that of their parents, but human-level alignment would probably be good enough. Don’t let perfect be the enemy of the good.
Achieving human-level alignment probably won’t be easy, but it’s also obviously not impossible. The success of DL is in part a success in partially reverse engineering the brain, which suggests future success in reverse engineering human alignment (empathy/altruism).
> One thing that consistently infuriates me is the extent to which the AI-safety community has invented its own terminology/ontology that is largely at odds with DL/ML. For example, I had to dig deep to discover that ‘inner alignment’ mostly maps to ‘generalization error’
Nobody likes jargon (well, nobody worth listening to likes jargon) but there’s a reason that healthy fields have jargon, and it’s because precise communication of ideas within a field is important. “Inner alignment” indeed has some relationship to “generalization error” but they’re not exactly the same thing, and we can communicate better by using both terms where appropriate.
If your complaint is lack of good pedagogical materials, fair enough. Good pedagogy often exists, but it’s sometimes scattered about. Plus Rob Miles I guess.
> and that ‘consequentialist agent’ mostly maps to model-based RL agent.
“Consequentialist” is a common English word, defined in the dictionary as “choosing actions based on their anticipated consequences” or something. Then the interesting question is “to what extent do different AI algorithms give rise to consequentialist behaviors”? I don’t think it’s binary, I think it’s a continuum. Some algorithms are exceptionally good at estimating the consequences of actions, even OOD, and use those consequences as the exclusive selection criterion; those would be maximally consequentialist. Some algorithms like GPT-3 are not consequentialist at all.
I think I’d disagree with “model-based RL = consequentialist”. For example, a model-free RL agent, with a long time horizon, acting in-distribution, does lots of things that look foresighted and strategic, and it does those things because of their likely eventual consequences (as indirectly inferred from past experience). (What is a Q-value if not “anticipated consequences”?) So it seems to me that we should call model-free RL agents “consequentialist” too.
I would say that model-based RL agents do “explicit planning” (whereas model-free ones usually don’t). I don’t think “agent that does explicit planning” is exactly the same as “consequentialist agent”. But they’re not totally unrelated either. Explicit planning can make an agent more consequentialist, by helping it estimate consequences better, in a wider variety of circumstances.
(I could be wrong on any of these, this is just my current impression of how people use these terms.)
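To make the “cached anticipated consequences” point concrete, here is a minimal tabular Q-learning sketch (purely illustrative; the constants, environment, and function names are assumptions, not anything from this exchange). The Q-table ends up holding discounted estimates of future consequences learned from past experience, and greedy action selection consults only that cache, with no world model involved:

```python
from collections import defaultdict

GAMMA = 0.95   # discount factor: near 1 means long-run consequences dominate
ALPHA = 0.1    # learning rate

Q = defaultdict(float)  # (state, action) -> estimated discounted return

def td_update(state, action, reward, next_state, actions):
    """One TD(0) backup: fold an observed consequence into Q(state, action)."""
    best_next = max(Q[(next_state, a)] for a in actions)
    target = reward + GAMMA * best_next
    Q[(state, action)] += ALPHA * (target - Q[(state, action)])

def act_greedy(state, actions):
    """Model-free action choice: pick whichever action has the highest cached
    estimate of anticipated consequences, without consulting any world model."""
    return max(actions, key=lambda a: Q[(state, a)])
```

In-distribution, acting greedily on such cached estimates can look quite foresighted, which is the sense in which a model-free agent can still behave like a consequentialist.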
So I said consequentialist mostly maps to model-based RL because “choosing actions based on their anticipated consequences” is just a literal plain-English description of how model-based RL works—with the model-based predictive planning being an implementation of “anticipating consequences”.
It’s more complicated for model-free RL, in part because with enough diverse training data and regularization, various forms of consequentialist/planning systems could potentially develop as viable low-complexity solutions.
But effective consequentialist planning requires significant compute and recursion depth, which puts it outside the scope of many simpler model-free systems—I’m thinking primarily of the earlier DeepMind Atari agents—so instead they often seem to develop a collection of clever heuristics that work well in most situations, without the ability to explicitly evaluate the long-term consequences of specific actions in novel situations—thus more deontological.
Hmm, I would say that DQN “chooses actions based on their anticipated consequences” in that the Q-function incorporates an estimate of anticipated consequences. (Especially with a low discount rate.)
I’m happy to say that model-based RL might be generically better at anticipating consequences (especially in novel circumstances) than model-free RL. Neither is perfect though.
DQN has an implicit plan encoded in the Q-function—i.e., in state S1 action A1 seems good, and that brings us to state S2 where action A2 seems good, etc. … all that stuff together is (IMO) an implicit plan, and such a plan can involve short-term sacrifices for longer-term benefit.
Whereas model-based RL with tree search (for example) has an explicit plan: at timestep T, it has an explicit representation of what it’s planning to do at timesteps T+1, T+2, ….
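For contrast, here is a minimal sketch of explicit planning over a learned world model (the `model` function and its signature are assumptions for illustration). Note that it returns not just a value but a concrete sequence of intended actions for the next several timesteps:

```python
def plan(state, actions, model, depth, gamma=0.95):
    """Depth-limited search over an assumed learned model.

    `model(state, action)` is assumed to return (next_state, reward).
    Returns (best_value, best_action_sequence): an explicit plan, i.e. a
    concrete list of actions intended for timesteps T+1, T+2, ...
    """
    if depth == 0:
        return 0.0, []
    best_value, best_seq = float("-inf"), []
    for a in actions:
        next_state, reward = model(state, a)
        future_value, future_seq = plan(next_state, actions, model, depth - 1, gamma)
        value = reward + gamma * future_value
        if value > best_value:
            best_value, best_seq = value, [a] + future_seq
    return best_value, best_seq
```

Because the rollout happens at decision time, the planner can evaluate branches of the state tree it has never experienced, which is where the advantage in novel circumstances comes from.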
Humans are able to make explicit plans too, although human planning doesn’t look like one-timestep-at-a-time.
Sure, you can consider the TD-style unrolling in model-free RL a sort of implicit planning, but it’s not really consequentialist in most situations, as it can’t dynamically explore new relevant expansions of the state tree the way planning can. Or you could consider planning as a dynamic few-shot extension for quickly learning/updating the decision function.
Human planning is sometimes explicit, timestep by timestep (when playing certain board games, for example), when that is what efficient planning demands; but in the more general case human planning uses more complex approximations that jump more freely across spatio-temporal approximation hierarchies.
> Just curious—how much time have you invested in the DL literature vs LW/sequences/safety?
Prior to several months ago I had mostly read DL/ML literature, but recently I’ve been reading virtually only alignment literature.
> One thing that consistently infuriates me is the extent to which the AI-safety community has invented its own terminology/ontology that is largely at odds with DL/ML.
I actually think there are very good reasons the AI-safety community uses different terms (not that we know the right terms/abstractions at the moment). I won’t get into a full argument for this, but a few reasons:
Alignment is generally trying to work with high intelligence regimes where concepts like ‘intent’ are better specified.
Often, things are presented more generally than just standard ML.
> The utility functions of human children aren’t ‘perfectly inner aligned’ with that of their parents, but human-level alignment would probably be good enough. Don’t let perfect be the enemy of the good.
Children aren’t superintelligent AGIs for which instrumental convergence applies.
> ‘consequentialist agent’ mostly maps to model-based RL agent
For current capability regimes, sure. In the future? Not so clear. Consequentialist is a more general idea.
> The utility functions of human children aren’t ‘perfectly inner aligned’ with that of their parents, but human-level alignment would probably be good enough. Don’t let perfect be the enemy of the good.
> Children aren’t superintelligent AGIs for which instrumental convergence applies.
A genealogical lineage of agents creating/training agents is isomorphic to a single agent undergoing significant self-modification.
> For current capability regimes, sure. In the future? Not so clear. Consequentialist is a more general idea.
How is ‘consequentialist’ more general? Do you have a practical example of a consequentialist agent that is different from a general model-based RL agent?
> The utility functions of human children aren’t ‘perfectly inner aligned’ with that of their parents, but human-level alignment would probably be good enough. Don’t let perfect be the enemy of the good.
> Children aren’t superintelligent AGIs for which instrumental convergence applies.
I understand this exchange as Ryan saying “the goals of AGI must be a perfect match to what we want”, and Jacob as replying “you can’t literally mean perfect, as in not even off by one part per googol, e.g. we bequeath the universe to the next generation despite knowing that they won’t share our values”, and then Ryan is doubling down “Yes I mean perfect”.
If so, I’m with Jacob. For one thing, if we perfectly nail the AGI’s motivation in regards to transparency, honesty, corrigibility, helpfulness, keeping humans in the loop, etc., but we mess up other aspects of the AGI’s motivation, then the AGI should help us identify and fix the problem. For another thing, we’re kinda hazy on what future we want in the first place—I don’t think there’s an infinitesimal target that we need parts-per-googol accuracy to hit. For yet another thing, I do agree with Jacob that if we consider the fact “We’re OK bequeathing the universe to the next generation, even though we don’t really know what they’ll do with it” (assuming you are in fact on board with that, as I think I am and most people are, although I suppose one could say it’s just status quo bias), I think that’s a very interesting datapoint worth thinking about, and again hints that there may be approaches that don’t require parts-per-googol accuracy.
Normally in this kind of discussion I would be arguing the other side—I do think it will be awfully hard and perhaps impossible to get an AGI to wind up with motivations that are not catastrophically bad for humanity—but “it must be literally perfect” is going too far!
This argument about whether human-level alignment is sufficient is at least a decade old. I suspect one issue is that inter-human alignment is high variance. The phrase “human-level alignment” could conjure up anything from Gandhi to Hitler, from Bob Ross to Jeffrey Dahmer. If you model that as an adversarial draw, it’s pretty bad. As a random draw, it may be better than default unaligned, but still high risk. I tend to view it as an optimistic draw, based on reverse engineering human altruism to control/amplify it.
I thought LW/MIRI was generally pessimistic on human-level alignment, but Rob Bensinger said “If we had AGI that were merely as aligned as a human, I think that would immediately eliminate nearly all of the world’s existential risk.” in this comment, which was an update for me.
So as a result I tend to see brain reverse engineering as much higher priority than it otherwise would deserve, for both inspiring artificial empathy/altruism and also shortening the timeframe until uploading.
> I tend to see brain reverse engineering as much higher priority than it otherwise would deserve, for both inspiring artificial empathy/altruism and also shortening the timeframe until uploading
My take is that the neocortex (and other bits) are running a quasi-general-purpose learning algorithm, and the hypothalamus and brainstem are “steering” that learning algorithm by sending multiple reward signals and other supervisory signals. (The latter are also doing lots of other species-specific instinct stuff that doesn’t interact with the learning algorithms, like regulating heart rate.)
So if we reverse-engineer the neocortex learning algorithm first, before learning anything new about the hypothalamus & brainstem, I think that we’d wind up with a recipe for making an AGI with radically alien motivations, but we still wouldn’t know how to make an AGI with human-like empathy / altruism.
I think there’s circuitry somewhere in the hypothalamus & brainstem that works in conjunction with the learning algorithms to create social instincts, and I’m strongly in favor of figuring out how those circuits work, and that’s one of the things that I’m trying to do myself. :-)
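Purely as an illustration of the two-subsystem picture sketched above, here is a toy code rendering; every class, method, and signal name is an assumption made up for exposition, not a claim about actual brain circuitry:

```python
class SteeringSubsystem:
    """Stands in for hypothalamus/brainstem: fixed, innate circuitry that emits
    reward and other supervisory signals (non-learning housekeeping like
    heart-rate regulation is omitted here)."""
    def supervisory_signals(self, sensed_state):
        reward = 1.0 if sensed_state.get("sweet_taste") else 0.0  # toy innate drive
        return {"reward": reward}

class LearningSubsystem:
    """Stands in for the neocortex-like quasi-general-purpose learner, which
    acquires its values only via the signals the steering subsystem sends."""
    def __init__(self):
        self.learned_values = {}
    def update(self, sensed_state, signals):
        key = tuple(sorted(sensed_state.items()))
        old = self.learned_values.get(key, 0.0)
        # Learned valuation drifts toward whatever the innate circuitry rewards.
        self.learned_values[key] = old + 0.1 * (signals["reward"] - old)

# One loop iteration: the steering subsystem evaluates, the learner adapts.
steering, learner = SteeringSubsystem(), LearningSubsystem()
state = {"sweet_taste": True}
learner.update(state, steering.supervisory_signals(state))
```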
Yes, I concur. The cortex seems to get 90% or more of the attention in neuroscience, but those smaller, more ancient central brain structures probably have more of the innate complexity relevant for the learning machinery. That’s on my reading list (along with some of your brain articles a friend recommended).
> I understand this exchange as Ryan saying “the goals of AGI must be a perfect match to what we want”, and Jacob as replying “you can’t literally mean perfect, as in not even off by one part per googol, e.g. we bequeath the universe to the next generation despite knowing that they won’t share our values”, and then Ryan is doubling down “Yes I mean perfect”.
Oh, no, this wasn’t what I meant. I just meant that the usage of children as an example was poor because individual children don’t have the potential to successfully seek vast power.
There certainly is a level of sufficient alignment for a purely consequentialist utility function which looks like 1−ϵ as opposed to 1. I think this ϵ is pretty low, but I reiterate, that is for ‘purely long-run consequentialists’. Note that ϵ must be exceptionally low for this sort of AI not to seek power (assuming that avoiding power seeking is what we want from the utility function; perhaps we are fine with power seeking because we have the desired consequentialist values, whatever those may be, locked in).
> If so, I’m with Jacob. For one thing, if we perfectly nail the AGI’s motivation in regards to transparency, honesty, corrigibility, helpfulness, keeping humans in the loop, etc., but we mess up other aspects of the AGI’s motivation, then the AGI should help us identify and fix the problem
Agreed, but these aren’t consequentialist properties. At least that isn’t how I model them.
I shouldn’t have given such a vague response to the child metaphor.