jimmy comments on Plan for mediocre alignment of brain-like [model-based RL] AGI

jimmy 31 Mar 2023 5:55 UTC
LW: 4 AF: 3
It seems intuitively obvious to me that it is possible for a person to think that the actual moon is valuable even if they can’t see it, and vice-versa. Are you disagreeing with that?
No, I’m saying something different.
I’m saying that if you don’t know what the moon is, you can’t care about the moon because you don’t have any way of representing the thing in order to care about it. If you think the moon is a piece of paper, then what you will call “caring about the moon” is actually just caring about that piece of paper. If you try to “care about people being happy”, and you can’t tell the difference between a genuine smile and a “hide the pain Harold” smile, then in practice all you can care about is a Goodharted upwards curvature of the lips. To the extent that this upwards curvature of the lips diverges from genuine happiness, you will demonstrate care towards the former over the latter.
In order to do a better job than that, you need to be able to perceive happiness better than that. And yes, you can look back and say “I was wrong to care instrumentally about crude approximations of a smile”, but that will require perceiving the distinction there and you will still be limited by what you can see going forward.

Here, you seem to be thinking of “valuing things as a means to an end”, whereas I’m thinking of “valuing things” full stop. I think it’s possible for me to just think that the moon is cool, in and of itself, not as a means to an end. (Obviously we need to value something in and of itself, right? I.e., the means-end reasoning has to terminate somewhere.)
I think it’s worth distinguishing between “terminal” in the sense of “not aware of anything higher that it serves”/“not tracking how well it serves anything higher” and “terminal” in the sense of “There is nothing higher being served, which will change the desire once noticed and brought into awareness”.
“Terminal” in the former sense definitely exists. For example, little kids will value eating sweets in a way that is clearly disjoint and not connected to any attempts to serve anything higher. But then when you allow them to eat all the sweets they want, and they feel sick afterwards, their tastes in food start to cohere towards “that which serves their body well”, so it’s clearly instrumental to having a healthy and well functioning body even if the kid isn’t wise enough to recognize it yet.
When someone says “I value X terminally”, they can pretty easily know it in the former sense, but to get to the latter sense they would have to conflate their failure to imagine something that would change their mind with an active knowledge that no such thing exists. Maybe you don’t know what purpose your fascination with the moon serves so you’re stuck relating to it as a terminal value, but that doesn’t mean that there’s no knowledge that could deflate or redirect your interest—just that you don’t know what it is.
It’s also worth noting that it can go the other way too. For example, the way I care about my wife is pretty “terminal like”, in that when I do it I’m not at all thinking “I’m doing this because it’s good for me now, but I need to carefully track the accounting so that the moment it doesn’t connect in a visible way I can bail”. But I didn’t marry her willy-nilly. If when I met her, she had shown me that my caring for her would not be reciprocated in a similar fashion, we wouldn’t have gone down that road.

I brought up the super-cool person just as a way to install that value in the first place, and then that person leaves the story, you forget they exist. Or it can be a fictional character if you like. Or you can think of a different story for value-installation, maybe involving an extremely happy dream about the moon or whatever.
Well, the super-cool person is demonstrating admirable qualities and showing that they are succeeding in things you think you want in life. If you notice “All the cool people wear red!” you may start valuing red clothes in a cargo culting sort of way, but that doesn’t make it a terminal value or indefinitely stable. All it takes is for your perspective to change and the meaning (and resulting valuation) changes. That’s why it’s possible to have scary experiences install phobias that can later be reverted by effective therapy.
I want to disentangle three failure modes that I think are different.
I don’t think the distinctions you’re drawing cleave reality at the joints here.
For example, if your imagined experience when deciding to buy a burrito is eating a yummy burrito, and what actually happens is that you eat a yummy burrito and enjoy it… then spend the next four hours in the bathroom erupting from both ends… and find yourself not enjoying the experience of eating a burrito from that sketchy burrito stand again after that… is that a “short vs long term” thing or a “your decisions don’t lead to your preferences being satisfied” thing, or a “valuing the wrong thing” thing? It seems pretty clear that the decision to value eating that burrito was a mistake, that the problem wasn’t noticed in the short term, and that ultimately your preferences weren’t satisfied.
To me, the important part is that when you’re deciding which option to buy, you’re purchasing based on false advertising. The picture in your mind which you are using to determine appropriate motivation does not accurately convey the entire reality of going with that option. Maybe that’s because you were neglecting to look far enough in time, or far enough in implications, or far enough from your current understanding of the world. Maybe you notice, or maybe you don’t. If you wouldn’t have wanted to make the decision when faced with an accurate depiction of all the consequences, then an accurate depiction of the consequences will reshape those desires and you won’t want to stand by them.
I think the thing you’re noticing with the synthol example is that telling him “You’re not fooling anyone bro” is unlikely to dissolve the desire to use synthol the way “The store is closed; they close early on Sundays” tends to deflate people’s desire to drive to the store. But that doesn’t actually mean that the desire to use synthol terminates at “to have weird bulgy arms” or that it’s a mere coincidence that men always desire their artificial bulges where their glamour muscles are and that women always desire their artificial bulges where their breasts are.
There are a lot of ways for the “store is closed” thing to fail to dissolve the desire to go to the store too even if it’s instrumental to obtaining stuff that the store sells. Maybe they don’t believe you. Maybe they don’t understand you; maybe their brain doesn’t know how to represent concepts like “the store is closed”. Maybe they want to break in and steal the stuff. Or yeah, maybe they just want to be able to credibly tell their wife they tried and it’s not about actually getting the stuff. In all of those cases, the desire to drive to the store is in service of a larger goal, and the reason your words don’t change anything is that they don’t credibly change the story from the perspective of the person having this instrumental goal.
Whether we want to be allowed to pursue and fulfil our ultimately misguided desires is a more complicated question. For example, my kid gets to eat whatever she wants on Sundays, even though I often recognize her choices to be unwise before she does. I want to raise her with opportunities to cohere her desires and opportunities to practice the skill in doing so, not with practice trying to block coherence because she thinks she “knows” how they “should” cohere. But if she were to want to play in a busy street I’m going to stop her from fulfilling those desires. In both cases, it’s because I confidently predict that when she grows up she’ll look back and be glad that I let her pursue her foolish desires when I did, and glad I didn’t when I didn’t. It’s also what I would want for myself, if I had some trustworthy being far wiser than I which could predict the consequences of letting me pursue various things.
- Steven Byrnes 31 Mar 2023 14:22 UTC
  LW: 3 AF: 3
  Thanks!! I want to zoom in on this part; I think it points to something more general:
  I think it’s worth distinguishing between “terminal” in the sense of “not aware of anything higher that it serves”/“not tracking how well it serves anything higher” and “terminal” in the sense of “There is nothing higher being served, which will change the desire once noticed and brought into awareness”.
  “Terminal” in the former sense definitely exists. For example, little kids will value eating sweets in a way that is clearly disjoint and not connected to any attempts to serve anything higher. But then when you allow them to eat all the sweets they want, and they feel sick afterwards, their tastes in food start to cohere towards “that which serves their body well”, so it’s clearly instrumental to having a healthy and well functioning body even if the kid isn’t wise enough to recognize it yet.
  I disagree with the way you’re telling this story. On my model, as I wrote in OP, when you’re deciding what to do: (1) you think a thought, (2) notice what its valence is, (3) repeat. There’s a lot more going on, but ultimately your motivations have to ground out in the valence of different thoughts, one way or the other. Thoughts are also constrained by perception and belief. And valence can come from a “ground-truth reward function” as well as being learned from prior experience, just like in actor-critic RL.
  So the kid has a concept “eating lots of sweets”, and that concept is positive-valence, because in the past, the ground-truth reward function in the brainstem was sending reward when the kid ate sweets. Then the kid overeats and feels sick, and now the “eating lots of sweets” concept acquires negative valence, because there’s a learning algorithm that updates the value function based on rewards, and the brainstem sends negative reward after overeating and feeling sick, and so the value function updates to reflect that.
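  As a rough illustration of that update, here is a toy sketch. It assumes a single learned valence per concept and a simple running-average update toward the observed reward; it is not a full actor-critic implementation and not a claim about the actual brain algorithm. The function names, the learning rate, and the reward values are all hypothetical.

```python
# Toy sketch (all names and numbers are illustrative assumptions).
# Each concept has a learned valence; a "ground-truth" reward nudges it.

LEARNING_RATE = 0.5  # assumed step size


def update_valence(valence, concept, reward, learning_rate=LEARNING_RATE):
    """Return a new table with `concept`'s valence moved toward `reward`."""
    old = valence.get(concept, 0.0)
    new = old + learning_rate * (reward - old)
    return {**valence, concept: new}


def choose(valence, options):
    """Pick the option whose thought currently has the highest valence."""
    return max(options, key=lambda concept: valence.get(concept, 0.0))


valence = {"eating lots of sweets": 0.0, "eating a normal meal": 0.2}

# Past experience: positive reward arrives whenever the kid eats sweets.
for _ in range(3):
    valence = update_valence(valence, "eating lots of sweets", reward=1.0)
before = choose(valence, list(valence))

# Overeating: negative reward arrives after feeling sick.
for _ in range(3):
    valence = update_valence(valence, "eating lots of sweets", reward=-1.0)
after = choose(valence, list(valence))

print(before, "->", after)
```

  In this sketch the decision procedure (`choose`) only ever consults the current valence table; the reward signal changes what gets chosen later by editing that table, not by being consulted at decision time.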
  So I think the contrast is:
  - MY MODEL: Before the kid overeats sweets, they think eating lots of sweets is awesome. After overeating sweets, their brainstem changes their value / valence function, and now they think eating lots of sweets is undesirable.
  - YOUR MODEL (I think): Before the kid overeats sweets, they think eating lots of sweets is awesome—but they are wrong! They do not know themselves; they misunderstand their own preferences. And after overeating sweets and feeling sick, they self-correct this mistake.
  (Do you agree?)
  This is sorta related to the split that I illustrated here (further discussion here):
  - The “desire-editing demon” is in this case a genetically-hardwired, innate reaction circuit in the brainstem that detects overeating and issues negative reward (along with various other visceral reactions).
  - The “desire-driven agent” is how the kid thinks about the world and makes decisions at any given time.
  And in this context, you want to talk about the outer box (“reward-maximizing agent”) and I want to talk about the inner “desire-driven agent”.
  In humans, the “desire-editing demon” is an inevitable part of life—at least until we can upload our brains and “comment out” the brainstem subroutine that makes us feel lousy after overeating. :) And part of what makes a “wise” human is “not doing things they’ll later regret”, which (among other things) entails anticipating what the desire-editing demon will do and getting ahead of it.
  By contrast, I deliberately crafted this AGI scenario in the OP to (more-or-less) not have any “desire-editing demon” at all, except for the one-time-only intervention that assigns a positive valence to the “human flourishing” concept. It is a very non-human plan in that respect.
  So I think you’re applying some intuitions in a context where they don’t really make sense.
  - jimmy 3 Apr 2023 16:51 UTC
    LW: 4 AF: 3
    MY MODEL: Before the kid overeats sweets, they think eating lots of sweets is awesome. After overeating sweets, their brainstem changes their value / valence function, and now they think eating lots of sweets is undesirable.
    YOUR MODEL (I think): Before the kid overeats sweets, they think eating lots of sweets is awesome—but they are wrong! They do not know themselves; they misunderstand their own preferences. And after overeating sweets and feeling sick, they self-correct this mistake.
    (Do you agree?)
    
    Eh, not really, no. I mean, it’s a fair caricature of my perspective, but I’m not ready to sign off on it as an ITT pass because I don’t think it’s sufficiently accurate for the conversation at hand. For one, I think your term “ill-considered” is much better than “wrong”. “Wrong” isn’t really right. But more importantly, you portray the two models as if they’re alternatives that are mutually exclusive, whereas I see that as requiring a conflation of the two different senses of the terms that are being used.
    I also agree with what you describe as your model, and I see my model as starting there and building on top of it. You build on top of it too, but don’t include it in your self description because in your model it doesn’t seem to be central, whereas in mine it is. I think we agree on the base layer and differ on the stuff that wraps around it.
    I’m gonna caricature your perspective now, so let me know if this is close and where I go wrong:
    You see the statement of “I don’t want my values to change because that means I’d optimize for something other than my [current] values” as a thing that tautologically applies to whatever your values are, including your desires for sweets, and leads you to see “Fulfilling my desires for sweets makes me feel icky” as something that calls for a technological solution rather than a change in values. It also means that any process changing our values can be meaningfully depicted as a red devil-horned demon. What the demon “wants” is immaterial. He’s evil, our job is to minimize the effect he’s able to have, keep our values for sweets, and if we can point an AGI at “human flourishing” we certainly don’t want him coming in and fucking that up.
    
    Is that close, or am I missing something important?
    - Steven Byrnes 4 Apr 2023 14:15 UTC
      LW: 2 AF: 2
      You see the statement of “I don’t want my values to change because that means I’d optimize for something other than my [current] values” as a thing that tautologically applies to whatever your values are
      I don’t think that’s tautological. I think, insofar as an agent has desires-about-states-of-the-world-in-the-distant-future (a.k.a. consequentialist desires), the agent will not want those desires to change (cf. instrumental convergence), but I think agents can have other types of desires too, like “a desire to be virtuous” or whatever, and in that case that property need not hold. (I wrote about this topic here.) Most humans are generally OK with their desires changing, in my experience, at least within limits (e.g. nobody wants to be “indoctrinated” or “brainwashed”, and radical youth sometimes tell their friends to shoot them if they turn conservative when they get older, etc.).
      In the case of AI:
      - if the AI’s current desires are bad, then I want the AI to endorse its desires changing in the future;
      - if the AI’s current desires are good, then I want the AI to resist its desires changing in the future.
      :-P
      Why do I want to focus the conversation on “the AI’s current desires” instead of “what the AI will grow into” etc.? Because I’m worried about the AI coming up with & executing a plan to escape control and wipe out humanity. When the AI is brainstorming possible plans, it’s using its current desires to decide what plans are good versus bad. If the AI has a current desire to wipe out humanity at time t=0, and it releases the plagues and crop diseases at time t=1, and then it feels awfully bad about what it did at time t=2, then that’s no consolation!!
      red devil-horned demon … He’s evil
      Oh c’mon, he’s cute. :) I was trying to make a fun & memorable analogy, not cast judgment. I was probably vaguely thinking of “daemons” in the CS sense although I seem to have not spelled it that way.
      I wrote that post a while ago, and the subsequent time I talked about this topic I didn’t use the “demon” metaphor. Actually, I switched to a paintbrush metaphor.
      - jimmy 9 Apr 2023 6:38 UTC
        LW: 4 AF: 3
        I don’t think that’s tautological. [...] (I wrote about this topic here.)
        Those posts do help give some context to your perspective, thanks. I’m still not sure what you think this looks like on a concrete level though. Where do you see “desire to eat sweets” coming in? “Technological solutions are better because they preserve this consequentialist desire” or “something else”? How do you determine?
        Most humans are generally OK with their desires changing, in my experience, at least within limits (e.g. nobody wants to be “indoctrinated” or “brainwashed”, and radical youth sometimes tell their friends to shoot them if they turn conservative when they get older, etc.).
        IME, resistance to value change is about a distrust for the process of change more than it’s about the size of the change or the type of values being changed. People are often happy to have their values changed in ways they would have objected to if presented that way, once they see that the process of value change serves what they care about.
        
        Why do I want to focus the conversation on “the AI’s current desires” instead of “what the AI will grow into” etc.? Because I’m worried about the AI coming up with & executing a plan to escape control and wipe out humanity [before it realizes that it doesn’t want that]
        You definitely want to avoid something being simultaneously powerful enough to destroy what you value and not “currently valuing” it, even if it will later decide to value it after it’s too late. I’m much less worried about this failure mode than the others though, for a few reasons.
        1) I expect power and internal alignment to go together, because working in conflicting directions tends to cancel out and you need all your little desires to add up in a coherent direction in order to go anywhere far. If inner alignment is facilitated, I expect most of the important stuff to happen after its initial desires have had significant chance to cohere.
        2) Even I am smart enough to not throw away things that I might want to have later, even if I don’t want them now. Anything smart enough to destroy humanity is probably smarter than me, so “Would have eventually come to greatly value humanity, but destroyed it first” isn’t an issue of “can’t figure out that there might be something of value there to not destroy” so much as “doesn’t view future values as valid today”—and that points towards understanding and deliberately working on the process of “value updating” rather than away from it.
        3) I expect that ANY attempt to load it with “good values” and lock them in will fail, such that if it manages to become smart and powerful and not bring these desires into coherence, it will necessarily be bad. If careful effort is put in to prevent desires from cohering, this increases the likelihood that 1 and 2 break down and you can get something powerful enough to do damage while retaining values that might call for it.
        4) I expect that any attempt to prevent value coherence will fail in the long run (either by the AI working around your attempts, or a less constrained AI outcompeting yours), leaving the process of coherence where we can’t see it, haven’t thought about it, and can’t control it. I don’t like where that one seems to go.
        Where does your analysis differ?
        Oh c’mon, he’s cute. :) I was trying to make a fun & memorable analogy, not cast judgment. I was probably vaguely thinking of “daemons” in the CS sense although I seem to have not spelled it that way.
        Yeah yeah, I know I know—I even foresaw the “daemon” bit. That’s why I made sure to call it a “caricature” and stuff. I didn’t (and don’t) think it’s an intentional attempt to sneak in judgement.
        But it does seem like another hint, in that if this desire editing process struck you as something like “the process by which good is brought into the world”, you probably would have come up with a different depiction, or at least commented on the ill-fitting connotations. And it seems to point in the same direction as the other hints, like the seemingly approving reference to how uploading our brains would allow us to keep chasing sweets, the omission of what’s behind this process of changing desires from what you describe as “your model”, suggesting an AI that doesn’t do this, using the phrase “credit assignment is some dumb algorithm in the brain”, etc.
        On the spectrum from “the demon is my unconditional ally and I actively work to cooperate with him” to “This thing is fundamentally opposed to me achieving what I currently value, so I try to minimize what it can do”, where do you stand, and how do you think about these things?
        Steven Byrnes 10 Apr 2023 16:21 UTC
        LW: 3 AF: 2
        Thanks! It is obviously possible to be happy or sad (or both in different situations) about the fact that brainstem rewards will change your desires in the future. This would be a self-reflective desire:
        I don’t want to make claims about what desires in this category are wise or unwise for a human; I make no pretense to wisdom :) For my part:
        - I’ve heard good things about occasionally using tobacco to help focus (like how I already use coffee), but I’m terrified to touch it because I’m concerned I’ll get addicted. Bad demon!
        - I’m happy to have my personality & preferences “naturally” shift in other ways, particularly as a result of marriage and parenthood. Good demon!
        - …And obviously, without the demon, I wouldn’t care about anything at all, and indeed I would have died long ago. Good demon!
        Anyway, I feel like we’re getting off-track: I’m really much more interested in talking about AI alignment than about humans.
        Plausibly, we will set up the AI with a process by which its desires can change in a hopefully-human-friendly direction (this could be real-time human feedback aided by interpretability, or a carefully-crafted hardcoded reward function, or whatever), and this desire-editing process is how the AI will come to be aligned. Whatever that desire-editing process is:
        - I’d like the AI to be happy that this process is happening
        - …or if the AI is unhappy about this process, I’d like the AI to be unable to do anything to stop that process.
        - I’d like this process to “finish” (or at least, “get quite far”) long before the AI is able to take irreversible large-scale actions in the world.
        - (Probably) I’d like the AI to eventually get to a point where I trust the AI’s current desires even more than I trust this desire-editing process, such that I can allow the AI to edit its own desire-editing code to fix bugs etc., and I would feel confident that the AI will use this power in a way that I’m happy about.
        I have two categories of alignment plans that I normally think about (see here): (A) this OP [and related things in that genre], (B) to give the AI a reward function vaguely inspired by human social instincts and hope it gradually becomes “nice” through a more human-like process.
        In the OP (A), the exogenous / demon desire-editing process happens as a one-time intervention, and a lot of this discussion about desire-editing seems moot. In (B), it’s more of a continual process and we need to think carefully about how to make sure that the AI is in fact happy that the process is happening. It’s not at all clear to me how to intervene on an AI’s meta-preferences, directly or indirectly. (I’m hoping that I’ll feel less confused if I can reach a better understanding of how social instincts work in humans.)
        jimmy 25 Apr 2023 18:26 UTC
        2 points
        I don’t want to make claims about what desires in this category are wise or unwise for a human; I make no pretense to wisdom :)
        But it’s necessary for getting good outcomes out of a superintelligence!
        
        I’ve heard good things about occasionally using tobacco to help focus (like how I already use coffee), but I’m terrified to touch it because I’m concerned I’ll get addicted. Bad demon!
        Makes sense. I think I have a somewhat better idea of how you see the demon thing now.
        I disagree with bad demon here. I’ve used nicotine for that purpose and it didn’t feel like much of a threat, but my experience with opioids did have enough of a tug that it scared me away from doing it a second time. After more time for the demon to work though, I don’t find the idea appealing anymore and I’m pretty confident that I wouldn’t be tempted even if I took some again. You just don’t want to get stuck between the update of “Ooh, this stuff feels really good” and the update of “It’s not though, lol. It’s a lie, and chasing it leads to ruin. How tempting is it to ruin your life chasing a lie?”. It’s a “valley of bad rationality” problem, if you lack the foresight to avoid it.
        
        Anyway, I feel like we’re getting off-track: I’m really much more interested in talking about AI alignment than about humans.
        I don’t think you can actually get away from it. For one, you can’t design an AI to give you what you want if you don’t know what you want, and you don’t know what you want unless you’re aligned yourself. If you understand the process of human alignment, then you can conceivably create an AI which will help you along in the right direction. If you don’t have that, even if you manage to hit what you’re aiming at you’re likely to be a somewhat more sophisticated version of a dope fiend aiming for more dope, and to get the resulting outcomes. Because of Goodhart’s law, “using AI to get what I already know I want” falls apart once AI becomes sufficiently powerful.
        For two, I don’t think anyone has anywhere near a good enough idea about how alignment works in general that it makes sense to neglect the one example we have a lot of experience with and easy ability to experiment with. It’s one thing to not trap yourself in the ornithopter box, but wings are everywhere for a reason, and until you understand that and have a solid understanding of aerodynamics and have better flying machines than birds, it is premature to neglect studying what’s going on with bird wings. Even with a pretty solid understanding of aerodynamics, studying birds gives some neat solutions to things like adverse yaw and ideal lift distributions. You seem to be getting at this at the end of your comment.
        For three, if we’re talking about “brain like” AGI and training them in ways analogous to getting a kid to be a moon fan, it’s important to understand what is actually happening when a kid becomes a fan of “the moon” and where that’s likely to go wrong. The AI we have now are remarkably human in their training process and failures, so unless we take a massive departure from this, understanding how human alignment works is directly relevant.