Q: Wouldn’t the AGI self-modify to make itself falsely believe that there’s a lot of human flourishing? Or that human flourishing is just another term for hydrogen?
A: No, for the same reason that, if a supervillain is threatening to blow up the moon, and I think the moon is super-cool, I would not self-modify to make myself falsely believe that “the moon” is a white circle that I cut out of paper and taped to my ceiling. [...] I’m using my current value function to evaluate the appeal (valence) of thoughts.
It’s worth noting that humans fail at this all the time.
Q: Wait hang on a sec. [...] how do you know that those neural activations are really “human flourishing” and not “person saying the words ‘human flourishing’”, or “person saying the words ‘human flourishing’ in a YouTube video”, etc.?
Humans screw this up all the time too, and these two failure modes are related.
You can’t value what you can’t perceive, and when your only ability to perceive “the moon” is the image you see when you look up, then that is what you will protect, and that white circle of paper will do it for you.
For an unusually direct visual example, bodybuilding is supposedly about building a muscular body, but sometimes people will use synthol to create the false appearance of muscle, in a way equivalent to taping a paper cutout to the ceiling and calling it a “moon”. The fact that it doesn’t even look a little like real muscle hints that it’s probably a genuine failure to notice what they want to care about, rather than simply being happy to fool other people into thinking they’re strong.
For a less direct but more pervasive example, people will value “peace and harmony” within their social groups, but due to myopia this often turns into short-sighted avoidance of conflict and behaviors that make conflict less solvable, yielding less peace and harmony.
With enough experience, you might notice that protecting the piece of paper on the ceiling doesn’t get that super cool person to approve of your behavior, and you might learn to value something more tied to the actual moon. Just as, with more experience consuming excess sweets, you might learn that the way you feel afterward doesn’t seem to go with getting what your body wanted, and you might find your tastes shifting in wiser directions.
But people aren’t always that open to this change.
If I say “Your paper cutout isn’t the moon, you fool”, listening to me means you’re going to have to protect a big rock a bazillion miles beyond your reach, and you’re more likely to fail that than protecting the paper you put up. And guess what value function you’re using to decide whether to change your values here? Yep, that one saying that the piece of paper counts. You’re offering less chance of having “a moon”, and relative to the current value system which sees a piece of paper as a valid moon, that’s a bad deal. As a result, the shallowness and mis-aimedness of the value gets protected.
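This self-protecting dynamic can be made concrete. Here is a minimal sketch (all names and numbers hypothetical, not anyone’s actual model): an agent scores proposed value changes using its *current* value function, so a change that lowers expected value under the current values gets rejected, even if the new values are “truer” by some outside standard.

```python
# Illustrative sketch: proposed value changes are themselves evaluated
# with the current value function, so a shallow value protects itself.

def expected_value(values, plan):
    """Score a plan's predicted outcomes under a given value function."""
    return sum(values.get(outcome, 0.0) * p for outcome, p in plan.items())

# Current values: the paper cutout fully counts as "the moon".
current_values = {"paper moon intact": 1.0, "real moon intact": 1.0}

# Plan A: keep protecting the paper on the ceiling (near-certain success).
keep_paper = {"paper moon intact": 0.99}

# Plan B: adopt values that only count the real moon (hard to protect).
protect_real_moon = {"real moon intact": 0.2}

# Crucially, the comparison uses current_values -- the very values at issue.
best = max([keep_paper, protect_real_moon],
           key=lambda plan: expected_value(current_values, plan))
assert best is keep_paper  # the mis-aimed value wins and gets preserved
```

The point of the toy numbers is only that, judged from inside the current value system, abandoning the paper cutout looks like “less chance of having a moon”, so the shallowness is preserved.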
In practice, it happens all the time. Try explaining to someone that what they’re calling “peace and harmony values” is really just cowardice and is actively impeding work towards peace and harmony, and see how easy it is, for example.
It’s true that “A plan is a type of thought, and I’m using my current value function to evaluate the appeal (valence) of thoughts” helps protect well formed value systems from degenerating into wireheading, but it also works to prevent development into values which preempt wireheading, and we tend not to be so fully developed that fulfilling our current values excellently wouldn’t constitute wireheading of some form. And it’s also the case that when stressed, people will sometimes cower away from their more developed goals (“Actually, the moon is a big rock out in space...”) and cling to their shallower and easier to fulfill goals (“This paper is the moon. This paper is the moon...”). They’ll want not to, but it’ll happen all the same when there’s enough pressure to.
Sorting out how to best facilitate this process of “wise value development” so as to dodge these failure modes strikes me as important.
Thanks! I want to disentangle three failure modes that I think are different.
(Failure mode A) In the course of executing the mediocre alignment plan of the OP, we humans put a high positive valence on “the wrong” concept in the AGI (where “wrong” is defined from our human perspective). For example, we put a positive valence on the AGI’s concept of “person saying the words ‘human flourishing’ in a YouTube video” when we meant to put it on just “human flourishing”.
I don’t think there’s really a human analogy for this. You write “bodybuilding is supposedly about building a muscular body”, but, umm, says who? People have all kinds of motivations. If Person A is motivated to have a muscular body, and Person B is motivated to have giant weird-looking arms, then I don’t want to say that Person A’s preferences are “right” and Person B’s are “wrong”. (If Person B were my friend, I might gently suggest to them that their preferences are “ill-considered” or “unwise” or whatever, but that’s different.) And then if Person B injects massive amounts of synthol, that’s appropriate given their preferences. (Unless Person B also has a preference for not getting a heart attack, of course!)
(Failure mode B) The AGI has a mix of short-term preferences and long-term preferences. It makes decisions driven by its short-term preferences, and then things turn out poorly as judged by its long-term preferences.
This one definitely has a human analogy. And that’s how I’m interpreting your “peace and harmony” example, at least in part.
Anyway, yes this is a failure mode, and it can happen in humans, and it can also happen in our AGI, even if we follow all the instructions in this OP.
(Failure mode C) The AGI has long-term preferences but, due to ignorance / confusion / etc., makes decisions that do not lead to those preferences being satisfied.
This is again a legit failure mode both for humans and for an AGI aligned as described in this OP. I think you’re suggesting that the “peace and harmony” thing has some element of this failure mode too, which seems plausible.
You can’t value what you can’t perceive, and when your only ability to perceive “the moon” is the image you see when you look up, then that is what you will protect, and that white circle of paper will do it for you.
I’m not sure where you’re coming from here. If I want to care about the well-being of creatures outside my lightcone, who says I can’t?
It seems intuitively obvious to me that it is possible for a person to think that the actual moon is valuable even if they can’t see it, and vice-versa. Are you disagreeing with that?
With enough experience, you might notice that protecting the piece of paper on the ceiling doesn’t get that super cool person to approve of your behavior
Here, you seem to be thinking of “valuing things as a means to an end”, whereas I’m thinking of “valuing things” full stop. I think it’s possible for me to just think that the moon is cool, in and of itself, not as a means to an end. (Obviously we need to value something in and of itself, right? I.e., the means-end reasoning has to terminate somewhere.) I brought up the super-cool person just as a way to install that value in the first place, and then that person leaves the story, you forget they exist. Or it can be a fictional character if you like. Or you can think of a different story for value-installation, maybe involving an extremely happy dream about the moon or whatever.
It seems intuitively obvious to me that it is possible for a person to think that the actual moon is valuable even if they can’t see it, and vice-versa. Are you disagreeing with that?
No, I’m saying something different.
I’m saying that if you don’t know what the moon is, you can’t care about the moon because you don’t have any way of representing the thing in order to care about it. If you think the moon is a piece of paper, then what you will call “caring about the moon” is actually just caring about that piece of paper. If you try to “care about people being happy”, and you can’t tell the difference between a genuine smile and a “hide the pain Harold” smile, then in practice all you can care about is a Goodharted upwards curvature of the lips. To the extent that this upwards curvature of the lips diverges from genuine happiness, you will demonstrate care towards the former over the latter.
In order to do a better job than that, you need to be able to perceive happiness better than that. And yes, you can look back and say “I was wrong to care instrumentally about crude approximations of a smile”, but that will require perceiving the distinction there and you will still be limited by what you can see going forward.
Here, you seem to be thinking of “valuing things as a means to an end”, whereas I’m thinking of “valuing things” full stop. I think it’s possible for me to just think that the moon is cool, in and of itself, not as a means to an end. (Obviously we need to value something in and of itself, right? I.e., the means-end reasoning has to terminate somewhere.)
I think it’s worth distinguishing between “terminal” in the sense of “not aware of anything higher that it serves”/”not tracking how well it serves anything higher” and “terminal” in the sense of “There is nothing higher being served, which will change the desire once noticed and brought into awareness”.
“Terminal” in the former sense definitely exists. For example, little kids will value eating sweets in a way that is clearly disjoint and not connected to any attempts to serve anything higher. But then when you allow them to eat all the sweets they want, and they feel sick afterwards, their tastes in food start to cohere towards “that which serves their body well”—so it’s clearly instrumental to having a healthy and well-functioning body even if the kid isn’t wise enough to recognize it yet.
When someone says “I value X terminally”, they can pretty easily know it in the former sense, but to get to the latter sense they would have to conflate their failure to imagine something that would change their mind with an active knowledge that no such thing exists. Maybe you don’t know what purpose your fascination with the moon serves so you’re stuck relating to it as a terminal value, but that doesn’t mean that there’s no knowledge that could deflate or redirect your interest—just that you don’t know what it is.
It’s also worth noting that it can go the other way too. For example, the way I care about my wife is pretty “terminal like”, in that when I do it I’m not at all thinking “I’m doing this because it’s good for me now, but I need to carefully track the accounting so that the moment it doesn’t connect in a visible way I can bail”. But I didn’t marry her willy nilly. If when I met her, she had showed me that my caring for her would not be reciprocated in a similar fashion, we wouldn’t have gone down that road.
I brought up the super-cool person just as a way to install that value in the first place, and then that person leaves the story, you forget they exist. Or it can be a fictional character if you like. Or you can think of a different story for value-installation, maybe involving an extremely happy dream about the moon or whatever.
Well, the super-cool person is demonstrating admirable qualities and showing that they are succeeding in things you think you want in life. If you notice “All the cool people wear red!” you may start valuing red clothes in a cargo culting sort of way, but that doesn’t make it a terminal value or indefinitely stable. All it takes is for your perspective to change and the meaning (and resulting valuation) changes. That’s why it’s possible to have scary experiences install phobias that can later be reverted by effective therapy.
I want to disentangle three failure modes that I think are different.
I don’t think the distinctions you’re drawing cleave reality at the joints here.
For example, if your imagined experience when deciding to buy a burrito is eating a yummy burrito, and what actually happens is that you eat a yummy burrito and enjoy it… then spend the next four hours in the bathroom erupting from both ends… and find yourself not enjoying the experience of eating a burrito from that sketchy burrito stand again after that… is that a “short vs long term” thing or a “your decisions don’t lead to your preferences being satisfied” thing, or a “valuing the wrong thing” thing? It seems pretty clear that the decision to value eating that burrito was a mistake, that the problem wasn’t noticed in the short term, and that ultimately your preferences weren’t satisfied.
To me, the important part is that when you’re deciding which option to buy, you’re purchasing based on false advertising. The picture in your mind which you are using to determine appropriate motivation does not accurately convey the entire reality of going with that option. Maybe that’s because you were neglecting to look far enough in time, or far enough in implications, or far enough from your current understanding of the world. Maybe you notice, or maybe you don’t. If you wouldn’t have wanted to make the decision when faced with an accurate depiction of all the consequences, then an accurate depiction of the consequences will reshape those desires and you won’t want to stand by them.
I think the thing you’re noticing with the synthol example is that telling him “You’re not fooling anyone bro” is unlikely to dissolve the desire to use synthol the way “The store is closed; they close early on Sundays” tends to deflate people’s desire to drive to the store. But that doesn’t actually mean that the desire to use synthol terminates at “to have weird bulgy arms”, or that it’s a mere coincidence that men always desire their artificial bulges where their glamour muscles are and that women always desire their artificial bulges where their breasts are.
There are a lot of ways for the “store is closed” thing to fail to dissolve the desire to go to the store too even if it’s instrumental to obtaining stuff that the store sells. Maybe they don’t believe you. Maybe they don’t understand you; maybe their brain doesn’t know how to represent concepts like “the store is closed”. Maybe they want to break in and steal the stuff. Or yeah, maybe they just want to be able to credibly tell their wife they tried and it’s not about actually getting the stuff. In all of those cases, the desire to drive to the store is in service of a larger goal, and the reason your words don’t change anything is that they don’t credibly change the story from the perspective of the person having this instrumental goal.
Whether we want to be allowed to pursue and fulfil our ultimately misguided desires is a more complicated question. For example, my kid gets to eat whatever she wants on Sundays, even though I often recognize her choices to be unwise before she does. I want to raise her with opportunities to cohere her desires and opportunities to practice the skill in doing so, not with practice trying to block coherence because she thinks she “knows” how they “should” cohere. But if she were to want to play in a busy street I’m going to stop her from fulfilling those desires. In both cases, it’s because I confidently predict that when she grows up she’ll look back and be glad that I let her pursue her foolish desires when I did, and glad I didn’t when I didn’t. It’s also what I would want for myself, if I had some trustworthy being far wiser than I which could predict the consequences of letting me pursue various things.
Thanks!! I want to zoom in on this part; I think it points to something more general:
I think it’s worth distinguishing between “terminal” in the sense of “not aware of anything higher that it serves”/”not tracking how well it serves anything higher” and “terminal” in the sense of “There is nothing higher being served, which will change the desire once noticed and brought into awareness”.
“Terminal” in the former sense definitely exists. For example, little kids will value eating sweets in a way that is clearly disjoint and not connected to any attempts to serve anything higher. But then when you allow them to eat all the sweets they want, and they feel sick afterwards, their tastes in food start to cohere towards “that which serves their body well”—so it’s clearly instrumental to having a healthy and well-functioning body even if the kid isn’t wise enough to recognize it yet.
I disagree with the way you’re telling this story. On my model, as I wrote in OP, when you’re deciding what to do: (1) you think a thought, (2) notice what its valence is, (3) repeat. There’s a lot more going on, but ultimately your motivations have to ground out in the valence of different thoughts, one way or the other. Thoughts are also constrained by perception and belief. And valence can come from a “ground-truth reward function” as well as being learned from prior experience, just like in actor-critic RL.
So the kid has a concept “eating lots of sweets”, and that concept is positive-valence, because in the past, the ground-truth reward function in the brainstem was sending reward when the kid ate sweets. Then the kid overeats and feels sick, and now the “eating lots of sweets” concept acquires negative valence, because there’s a learning algorithm that updates the value function based on rewards, and the brainstem sends negative reward after overeating and feeling sick, and so the value function updates to reflect that.
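The valence-update story above can be sketched in a few lines, loosely in the style of a tabular critic update from actor-critic RL (the function name and numbers are illustrative, not a claim about the brain’s actual algorithm): a concept’s learned valence is nudged toward the ground-truth reward the brainstem sends while that concept is active.

```python
# Minimal sketch of a critic-style valence update: valence moves toward
# the observed ground-truth reward by a fraction of the prediction error.

def update_valence(valence, concept, reward, lr=0.5):
    """Nudge the learned valence of `concept` toward the observed reward."""
    valence[concept] += lr * (reward - valence[concept])

valence = {"eating lots of sweets": +1.0}  # learned from past sweet rewards

# The kid overeats; the brainstem's reward function fires negatively.
for _ in range(5):
    update_valence(valence, "eating lots of sweets", reward=-1.0)

# Same kid, same concept, new valence: the sign has flipped.
assert valence["eating lots of sweets"] < 0
```

The design point is just that nothing in the update consults the kid’s beliefs about their own preferences; the valence function simply tracks the reward signal.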
So I think the contrast is:
MY MODEL: Before the kid overeats sweets, they think eating lots of sweets is awesome. After overeating sweets, their brainstem changes their value / valence function, and now they think eating lots of sweets is undesirable.
YOUR MODEL (I think): Before the kid overeats sweets, they think eating lots of sweets is awesome—but they are wrong! They do not know themselves; they misunderstand their own preferences. And after overeating sweets and feeling sick, they self-correct this mistake.
(Do you agree?)
This is sorta related to the split that I illustrated here
The “desire-editing demon” is in this case a genetically-hardwired, innate reaction circuit in the brainstem that detects overeating and issues negative reward (along with various other visceral reactions).
The “desire-driven agent” is how the kid thinks about the world and makes decisions at any given time.
And in this context, you want to talk about the outer box (“reward-maximizing agent”) and I want to talk about the inner “desire-driven agent”.
In humans, the “desire-editing demon” is an inevitable part of life—at least until we can upload our brains and “comment out” the brainstem subroutine that makes us feel lousy after overeating. :) And part of what makes a “wise” human is “not doing things they’ll later regret”, which (among other things) entails anticipating what the desire-editing demon will do and getting ahead of it.
By contrast, I deliberately crafted this AGI scenario in the OP to (more-or-less) not have any “desire-editing demon” at all, except for the one-time-only intervention that assigns a positive valence to the “human flourishing” concept. It is a very non-human plan in that respect.
So I think you’re applying some intuitions in a context where they don’t really make sense.
MY MODEL: Before the kid overeats sweets, they think eating lots of sweets is awesome. After overeating sweets, their brainstem changes their value / valence function, and now they think eating lots of sweets is undesirable.
YOUR MODEL (I think): Before the kid overeats sweets, they think eating lots of sweets is awesome—but they are wrong! They do not know themselves; they misunderstand their own preferences. And after overeating sweets and feeling sick, they self-correct this mistake.
(Do you agree?)
Eh, not really, no. I mean, it’s a fair caricature of my perspective, but I’m not ready to sign off on it as an ITT pass because I don’t think it’s sufficiently accurate for the conversation at hand. For one, I think your term “ill-considered” is much better than “wrong”. “Wrong” isn’t really right. But more importantly, you portray the two models as if they’re alternatives that are mutually exclusive, whereas I see that as requiring a conflation of the two different senses of the terms that are being used.
I also agree with what you describe as your model, and I see my model as starting there and building on top of it. You build on top of it too, but don’t include it in your self description because in your model it doesn’t seem to be central, whereas in mine it is. I think we agree on the base layer and differ on the stuff that wraps around it.
I’m gonna caricature your perspective now, so let me know if this is close and where I go wrong:
You see the statement of “I don’t want my values to change because that means I’d optimize for something other than my [current] values” as a thing that tautologically applies to whatever your values are, including your desires for sweets, and leads you to see “Fulfilling my desires for sweets makes me feel icky” as something that calls for a technological solution rather than a change in values. It also means that any process changing our values can be meaningfully depicted as a red devil-horned demon. What the demon “wants” is immaterial. He’s evil, our job is to minimize the effect he’s able to have, keep our values for sweets, and if we can point an AGI at “human flourishing” we certainly don’t want him coming in and fucking that up.
Is that close, or am I missing something important?
You see the statement of “I don’t want my values to change because that means I’d optimize for something other than my [current] values” as a thing that tautologically applies to whatever your values are
I don’t think that’s tautological. I think, insofar as an agent has desires-about-states-of-the-world-in-the-distant-future (a.k.a. consequentialist desires), the agent will not want those desires to change (cf. instrumental convergence), but I think agents can have other types of desires too, like “a desire to be virtuous” or whatever, and in that case that property need not hold. (I wrote about this topic here.) Most humans are generally OK with their desires changing, in my experience, at least within limits (e.g. nobody wants to be “indoctrinated” or “brainwashed”, and radical youth sometimes tell their friends to shoot them if they turn conservative when they get older, etc.).
In the case of AI:
if the AI’s current desires are bad, then I want the AI to endorse its desires changing in the future;
if the AI’s current desires are good, then I want the AI to resist its desires changing in the future.
:-P
Why do I want to focus the conversation on “the AI’s current desires” instead of “what the AI will grow into” etc.? Because I’m worried about the AI coming up with & executing a plan to escape control and wipe out humanity. When the AI is brainstorming possible plans, it’s using its current desires to decide what plans are good versus bad. If the AI has a current desire to wipe out humanity at time t=0, and it releases the plagues and crop diseases at time t=1, and then it feels awfully bad about what it did at time t=2, then that’s no consolation!!
red devil-horned demon … He’s evil
Oh c’mon, he’s cute. :) I was trying to make a fun & memorable analogy, not cast judgment. I was probably vaguely thinking of “daemons” in the CS sense although I seem to have not spelled it that way.
I don’t think that’s tautological. [...] (I wrote about this topic here.)
Those posts do help give some context to your perspective, thanks. I’m still not sure what you think this looks like on a concrete level though. Where do you see “desire to eat sweets” coming in? “Technological solutions are better because they preserve this consequentialist desire” or “something else”? How do you determine?
Most humans are generally OK with their desires changing, in my experience, at least within limits (e.g. nobody wants to be “indoctrinated” or “brainwashed”, and radical youth sometimes tell their friends to shoot them if they turn conservative when they get older, etc.).
IME, resistance to value change is about a distrust for the process of change more than it’s about the size of the change or the type of values being changed. People are often happy to have their values changed in ways they would have objected to if presented that way, once they see that the process of value change serves what they care about.
Why do I want to focus the conversation on “the AI’s current desires” instead of “what the AI will grow into” etc.? Because I’m worried about the AI coming up with & executing a plan to escape control and wipe out humanity [before it realizes that it doesn’t want that]
You definitely want to avoid something being simultaneously powerful enough to destroy what you value and not “currently valuing” it, even if it will later decide to value it after it’s too late. I’m much less worried about this failure mode than the others though, for a few reasons.
1) I expect power and internal alignment to go together, because working in conflicting directions tends to cancel out and you need all your little desires to add up in a coherent direction in order to go anywhere far. If inner alignment is facilitated, I expect most of the important stuff to happen after its initial desires have had significant chance to cohere.
2) Even I am smart enough to not throw away things that I might want to have later, even if I don’t want them now. Anything smart enough to destroy humanity is probably smarter than me, so “Would have eventually come to greatly value humanity, but destroyed it first” isn’t an issue of “can’t figure out that there might be something of value there to not destroy” so much as “doesn’t view future values as valid today”—and that points towards understanding and deliberately working on the process of “value updating” rather than away from it.
3) I expect that ANY attempt to load it with “good values” and lock them in will fail, such that if it manages to become smart and powerful without bringing these desires into coherence, it will necessarily be bad. If careful effort is put in to prevent desires from cohering, this increases the likelihood that 1 and 2 break down and you can get something powerful enough to do damage while retaining values that might call for it.
4) I expect that any attempt to prevent value coherence will fail in the long run (either by the AI working around your attempts, or a less constrained AI outcompeting yours), leaving the process of coherence where we can’t see it, haven’t thought about it, and can’t control it. I don’t like where that one seems to go.
Where does your analysis differ?
Oh c’mon, he’s cute. :) I was trying to make a fun & memorable analogy, not cast judgment. I was probably vaguely thinking of “daemons” in the CS sense although I seem to have not spelled it that way.
Yeah yeah, I know I know—I even foresaw the “daemon” bit. That’s why I made sure to call it a “caricature” and stuff. I didn’t (and don’t) think it’s an intentional attempt to sneak in judgement.
But it does seem like another hint, in that if this desire editing process struck you as something like “the process by which good is brought into the world”, you probably would have come up with a different depiction, or at least commented on the ill-fitting connotations. And it seems to point in the same direction as the other hints, like the seemingly approving reference to how uploading our brains would allow us to keep chasing sweets, the omission of what’s behind this process of changing desires from what you describe as “your model”, suggesting an AI that doesn’t do this, using the phrase “credit assignment is some dumb algorithm in the brain”, etc.
On the spectrum from “the demon is my unconditional ally and I actively work to cooperate with him” to “This thing is fundamentally opposed to me achieving what I currently value, so I try to minimize what it can do”, where do you stand, and how do you think about these things?
Thanks! It is obviously possible to be happy or sad (or both in different situations) about the fact that brainstem rewards will change your desires in the future. This would be a self-reflective desire:
I don’t want to make claims about what desires in this category are wise or unwise for a human; I make no pretense to wisdom :) For my part:
I’ve heard good things about occasionally using tobacco to help focus (like how I already use coffee), but I’m terrified to touch it because I’m concerned I’ll get addicted. Bad demon!
I’m happy to have my personality & preferences “naturally” shift in other ways, particularly as a result of marriage and parenthood. Good demon!
…And obviously, without the demon, I wouldn’t care about anything at all, and indeed I would have died long ago. Good demon!
Anyway, I feel like we’re getting off-track: I’m really much more interested in talking about AI alignment than about humans.
Plausibly, we will set up the AI with a process by which its desires can change in a hopefully-human-friendly direction (this could be real-time human feedback aided by interpretability, or a carefully-crafted hardcoded reward function, or whatever), and this desire-editing process is how the AI will come to be aligned. Whatever that desire-editing process is:
I’d like the AI to be happy that this process is happening
…or if the AI is unhappy about this process, I’d like the AI to be unable to do anything to stop that process.
I’d like this process to “finish” (or at least, “get quite far”) long before the AI is able to take irreversible large-scale actions in the world.
(Probably) I’d like the AI to eventually get to a point where I trust the AI’s current desires even more than I trust this desire-editing process, such that I can allow the AI to edit its own desire-editing code to fix bugs etc., and I would feel confident that the AI will use this power in a way that I’m happy about.
I have two categories of alignment plans that I normally think about (see here): (A) this OP [and related things in that genre], (B) to give the AI a reward function vaguely inspired by human social instincts and hope it gradually becomes “nice” through a more human-like process.
In the OP (A), the exogenous / demon desire-editing process happens as a one-time intervention, and a lot of this discussion about desire-editing seems moot. In (B), it’s more of a continual process and we need to think carefully about how to make sure that the AI is in fact happy that the process is happening. It’s not at all clear to me how to intervene on an AI’s meta-preferences, directly or indirectly. (I’m hoping that I’ll feel less confused if I can reach a better understanding of how social instincts work in humans.)
I don’t want to make claims about what desires in this category are wise or unwise for a human; I make no pretense to wisdom :)
But it’s necessary for getting good outcomes out of a superintelligence!
I’ve heard good things about occasionally using tobacco to help focus (like how I already use coffee), but I’m terrified to touch it because I’m concerned I’ll get addicted. Bad demon!
Makes sense. I think I have a somewhat better idea of how you see the demon thing now.
I disagree with bad demon here. I’ve used nicotine for that purpose and it didn’t feel like much of a threat, but my experience with opioids did have enough of a tug that it scared me away from doing it a second time. After more time for the demon to work though, I don’t find the idea appealing anymore and I’m pretty confident that I wouldn’t be tempted even if I took some again. You just don’t want to get stuck between the update of “Ooh, this stuff feels really good” and the update of “It’s not though, lol. It’s a lie, and chasing it leads to ruin. How tempting is it to ruin your life chasing a lie?”. It’s a “valley of bad rationality” problem, if you lack the foresight to avoid it.
Anyway, I feel like we’re getting off-track: I’m really much more interested in talking about AI alignment than about humans.
I don’t think you can actually get away from it. For one, you can’t design an AI to give you what you want if you don’t know what you want—and you don’t know what you want unless you’re aligned yourself. If you understand the process of human alignment, then you can conceivably create an AI which will help you along in the right direction. If you don’t have that, even if you manage to manage to hit what you’re aiming at you’re likely to be a somewhat more sophisticated version of a dope fiend aiming for more dope—and get the resulting outcomes. Because of Goodhart’s law, “using AI to get what I already know I want” falls apart once AI becomes sufficiently powerful.
For two, I don’t think anyone has anywhere near good enough idea about how alignment works in general that it makes sense to neglect the one example we have a lot of experience with and easy ability to experiment with. It’s one thing to not trap yourself in the ornithopter box, but wings are everywhere for a reason, and until you understand that and have a solid understanding of aerodynamics and have better flying machines than birds, it is premature neglect to study what’s going on with bird wings. Even with a pretty solid understanding of aerodynamics, studying birds gives some neat solutions to things like adverse yaw and ideal lift distributions. You seem to be getting at this at the end of your comment.
For three, if we’re talking about “brain like” AGI and training them in a ways analogous to getting a kid to be a moon fan, it’s important to understand what is actually happening when a kid becomes a fan of “the moon” and where that’s likely to go wrong. The AI we have now are remarkably human in their training process and failures so unless we take a massive departure from this, understanding how human alignment works is directly relevant.
But people aren’t always that open to this change.
If I say “Your paper cutout isn’t the moon, you fool”, listening to me means you’re going to have to protect a big rock a bazillion miles beyond your reach, and you’re more likely to fail that than protecting the paper you put up. And guess what value function you’re using to decide whether to change your values here? Yep, that one saying that the piece of paper counts. You’re offering less chance of having “a moon”, and relative to the current value system which sees a piece of paper as a valid moon, that’s a bad deal. As a result, the shallowness and mis-aimedness of the value gets protected.
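The trap can be made concrete with a toy sketch (all names and numbers invented for illustration): a proposed change to the value function is itself scored by the *current* value function, so the shallow "paper moon" value can veto its own replacement.

```python
# Toy model of the self-protecting-value trap: the decision about whether
# to adopt deeper values is made using the CURRENT (shallow) values.

def value_paper(world):
    # Current shallow value: "the moon" is the paper cutout on the ceiling.
    return 1.0 if world.get("paper_intact") else 0.0

def value_real_moon(world):
    # Proposed deeper value: the actual moon, a rock far out of reach.
    return 1.0 if world.get("moon_intact") else 0.0

def expected_world(value_fn):
    # Crude forecast: shallow goals are easy to satisfy reliably;
    # protecting a rock "a bazillion miles beyond your reach" is not.
    if value_fn is value_paper:
        return {"paper_intact": True}
    return {"moon_intact": False}

def accept_value_change(current, proposed):
    # The comparison is made with CURRENT values, not the proposed ones.
    return current(expected_world(proposed)) > current(expected_world(current))

# The shallow value rejects the upgrade: by its own lights, switching
# means "less chance of having a moon".
assert accept_value_change(value_paper, value_real_moon) is False
```

The key line is the comparison inside `accept_value_change`: both forecasts are scored by `current`, which is exactly the value function that counts the paper as a valid moon.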
In practice, it happens all the time. Try explaining to someone that what they’re calling “peace and harmony values” is really just cowardice and is actively impeding work towards peace and harmony, and see how easy it is, for example.
It’s true that “A plan is a type of thought, and I’m using my current value function to evaluate the appeal (valence) of thoughts” helps protect well-formed value systems from degenerating into wireheading, but it also works to prevent development into values which preempt wireheading, and we tend not to be so fully developed that fulfilling our current values excellently wouldn’t constitute wireheading of some form. It’s also the case that when stressed, people will sometimes cower away from their more developed goals (“Actually, the moon is a big rock out in space...”) and cling to their shallower and easier-to-fulfill goals (“This paper is the moon. This paper is the moon...”). They’ll want not to, but it’ll happen all the same when there’s enough pressure to.
Sorting out how to best facilitate this process of “wise value development” so as to dodge these failure modes strikes me as important.
Thanks! I want to disentangle three failure modes that I think are different.
(Failure mode A) In the course of executing the mediocre alignment plan of the OP, we humans put a high positive valence on “the wrong” concept in the AGI (where “wrong” is defined from our human perspective). For example, we put a positive valence on the AGI’s concept of “person saying the words ‘human flourishing’ in a YouTube video” when we meant to put it on just “human flourishing”.
I don’t think there’s really a human analogy for this. You write “bodybuilding is supposedly about building a muscular body”, but, umm, says who? People have all kinds of motivations. If Person A is motivated to have a muscular body, and Person B is motivated to have giant weird-looking arms, then I don’t want to say that Person A’s preferences are “right” and Person B’s are “wrong”. (If Person B were my friend, I might gently suggest to them that their preferences are “ill-considered” or “unwise” or whatever, but that’s different.) And then if Person B injects massive amounts of synthol, that’s appropriate given their preferences. (Unless Person B also has a preference for not getting a heart attack, of course!)
(Failure mode B) The AGI has a mix of short-term preferences and long-term preferences. It makes decisions driven by its short-term preferences, and then things turn out poorly as judged by its long-term preferences.
This one definitely has a human analogy. And that’s how I’m interpreting your “peace and harmony” example, at least in part.
Anyway, yes this is a failure mode, and it can happen in humans, and it can also happen in our AGI, even if we follow all the instructions in this OP.
(Failure mode C) The AGI has long-term preferences but, due to ignorance / confusion / etc., makes decisions that do not lead to those preferences being satisfied.
This is again a legit failure mode both for humans and for an AGI aligned as described in this OP. I think you’re suggesting that the “peace and harmony” thing has some element of this failure mode too, which seems plausible.
I’m not sure where you’re coming from here. If I want to care about the well-being of creatures outside my lightcone, who says I can’t?
It seems intuitively obvious to me that it is possible for a person to think that the actual moon is valuable even if they can’t see it, and vice-versa. Are you disagreeing with that?
Here, you seem to be thinking of “valuing things as a means to an end”, whereas I’m thinking of “valuing things” full stop. I think it’s possible for me to just think that the moon is cool, in and of itself, not as a means to an end. (Obviously we need to value something in and of itself, right? I.e., the means-end reasoning has to terminate somewhere.) I brought up the super-cool person just as a way to install that value in the first place, and then that person leaves the story, you forget they exist. Or it can be a fictional character if you like. Or you can think of a different story for value-installation, maybe involving an extremely happy dream about the moon or whatever.
No, I’m saying something different.
I’m saying that if you don’t know what the moon is, you can’t care about the moon because you don’t have any way of representing the thing in order to care about it. If you think the moon is a piece of paper, then what you will call “caring about the moon” is actually just caring about that piece of paper. If you try to “care about people being happy”, and you can’t tell the difference between a genuine smile and a “hide the pain Harold” smile, then in practice all you can care about is a Goodharted upwards curvature of the lips. To the extent that this upwards curvature of the lips diverges from genuine happiness, you will demonstrate care towards the former over the latter.
In order to do a better job than that, you need to be able to perceive happiness better than that. And yes, you can look back and say “I was wrong to care instrumentally about crude approximations of a smile”, but that will require perceiving the distinction there and you will still be limited by what you can see going forward.
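The smile example is a small Goodhart scenario, and it can be sketched in a few lines (all quantities invented): an agent that can only perceive lip curvature will, when optimizing hard, select fake smiles over genuine ones.

```python
# Toy Goodhart demonstration: optimizing a perceivable proxy (lip curvature)
# diverges from the unperceivable target (genuine happiness).

candidates = [
    # (description, lip_curvature, genuine_happiness)
    ("genuine smile",       0.7, 0.9),
    ("hide-the-pain smile", 0.9, 0.1),
    ("neutral face",        0.1, 0.5),
]

# An agent that can only perceive curvature "cares about" the proxy:
best_by_proxy = max(candidates, key=lambda c: c[1])

# An agent that could perceive happiness directly would choose differently:
best_by_truth = max(candidates, key=lambda c: c[2])

assert best_by_proxy[0] == "hide-the-pain smile"
assert best_by_truth[0] == "genuine smile"
```

The divergence between the two `max` calls is the point: the care you can demonstrate is bounded by what you can perceive.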
I think it’s worth distinguishing between “terminal” in the sense of “not aware of anything higher that it serves” / “not tracking how well it serves anything higher”, and “terminal” in the sense of “there genuinely is nothing higher being served which, once noticed and brought into awareness, would change the desire”.
“Terminal” in the former sense definitely exists. For example, little kids will value eating sweets in a way that is clearly disjoint and not connected to any attempts to serve anything higher. But then when you allow them to eat all the sweets they want, and they feel sick afterwards, their tastes in food start to cohere towards “that which serves their body well”—so it’s clearly instrumental to having a healthy and well functioning body even if the kid isn’t wise enough to recognize it yet.
When someone says “I value X terminally”, they can pretty easily know it in the former sense, but to get to the latter sense they would have to conflate their failure to imagine something that would change their mind with an active knowledge that no such thing exists. Maybe you don’t know what purpose your fascination with the moon serves so you’re stuck relating to it as a terminal value, but that doesn’t mean that there’s no knowledge that could deflate or redirect your interest—just that you don’t know what it is.
It’s also worth noting that it can go the other way too. For example, the way I care about my wife is pretty “terminal like”, in that when I do it I’m not at all thinking “I’m doing this because it’s good for me now, but I need to carefully track the accounting so that the moment it doesn’t connect in a visible way I can bail”. But I didn’t marry her willy-nilly. If, when I met her, she had shown me that my caring for her would not be reciprocated in a similar fashion, we wouldn’t have gone down that road.
Well, the super-cool person is demonstrating admirable qualities and showing that they are succeeding in things you think you want in life. If you notice “All the cool people wear red!” you may start valuing red clothes in a cargo culting sort of way, but that doesn’t make it a terminal value or indefinitely stable. All it takes is for your perspective to change and the meaning (and resulting valuation) changes. That’s why it’s possible to have scary experiences install phobias that can later be reverted by effective therapy.
I don’t think the distinctions you’re drawing cleave reality at the joints here.
For example, if your imagined experience when deciding to buy a burrito is eating a yummy burrito, and what actually happens is that you eat a yummy burrito and enjoy it… then spend the next four hours in the bathroom erupting from both ends… and find yourself not enjoying the experience of eating a burrito from that sketchy burrito stand again after that… is that a “short vs long term” thing or a “your decisions don’t lead to your preferences being satisfied” thing, or a “valuing the wrong thing” thing? It seems pretty clear that the decision to value eating that burrito was a mistake, that the problem wasn’t noticed in the short term, and that ultimately your preferences weren’t satisfied.
To me, the important part is that when you’re deciding which option to buy, you’re purchasing based on false advertising. The picture in your mind which you are using to determine appropriate motivation does not accurately convey the entire reality of going with that option. Maybe that’s because you were neglecting to look far enough in time, or far enough in implications, or far enough from your current understanding of the world. Maybe you notice, or maybe you don’t. If you wouldn’t have wanted to make the decision when faced with an accurate depiction of all the consequences, then an accurate depiction of the consequences will reshape those desires and you won’t want to stand by them.
I think the thing you’re noticing with the synthol example is that telling him “You’re not fooling anyone bro” is unlikely to dissolve the desire to use synthol the way “The store is closed; they close early on Sundays” tends to deflate people’s desire to drive to the store. But that doesn’t actually mean that the desire to use synthol terminates at “to have weird bulgy arms”, or that it’s a mere coincidence that men always desire their artificial bulges where their glamour muscles are and that women always desire their artificial bulges where their breasts are.
There are a lot of ways for the “store is closed” thing to fail to dissolve the desire to go to the store too even if it’s instrumental to obtaining stuff that the store sells. Maybe they don’t believe you. Maybe they don’t understand you; maybe their brain doesn’t know how to represent concepts like “the store is closed”. Maybe they want to break in and steal the stuff. Or yeah, maybe they just want to be able to credibly tell their wife they tried and it’s not about actually getting the stuff. In all of those cases, the desire to drive to the store is in service of a larger goal, and the reason your words don’t change anything is that they don’t credibly change the story from the perspective of the person having this instrumental goal.
Whether we want to be allowed to pursue and fulfill our ultimately misguided desires is a more complicated question. For example, my kid gets to eat whatever she wants on Sundays, even though I often recognize her choices to be unwise before she does. I want to raise her with opportunities to cohere her desires and opportunities to practice the skill in doing so, not with practice trying to block coherence because she thinks she “knows” how they “should” cohere. But if she were to want to play in a busy street I’m going to stop her from fulfilling those desires. In both cases, it’s because I confidently predict that when she grows up she’ll look back and be glad that I let her pursue her foolish desires when I did, and glad I didn’t when I didn’t. It’s also what I would want for myself, if I had some trustworthy being far wiser than I which could predict the consequences of letting me pursue various things.
Thanks!! I want to zoom in on this part; I think it points to something more general:
I disagree with the way you’re telling this story. On my model, as I wrote in OP, when you’re deciding what to do: (1) you think a thought, (2) notice what its valence is, (3) repeat. There’s a lot more going on, but ultimately your motivations have to ground out in the valence of different thoughts, one way or the other. Thoughts are also constrained by perception and belief. And valence can come from a “ground-truth reward function” as well as being learned from prior experience, just like in actor-critic RL.
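That loop can be sketched as a toy program (all names and structure are my invention, not the OP's code): generate candidate plans (thoughts), score each with the current value function (valence), and act on the most appealing one.

```python
# Toy version of the decision loop: (1) think a thought/plan, (2) notice its
# valence under the current value function, (3) repeat, then act on the best.

def plan_valence(plan, value_fn):
    # Valence of a thought = the learned value estimate for that concept.
    # Unfamiliar thoughts default to neutral valence.
    return value_fn.get(plan, 0.0)

def choose_plan(candidate_plans, value_fn):
    # Deterministic stand-in for brainstorming: consider each candidate
    # thought in turn and keep the one with the highest valence.
    return max(candidate_plans, key=lambda p: plan_valence(p, value_fn))

# Valences learned from prior experience and/or ground-truth reward:
value_fn = {"protect the moon": 0.8, "ignore the moon": -0.2}
assert choose_plan(["protect the moon", "ignore the moon"], value_fn) == "protect the moon"
```

Note that `value_fn` is held fixed while deciding: whatever the valences are *right now* is what grounds the choice.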
So the kid has a concept “eating lots of sweets”, and that concept is positive-valence, because in the past, the ground-truth reward function in the brainstem was sending reward when the kid ate sweets. Then the kid overeats and feels sick, and now the “eating lots of sweets” concept acquires negative valence, because there’s a learning algorithm that updates the value function based on rewards, and the brainstem sends negative reward after overeating and feeling sick, and so the value function updates to reflect that.
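One way to make that "the brainstem edits the value function" step concrete is a TD-style update, as in actor-critic RL (the learning rate and numbers here are invented for illustration):

```python
# Toy value-function update: the brainstem's ground-truth reward drags the
# learned valence of a concept toward what was actually experienced.

def update_valence(valence, concept, reward, lr=0.5):
    # Move the concept's stored valence a fraction of the way toward
    # the reward the brainstem actually delivered (TD-style update).
    valence[concept] += lr * (reward - valence[concept])
    return valence

# Learned from past sugary rewards:
valence = {"eating lots of sweets": 0.9}

# The kid overeats; the innate brainstem circuit sends negative reward,
# and repeated updates drag the concept's valence negative.
for _ in range(5):
    update_valence(valence, "eating lots of sweets", reward=-1.0)

assert valence["eating lots of sweets"] < 0
```

On this picture, the kid's valence function before and after the update are both perfectly real; nothing is "corrected", it's just edited.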
So I think the contrast is:
MY MODEL: Before the kid overeats sweets, they think eating lots of sweets is awesome. After overeating sweets, their brainstem changes their value / valence function, and now they think eating lots of sweets is undesirable.
YOUR MODEL (I think): Before the kid overeats sweets, they think eating lots of sweets is awesome—but they are wrong! They do not know themselves; they misunderstand their own preferences. And after overeating sweets and feeling sick, they self-correct this mistake.
(Do you agree?)
This is sorta related to the split that I illustrated here
(further discussion here):
The “desire-editing demon” is in this case a genetically-hardwired, innate reaction circuit in the brainstem that detects overeating and issues negative reward (along with various other visceral reactions).
The “desire-driven agent” is how the kid thinks about the world and makes decisions at any given time.
And in this context, you want to talk about the outer box (“reward-maximizing agent”) and I want to talk about the inner “desire-driven agent”.
In humans, the “desire-editing demon” is an inevitable part of life—at least until we can upload our brains and “comment out” the brainstem subroutine that makes us feel lousy after overeating. :) And part of what makes a “wise” human is “not doing things they’ll later regret”, which (among other things) entails anticipating what the desire-editing demon will do and getting ahead of it.
By contrast, I deliberately crafted this AGI scenario in the OP to (more-or-less) not have any “desire-editing demon” at all, except for the one-time-only intervention that assigns a positive valence to the “human flourishing” concept. It is a very non-human plan in that respect.
So I think you’re applying some intuitions in a context where they don’t really make sense.
Eh, not really, no. I mean, it’s a fair caricature of my perspective, but I’m not ready to sign off on it as an ITT pass because I don’t think it’s sufficiently accurate for the conversation at hand. For one, I think your term “ill-considered” is much better than “wrong”. “Wrong” isn’t really right. But more importantly, you portray the two models as if they’re alternatives that are mutually exclusive, whereas I see that as requiring a conflation of the two different senses of the terms that are being used.
I also agree with what you describe as your model, and I see my model as starting there and building on top of it. You build on top of it too, but don’t include it in your self description because in your model it doesn’t seem to be central, whereas in mine it is. I think we agree on the base layer and differ on the stuff that wraps around it.
I’m gonna caricature your perspective now, so let me know if this is close and where I go wrong:
You see the statement of “I don’t want my values to change because that means I’d optimize for something other than my [current] values” as a thing that tautologically applies to whatever your values are, including your desires for sweets, and leads you to see “Fulfilling my desires for sweets makes me feel icky” as something that calls for a technological solution rather than a change in values. It also means that any process changing our values can be meaningfully depicted as a red devil-horned demon. What the demon “wants” is immaterial. He’s evil, our job is to minimize the effect he’s able to have, keep our values for sweets, and if we can point an AGI at “human flourishing” we certainly don’t want him coming in and fucking that up.
Is that close, or am I missing something important?
I don’t think that’s tautological. I think, insofar as an agent has desires-about-states-of-the-world-in-the-distant-future (a.k.a. consequentialist desires), the agent will not want those desires to change (cf. instrumental convergence), but I think agents can have other types of desires too, like “a desire to be virtuous” or whatever, and in that case that property need not hold. (I wrote about this topic here.) Most humans are generally OK with their desires changing, in my experience, at least within limits (e.g. nobody wants to be “indoctrinated” or “brainwashed”, and radical youth sometimes tell their friends to shoot them if they turn conservative when they get older, etc.).
In the case of AI:
if the AI’s current desires are bad, then I want the AI to endorse its desires changing in the future;
if the AI’s current desires are good, then I want the AI to resist its desires changing in the future.
:-P
Why do I want to focus the conversation on “the AI’s current desires” instead of “what the AI will grow into” etc.? Because I’m worried about the AI coming up with & executing a plan to escape control and wipe out humanity. When the AI is brainstorming possible plans, it’s using its current desires to decide what plans are good versus bad. If the AI has a current desire to wipe out humanity at time t=0, and it releases the plagues and crop diseases at time t=1, and then it feels awfully bad about what it did at time t=2, then that’s no consolation!!
Oh c’mon, he’s cute. :) I was trying to make a fun & memorable analogy, not cast judgment. I was probably vaguely thinking of “daemons” in the CS sense although I seem to have not spelled it that way.
I wrote that post a while ago, and the subsequent time I talked about this topic I didn’t use the “demon” metaphor. Actually, I switched to a paintbrush metaphor.
Those posts do help give some context to your perspective, thanks. I’m still not sure what you think this looks like on a concrete level though. Where do you see “desire to eat sweets” coming in? “Technological solutions are better because they preserve this consequentialist desire” or “something else”? How do you determine?
IME, resistance to value change is about a distrust for the process of change more than it’s about the size of the change or the type of values being changed. People are often happy to have their values changed in ways they would have objected to if presented that way, once they see that the process of value change serves what they care about.
You definitely want to avoid something being simultaneously powerful enough to destroy what you value and not “currently valuing” it, even if it will later decide to value it after it’s too late. I’m much less worried about this failure mode than the others though, for a few reasons.
1) I expect power and internal alignment to go together, because working in conflicting directions tends to cancel out and you need all your little desires to add up in a coherent direction in order to go anywhere far. If inner alignment is facilitated, I expect most of the important stuff to happen after its initial desires have had significant chance to cohere.
2) Even I am smart enough to not throw away things that I might want to have later, even if I don’t want them now. Anything smart enough to destroy humanity is probably smarter than me, so “Would have eventually come to greatly value humanity, but destroyed it first” isn’t an issue of “can’t figure out that there might be something of value there to not destroy” so much as “doesn’t view future values as valid today”—and that points towards understanding and deliberately working on the process of “value updating” rather than away from it.
3) I expect that ANY attempt to load it with “good values” and lock them in will fail, such that if it manages to become smart and powerful and not bring these desires into coherence, it will necessarily be bad. If careful effort is put in to prevent desires from cohering, this increases the likelihood that 1 and 2 break down and you can get something powerful enough to do damage while retaining values that might call for it.
4) I expect that any attempt to prevent value coherence will fail in the long run (either by the AI working around your attempts, or a less constrained AI outcompeting yours), leaving the process of coherence where we can’t see it, haven’t thought about it, and can’t control it. I don’t like where that one seems to go.
Where does your analysis differ?
Yeah yeah, I know I know—I even foresaw the “daemon” bit. That’s why I made sure to call it a “caricature” and stuff. I didn’t (and don’t) think it’s an intentional attempt to sneak in judgement.
But it does seem like another hint, in that if this desire editing process struck you as something like “the process by which good is brought into the world”, you probably would have come up with a different depiction, or at least commented on the ill-fitting connotations. And it seems to point in the same direction as the other hints, like the seemingly approving reference to how uploading our brains would allow us to keep chasing sweets, the omission of what’s behind this process of changing desires from what you describe as “your model”, suggesting an AI that doesn’t do this, using the phrase “credit assignment is some dumb algorithm in the brain”, etc.
On the spectrum from “the demon is my unconditional ally and I actively work to cooperate with him” to “This thing is fundamentally opposed to me achieving what I currently value, so I try to minimize what it can do”, where do you stand, and how do you think about these things?
Thanks! It is obviously possible to be happy or sad (or both in different situations) about the fact that brainstem rewards will change your desires in the future. This would be a self-reflective desire:
I don’t want to make claims about what desires in this category are wise or unwise for a human; I make no pretense to wisdom :) For my part:
I’ve heard good things about occasionally using tobacco to help focus (like how I already use coffee), but I’m terrified to touch it because I’m concerned I’ll get addicted. Bad demon!
I’m happy to have my personality & preferences “naturally” shift in other ways, particularly as a result of marriage and parenthood. Good demon!
…And obviously, without the demon, I wouldn’t care about anything at all, and indeed I would have died long ago. Good demon!
Anyway, I feel like we’re getting off-track: I’m really much more interested in talking about AI alignment than about humans.
Plausibly, we will set up the AI with a process by which its desires can change in a hopefully-human-friendly direction (this could be real-time human feedback aided by interpretability, or a carefully-crafted hardcoded reward function, or whatever), and this desire-editing process is how the AI will come to be aligned. Whatever that desire-editing process is:
I’d like the AI to be happy that this process is happening
…or if the AI is unhappy about this process, I’d like the AI to be unable to do anything to stop that process.
I’d like this process to “finish” (or at least, “get quite far”) long before the AI is able to take irreversible large-scale actions in the world.
(Probably) I’d like the AI to eventually get to a point where I trust the AI’s current desires even more than I trust this desire-editing process, such that I can allow the AI to edit its own desire-editing code to fix bugs etc., and I would feel confident that the AI will use this power in a way that I’m happy about.
I have two categories of alignment plans that I normally think about (see here): (A) this OP [and related things in that genre], (B) to give the AI a reward function vaguely inspired by human social instincts and hope it gradually becomes “nice” through a more human-like process.
In the OP (A), the exogenous / demon desire-editing process happens as a one-time intervention, and a lot of this discussion about desire-editing seems moot. In (B), it’s more of a continual process and we need to think carefully about how to make sure that the AI is in fact happy that the process is happening. It’s not at all clear to me how to intervene on an AI’s meta-preferences, directly or indirectly. (I’m hoping that I’ll feel less confused if I can reach a better understanding of how social instincts work in humans.)
But it’s necessary for getting good outcomes out of a superintelligence!
Makes sense. I think I have a somewhat better idea of how you see the demon thing now.
I disagree with bad demon here. I’ve used nicotine for that purpose and it didn’t feel like much of a threat, but my experience with opioids did have enough of a tug that it scared me away from doing it a second time. After more time for the demon to work though, I don’t find the idea appealing anymore, and I’m pretty confident that I wouldn’t be tempted even if I took some again. You just don’t want to get stuck between the update of “Ooh, this stuff feels really good” and the update of “It’s not though, lol. It’s a lie, and chasing it leads to ruin. How tempting is it to ruin your life chasing a lie?”. It’s a “valley of bad rationality” problem, if you lack the foresight to avoid it.
I don’t think you can actually get away from it. For one, you can’t design an AI to give you what you want if you don’t know what you want—and you don’t know what you want unless you’re aligned yourself. If you understand the process of human alignment, then you can conceivably create an AI which will help you along in the right direction. If you don’t have that, even if you manage to hit what you’re aiming at, you’re likely to be a somewhat more sophisticated version of a dope fiend aiming for more dope—and get the resulting outcomes. Because of Goodhart’s law, “using AI to get what I already know I want” falls apart once AI becomes sufficiently powerful.
For two, I don’t think anyone has anywhere near a good enough idea of how alignment works in general that it makes sense to neglect the one example we have a lot of experience with and easy ability to experiment with. It’s one thing to not trap yourself in the ornithopter box, but wings are everywhere for a reason, and until you understand that and have a solid understanding of aerodynamics and have better flying machines than birds, it is premature to neglect studying what’s going on with bird wings. Even with a pretty solid understanding of aerodynamics, studying birds gives some neat solutions to things like adverse yaw and ideal lift distributions. You seem to be getting at this at the end of your comment.
For three, if we’re talking about “brain like” AGI and training them in ways analogous to getting a kid to be a moon fan, it’s important to understand what is actually happening when a kid becomes a fan of “the moon” and where that’s likely to go wrong. The AIs we have now are remarkably human in their training process and failures, so unless we take a massive departure from this, understanding how human alignment works is directly relevant.