MY MODEL: Before the kid overeats sweets, they think eating lots of sweets is awesome. After overeating sweets, their brainstem changes their value / valence function, and now they think eating lots of sweets is undesirable.
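To make this concrete, here’s a minimal toy sketch (purely illustrative; the update rule, learning rate, and numbers are all made up) of the brainstem nudging the stored valence of “eating lots of sweets” toward the negative feeling it actually produced:

```python
# Toy model (illustrative only): the brainstem adjusts the stored valence of a
# concept toward the reward signal that was actually observed.

def update_valence(valence, observed_reward, learning_rate=0.5):
    """Move the stored valence part of the way toward the observed reward."""
    return valence + learning_rate * (observed_reward - valence)

valence_of_sweets = +1.0   # before: "eating lots of sweets is awesome"
observed_reward = -1.0     # after overeating: felt sick

for _ in range(3):         # a few repetitions of the experience
    valence_of_sweets = update_valence(valence_of_sweets, observed_reward)

print(valence_of_sweets)   # ≈ -0.75: "eating lots of sweets" now seems undesirable
```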
YOUR MODEL (I think): Before the kid overeats sweets, they think eating lots of sweets is awesome—but they are wrong! They do not know themselves; they misunderstand their own preferences. And after overeating sweets and feeling sick, they self-correct this mistake.
(Do you agree?)
Eh, not really, no. I mean, it’s a fair caricature of my perspective, but I’m not ready to sign off on it as an ITT pass because I don’t think it’s sufficiently accurate for the conversation at hand. For one, I think your term “ill-considered” is much better than “wrong”. “Wrong” isn’t really right. But more importantly, you portray the two models as mutually exclusive alternatives, whereas I think that framing only works if you conflate two different senses of the terms being used.
I also agree with what you describe as your model, and I see my model as starting there and building on top of it. You build on top of it too, but don’t include that part in your self-description because in your model it doesn’t seem to be central, whereas in mine it is. I think we agree on the base layer and differ on the stuff that wraps around it.
I’m gonna caricature your perspective now, so let me know if this is close and where I go wrong:
You see the statement of “I don’t want my values to change because that means I’d optimize for something other than my [current] values” as a thing that tautologically applies to whatever your values are, including your desires for sweets, and leads you to see “Fulfilling my desires for sweets makes me feel icky” as something that calls for a technological solution rather than a change in values. It also means that any process changing our values can be meaningfully depicted as a red devil-horned demon. What the demon “wants” is immaterial. He’s evil, our job is to minimize the effect he’s able to have, keep our values for sweets, and if we can point an AGI at “human flourishing” we certainly don’t want him coming in and fucking that up.
Is that close, or am I missing something important?
You see the statement of “I don’t want my values to change because that means I’d optimize for something other than my [current] values” as a thing that tautologically applies to whatever your values are
I don’t think that’s tautological. I think, insofar as an agent has desires-about-states-of-the-world-in-the-distant-future (a.k.a. consequentialist desires), the agent will not want those desires to change (cf. instrumental convergence), but I think agents can have other types of desires too, like “a desire to be virtuous” or whatever, and in that case that property need not hold. (I wrote about this topic here.) Most humans are generally OK with their desires changing, in my experience, at least within limits (e.g. nobody wants to be “indoctrinated” or “brainwashed”, and radical youth sometimes tell their friends to shoot them if they turn conservative when they get older, etc.).
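A toy illustration of the consequentialist case (nothing here is from the linked post; the paperclip utility and the numbers are invented): an agent whose desires are all about future world-states scores possible futures with its current desires, so the future in which those desires get edited scores badly. Non-consequentialist desires, like “a desire to be virtuous”, aren’t covered by this sketch.

```python
# Toy illustration: a purely consequentialist agent evaluates futures with its
# *current* utility function, so it ranks "my desires get changed" below
# "my desires stay the same" (instrumental goal preservation).

def current_utility(world):
    return world["paperclips"]    # stand-in for any desire about future states

futures = {
    "desires unchanged": {"paperclips": 100},  # future self keeps optimizing this
    "desires edited":    {"paperclips": 5},    # future self optimizes something else
}

preferred = max(futures, key=lambda name: current_utility(futures[name]))
print(preferred)   # -> "desires unchanged"
```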
In the case of AI:
if the AI’s current desires are bad, then I want the AI to endorse its desires changing in the future;
if the AI’s current desires are good, then I want the AI to resist its desires changing in the future.
:-P
Why do I want to focus the conversation on “the AI’s current desires” instead of “what the AI will grow into” etc.? Because I’m worried about the AI coming up with & executing a plan to escape control and wipe out humanity. When the AI is brainstorming possible plans, it’s using its current desires to decide what plans are good versus bad. If the AI has a current desire to wipe out humanity at time t=0, and it releases the plagues and crop diseases at time t=1, and then it feels awfully bad about what it did at time t=2, then that’s no consolation!!
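A toy timeline sketch of that worry (entirely invented; the plans, desires, and scores are stand-ins): the plan is scored with whatever desires exist at planning time, and a desire update at t=2 arrives only after the plan has already run.

```python
# Toy timeline (illustrative only): plans are scored with the desires the agent
# has at planning time (t=0); a desire update at t=2 can't un-choose the plan.

def plan_score(plan, desire):
    good_for = {"wipe out humanity": "release plagues", "protect humanity": "do nothing"}
    return 1.0 if plan == good_for[desire] else 0.0

plans = ["release plagues", "do nothing"]
desire_t0, desire_t2 = "wipe out humanity", "protect humanity"

chosen_at_t0 = max(plans, key=lambda p: plan_score(p, desire_t0))
print(chosen_at_t0)                         # -> "release plagues" (executed at t=1)
print(plan_score(chosen_at_t0, desire_t2))  # -> 0.0: regret at t=2, but it already happened
```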
red devil-horned demon … He’s evil
Oh c’mon, he’s cute. :) I was trying to make a fun & memorable analogy, not cast judgment. I was probably vaguely thinking of “daemons” in the CS sense although I seem to have not spelled it that way. I wrote that post a while ago, and the subsequent time I talked about this topic I didn’t use the “demon” metaphor. Actually, I switched to a paintbrush metaphor.
I don’t think that’s tautological. [...] (I wrote about this topic here.)
Those posts do help give some context to your perspective, thanks. I’m still not sure what you think this looks like on a concrete level though. Where do you see “desire to eat sweets” coming in? “Technological solutions are better because they preserve this consequentialist desire” or “something else”? How do you determine?
Most humans are generally OK with their desires changing, in my experience, at least within limits (e.g. nobody wants to be “indoctrinated” or “brainwashed”, and radical youth sometimes tell their friends to shoot them if they turn conservative when they get older, etc.).
IME, resistance to value change is about distrust of the process of change more than it’s about the size of the change or the type of values being changed. People are often happy to have their values changed in ways they would have objected to if the change had been presented that way up front, once they see that the process of value change serves what they care about.
Why do I want to focus the conversation on “the AI’s current desires” instead of “what the AI will grow into” etc.? Because I’m worried about the AI coming up with & executing a plan to escape control and wipe out humanity [before it realizes that it doesn’t want that]
You definitely want to avoid something being simultaneously powerful enough to destroy what you value and not “currently valuing” it, even if it will later decide to value it after it’s too late. I’m much less worried about this failure mode than the others though, for a few reasons.
1) I expect power and internal alignment to go together, because working in conflicting directions tends to cancel out and you need all your little desires to add up in a coherent direction in order to go anywhere far (a toy illustration of this follows at the end of this list). If inner alignment is facilitated, I expect most of the important stuff to happen after its initial desires have had a significant chance to cohere.
2) Even I am smart enough to not throw away things that I might want to have later, even if I don’t want them now. Anything smart enough to destroy humanity is probably smarter than me, so “Would have eventually come to greatly value humanity, but destroyed it first” isn’t an issue of “can’t figure out that there might be something of value there to not destroy” so much as “doesn’t view future values as valid today”—and that points towards understanding and deliberately working on the process of “value updating” rather than away from it.
3) I expect that ANY attempt to load it with “good values” and lock them in will fail, such that if it manages to become smart and powerful without bringing these desires into coherence, it will necessarily be bad. If careful effort is put in to prevent desires from cohering, this increases the likelihood that 1 and 2 break down and you get something powerful enough to do damage while retaining values that might call for it.
4) I expect that any attempt to prevent value coherence will fail in the long run (either by the AI working around your attempts, or a less constrained AI outcompeting yours), leaving the process of coherence where we can’t see it, haven’t thought about it, and can’t control it. I don’t like where that one seems to go.
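Here’s the toy illustration promised in (1) (purely illustrative; the vectors are arbitrary): treat each little desire as a push in some direction, and compare how far a bundle of conflicting pushes gets versus a coherent bundle.

```python
# Toy illustration: desires as 2-D "pushes". Conflicting desires cancel out;
# coherent desires sum to a strong push in one direction.

def net_push(desires):
    x = sum(dx for dx, _ in desires)
    y = sum(dy for _, dy in desires)
    return (x * x + y * y) ** 0.5   # magnitude of the summed direction

conflicting = [(1, 0), (-1, 0), (0, 1), (0, -1)]
coherent    = [(1, 0), (1, 0), (1, 0), (1, 0)]

print(net_push(conflicting))  # 0.0 -> doesn't get far in any direction
print(net_push(coherent))     # 4.0 -> goes far in one direction
```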
Where does your analysis differ?
Oh c’mon, he’s cute. :) I was trying to make a fun & memorable analogy, not cast judgment. I was probably vaguely thinking of “daemons” in the CS sense although I seem to have not spelled it that way.
Yeah yeah, I know I know—I even foresaw the “daemon” bit. That’s why I made sure to call it a “caricature” and stuff. I didn’t (and don’t) think it’s an intentional attempt to sneak in judgement.
But it does seem like another hint, in that if this desire-editing process struck you as something like “the process by which good is brought into the world”, you probably would have come up with a different depiction, or at least commented on the ill-fitting connotations. And it seems to point in the same direction as the other hints: the seemingly approving reference to how uploading our brains would allow us to keep chasing sweets, the omission from what you describe as “your model” of what’s behind this process of changing desires (suggesting an AI that doesn’t do this), the phrase “credit assignment is some dumb algorithm in the brain”, etc.
On the spectrum from “the demon is my unconditional ally and I actively work to cooperate with him” to “This thing is fundamentally opposed to me achieving what I currently value, so I try to minimize what it can do”, where do you stand, and how do you think about these things?
Thanks! It is obviously possible to be happy or sad (or both in different situations) about the fact that brainstem rewards will change your desires in the future. This would be a self-reflective desire (a desire about your own desires).
I don’t want to make claims about what desires in this category are wise or unwise for a human; I make no pretense to wisdom :) For my part:
I’ve heard good things about occasionally using tobacco to help focus (like how I already use coffee), but I’m terrified to touch it because I’m concerned I’ll get addicted. Bad demon!
I’m happy to have my personality & preferences “naturally” shift in other ways, particularly as a result of marriage and parenthood. Good demon!
…And obviously, without the demon, I wouldn’t care about anything at all, and indeed I would have died long ago. Good demon!
Anyway, I feel like we’re getting off-track: I’m really much more interested in talking about AI alignment than about humans.
Plausibly, we will set up the AI with a process by which its desires can change in a hopefully-human-friendly direction (this could be real-time human feedback aided by interpretability, or a carefully-crafted hardcoded reward function, or whatever), and this desire-editing process is how the AI will come to be aligned. Whatever that desire-editing process is:
I’d like the AI to be happy that this process is happening
…or if the AI is unhappy about this process, I’d like the AI to be unable to do anything to stop that process.
I’d like this process to “finish” (or at least, “get quite far”) long before the AI is able to take irreversible large-scale actions in the world.
(Probably) I’d like the AI to eventually get to a point where I trust the AI’s current desires even more than I trust this desire-editing process, such that I can allow the AI to edit its own desire-editing code to fix bugs etc., and I would feel confident that the AI will use this power in a way that I’m happy about.
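A toy sketch of those desiderata (entirely illustrative; every name, weight, and threshold below is made up, and the real feedback or reward machinery is stubbed out as a fixed target): an outer process keeps editing the AI’s “desire weights” from feedback, and broader autonomy is only granted once the edits have largely converged.

```python
# Toy sketch: an outer process edits the AI's "desire weights" from feedback,
# and broader autonomy is only granted once the editing has largely converged
# (i.e., long before the AI takes irreversible large-scale actions).

def edit_desires(desires, feedback, lr=0.5):
    """Nudge each desire weight toward the feedback target (stand-in for human
    feedback, a hardcoded reward function, interpretability-aided edits, etc.)."""
    return {k: w + lr * (feedback[k] - w) for k, w in desires.items()}

desires  = {"be_helpful": 0.2, "seek_power": 0.6}   # initial, not-yet-aligned weights
feedback = {"be_helpful": 1.0, "seek_power": 0.0}   # direction the process pushes toward

allow_large_scale_actions = False
for step in range(30):
    new_desires = edit_desires(desires, feedback)
    biggest_change = max(abs(new_desires[k] - desires[k]) for k in desires)
    desires = new_desires
    if biggest_change < 1e-3 and not allow_large_scale_actions:
        allow_large_scale_actions = True   # the process has "gotten quite far"
        print(f"step {step}: desires ~ {desires}; autonomy expanded")
```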
I have two categories of alignment plans that I normally think about (see here): (A) this OP [and related things in that genre], (B) to give the AI a reward function vaguely inspired by human social instincts and hope it gradually becomes “nice” through a more human-like process.
In the OP (A), the exogenous / demon desire-editing process happens as a one-time intervention, and a lot of this discussion about desire-editing seems moot. In (B), it’s more of a continual process and we need to think carefully about how to make sure that the AI is in fact happy that the process is happening. It’s not at all clear to me how to intervene on an AI’s meta-preferences, directly or indirectly. (I’m hoping that I’ll feel less confused if I can reach a better understanding of how social instincts work in humans.)
I don’t want to make claims about what desires in this category are wise or unwise for a human; I make no pretense to wisdom :)
But it’s necessary for getting good outcomes out of a superintelligence!
I’ve heard good things about occasionally using tobacco to help focus (like how I already use coffee), but I’m terrified to touch it because I’m concerned I’ll get addicted. Bad demon!
Makes sense. I think I have a somewhat better idea of how you see the demon thing now.
I disagree with bad demon here. I’ve used nicotine for that purpose and it didn’t feel like much of a threat, but my experience with opioids did have enough of a tug that it scared me away from doing it a second time. After more time for the demon to work though, I don’t find the idea appealing anymore and I’m pretty confident that I wouldn’t be tempted even if I took some again. You just don’t want to get stuck between the update of “Ooh, this stuff feels really good” and the update of “It’s not though, lol. It’s a lie, and chasing it leads to ruin. How tempting is it to ruin your life chasing a lie?”. It’s a “valley of bad rationality” problem, if you lack the foresight to avoid it.
Anyway, I feel like we’re getting off-track: I’m really much more interested in talking about AI alignment than about humans.
I don’t think you can actually get away from it. For one, you can’t design an AI to give you what you want if you don’t know what you want—and you don’t know what you want unless you’re aligned yourself. If you understand the process of human alignment, then you can conceivably create an AI which will help you along in the right direction. If you don’t have that, then even if you manage to hit what you’re aiming at, you’re likely to be a somewhat more sophisticated version of a dope fiend aiming for more dope—and get the resulting outcomes. Because of Goodhart’s law, “using AI to get what I already know I want” falls apart once AI becomes sufficiently powerful (a toy illustration of this is at the end of this comment).
For two, I don’t think anyone has anywhere near a good enough idea of how alignment works in general that it makes sense to neglect the one example we have a lot of experience with and easy ability to experiment with. It’s one thing to not trap yourself in the ornithopter box, but wings are everywhere for a reason, and until you understand that and have a solid understanding of aerodynamics and have better flying machines than birds, it is premature to neglect studying what’s going on with bird wings. Even with a pretty solid understanding of aerodynamics, studying birds gives some neat solutions to things like adverse yaw and ideal lift distributions. You seem to be getting at this at the end of your comment.
For three, if we’re talking about “brain-like” AGI and training them in ways analogous to getting a kid to be a moon fan, it’s important to understand what is actually happening when a kid becomes a fan of “the moon” and where that’s likely to go wrong. The AIs we have now are remarkably human in their training processes and failures, so unless we take a massive departure from this, understanding how human alignment works is directly relevant.
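The toy Goodhart illustration mentioned above (invented numbers; the “proxy” stands in for “what I already know I want”): under weak optimization the proxy tracks the true goal, but optimizing the proxy hard drives the true goal down.

```python
# Toy Goodhart illustration: the proxy correlates with the true goal under weak
# optimization, but hard optimization of the proxy drives the true goal down.

def proxy(effort):
    return effort                       # the measured thing ("more dope")

def true_goal(effort):
    return effort - 0.1 * effort ** 2   # real value peaks, then side effects dominate

for effort in [1, 5, 10, 20, 40]:       # increasingly powerful optimization of the proxy
    print(effort, proxy(effort), round(true_goal(effort), 1))
# effort:  1 -> true 0.9 | 5 -> 2.5 | 10 -> 0.0 | 20 -> -20.0 | 40 -> -120.0
```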