Thanks! It is obviously possible to be happy or sad (or both in different situations) about the fact that brainstem rewards will change your desires in the future. This would be a self-reflective desire:
I don’t want to make claims about what desires in this category are wise or unwise for a human; I make no pretense to wisdom :) For my part:
I’ve heard good things about occasionally using tobacco to help focus (like how I already use coffee), but I’m terrified to touch it because I’m concerned I’ll get addicted. Bad demon!
I’m happy to have my personality & preferences “naturally” shift in other ways, particularly as a result of marriage and parenthood. Good demon!
…And obviously, without the demon, I wouldn’t care about anything at all, and indeed I would have died long ago. Good demon!
Anyway, I feel like we’re getting off-track: I’m really much more interested in talking about AI alignment than about humans.
Plausibly, we will set up the AI with a process by which its desires can change in a hopefully-human-friendly direction (this could be real-time human feedback aided by interpretability, or a carefully-crafted hardcoded reward function, or whatever), and this desire-editing process is how the AI will come to be aligned. Whatever that desire-editing process is:
I’d like the AI to be happy that this process is happening
…or if the AI is unhappy about this process, I’d like the AI to be unable to do anything to stop that process.
I’d like this process to “finish” (or at least, “get quite far”) long before the AI is able to take irreversible large-scale actions in the world.
(Probably) I’d like the AI to eventually get to a point where I trust the AI’s current desires even more than I trust this desire-editing process, such that I can allow the AI to edit its own desire-editing code to fix bugs etc., and I would feel confident that the AI will use this power in a way that I’m happy about.
I have two categories of alignment plans that I normally think about (see here): (A) this OP [and related things in that genre], (B) to give the AI a reward function vaguely inspired by human social instincts and hope it gradually becomes “nice” through a more human-like process.
In the OP (A), the exogenous / demon desire-editing process happens as a one-time intervention, and a lot of this discussion about desire-editing seems moot. In (B), it’s more of a continual process and we need to think carefully about how to make sure that the AI is in fact happy that the process is happening. It’s not at all clear to me how to intervene on an AI’s meta-preferences, directly or indirectly. (I’m hoping that I’ll feel less confused if I can reach a better understanding of how social instincts work in humans.)
I don’t want to make claims about what desires in this category are wise or unwise for a human; I make no pretense to wisdom :)
But it’s necessary for getting good outcomes out of a superintelligence!
I’ve heard good things about occasionally using tobacco to help focus (like how I already use coffee), but I’m terrified to touch it because I’m concerned I’ll get addicted. Bad demon!
Makes sense. I think I have a somewhat better idea of how you see the demon thing now.
I disagree with bad demon here. I’ve used nicotine for that purpose and it didn’t feel like much of a threat, but my experience with opioids did have enough of a tug that it scared me away from doing it a second time. After more time for the demon to work though, I don’t find the idea appealing anymore and I’m pretty confident that I wouldn’t be tempted even if I took some again. You just don’t want to get stuck between the update of “Ooh, this stuff feels really good” and the update of “It’s not though, lol. It’s a lie, and chasing it leads to ruin. How tempting is it to ruin your life chasing a lie?”. It’s a “valley of bad rationality” problem, if you lack the foresight to avoid it.
Anyway, I feel like we’re getting off-track: I’m really much more interested in talking about AI alignment than about humans.
I don’t think you can actually get away from it. For one, you can’t design an AI to give you what you want if you don’t know what you want, and you don’t know what you want unless you’re aligned yourself. If you understand the process of human alignment, then you can conceivably create an AI which will help you along in the right direction. If you don’t have that, even if you manage to hit what you’re aiming at, you’re likely to be a somewhat more sophisticated version of a dope fiend aiming for more dope, and to get the resulting outcomes. Because of Goodhart’s law, “using AI to get what I already know I want” falls apart once AI becomes sufficiently powerful.
For two, I don’t think anyone has anywhere near a good enough idea about how alignment works in general that it makes sense to neglect the one example we have a lot of experience with and easy ability to experiment with. It’s one thing to not trap yourself in the ornithopter box, but wings are everywhere for a reason, and until you understand that reason, have a solid understanding of aerodynamics, and have better flying machines than birds, it is premature to neglect studying what’s going on with bird wings. Even with a pretty solid understanding of aerodynamics, studying birds gives some neat solutions to things like adverse yaw and ideal lift distributions. You seem to be getting at this at the end of your comment.
For three, if we’re talking about “brain-like” AGI and training it in ways analogous to getting a kid to be a moon fan, it’s important to understand what is actually happening when a kid becomes a fan of “the moon” and where that’s likely to go wrong. The AIs we have now are remarkably human in their training process and failures, so unless we take a massive departure from this, understanding how human alignment works is directly relevant.