The utility functions of human children aren’t ‘perfectly inner aligned’ with those of their parents, but human-level alignment would probably be good enough. Don’t let perfect be the enemy of the good.
Children aren’t superintelligent AGIs for which instrumental convergence applies.
A genealogical line of descent of agents creating/training agents is isomorphic to a single agent undergoing significant self-modification.
For current capability regimes, sure. In the future? Not so clear. ‘Consequentialist’ is a more general idea.
How is ‘consequentialist’ more general? Do you have a practical example of a consequentialist agent that is different from a general model-based RL agent?
> The utility functions of human children aren’t ‘perfectly inner aligned’ with those of their parents, but human-level alignment would probably be good enough. Don’t let perfect be the enemy of the good.
> Children aren’t superintelligent AGIs for which instrumental convergence applies.
I understand this exchange as Ryan saying “the goals of AGI must be a perfect match to what we want”, and Jacob as replying “you can’t literally mean perfect, as in not even off by one part per googol, e.g. we bequeath the universe to the next generation despite knowing that they won’t share our values”, and then Ryan is doubling down “Yes I mean perfect”.
If so, I’m with Jacob. For one thing, if we perfectly nail the AGI’s motivation in regards to transparency, honesty, corrigibility, helpfulness, keeping humans in the loop, etc., but we mess up other aspects of the AGI’s motivation, then the AGI should help us identify and fix the problem. For another thing, we’re kinda hazy on what future we want in the first place—I don’t think there’s an infinitesimal target that we need parts-per-googol accuracy to hit. For yet another thing, I agree with Jacob that the fact that we’re OK bequeathing the universe to the next generation, even though we don’t really know what they’ll do with it (assuming you are in fact on board with that, as I think I am and most people are, although I suppose one could say it’s just status quo bias), is a very interesting datapoint worth thinking about, and it again hints that there may be approaches that don’t require parts-per-googol accuracy.
Normally in this kind of discussion I would be arguing the other side—I do think it will be awfully hard and perhaps impossible to get an AGI to wind up with motivations that are not catastrophically bad for humanity—but “it must be literally perfect” is going too far!
This argument about whether human-level alignment is sufficient is at least a decade old. I suspect one issue is that inter-human alignment is high variance. The phrase “human-level alignment” could conjure up anything from Gandhi to Hitler, from Bob Ross to Jeffrey Dahmer. If you model that as an adversarial draw, it’s pretty bad. As a random draw, it may be better than the unaligned default, but still high risk. I tend toward the more optimistic view: treat it as an optimistic draw, based on reverse engineering human altruism so we can control/amplify it.
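(A minimal sketch, in Python, of the adversarial / random / optimistic draw distinction above. The 0–1 “alignment quality” scale and the Beta(8, 2) shape are purely hypothetical choices for illustration, not a claim about the real distribution of human values.)

```python
import random

# Hypothetical illustration of the three "draw" models:
# score each human-level-aligned agent by how well its values match ours
# (0 = Dahmer-ish, 1 = Gandhi-ish); the Beta(8, 2) skew is an arbitrary assumption.
random.seed(0)
population = [random.betavariate(8, 2) for _ in range(10_000)]

adversarial_draw = min(population)               # an adversary hands us the worst case
random_draw = sum(population) / len(population)  # we get an average member
optimistic_draw = max(population)                # we get to select/engineer the best case

print(f"adversarial: {adversarial_draw:.2f}")
print(f"random:      {random_draw:.2f}")
print(f"optimistic:  {optimistic_draw:.2f}")
```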
I thought LW/MIRI was generally pessimistic on human-level alignment, but Rob Bensinger said “If we had AGI that were merely as aligned as a human, I think that would immediately eliminate nearly all of the world’s existential risk.” in this comment, which was an update for me.
So as a result I tend to see brain reverse engineering as much higher priority than it otherwise would deserve, for both inspiring artificial empathy/altruism and also shortening the timeframe until uploading.
> I tend to see brain reverse engineering as much higher priority than it otherwise would deserve, for both inspiring artificial empathy/altruism and also shortening the timeframe until uploading
My take is that the neocortex (and other bits) are running a quasi-general-purpose learning algorithm, and the hypothalamus and brainstem are “steering” that learning algorithm by sending multiple reward signals and other supervisory signals. (The latter are also doing lots of other species-specific instinct stuff that doesn’t interact with the learning algorithms, like regulating heart rate.)
So if we reverse-engineer the neocortex learning algorithm first, before learning anything new about the hypothalamus & brainstem, I think that we’d wind up with a recipe for making an AGI with radically alien motivations, but we still wouldn’t know how to make an AGI with human-like empathy / altruism.
I think there’s circuitry somewhere in the hypothalamus & brainstem that works in conjunction with the learning algorithms to create social instincts, and I’m strongly in favor of figuring out how those circuits work, and that’s one of the things that I’m trying to do myself. :-)
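(To make the division of labor concrete, here is a minimal toy sketch of that two-subsystem picture, assuming nothing beyond the comment above: a generic learner whose values are shaped entirely by reward signals from a separate, hard-coded steering module. The bandit setup and all names are hypothetical illustrations, not a model of the actual circuitry.)

```python
import random

# Toy sketch (hypothetical): a generic learning subsystem ("neocortex" stand-in)
# shaped entirely by reward signals from a separate, hard-coded steering
# subsystem ("hypothalamus & brainstem" stand-in).

class SteeringSystem:
    """Innate, non-learning circuitry: it just scores outcomes."""
    def reward(self, outcome: float) -> float:
        return 1.0 if outcome > 0.5 else -1.0  # crude built-in preference

class LearningSystem:
    """Generic learner: estimates action values from reward, knows nothing innately."""
    def __init__(self, n_actions: int):
        self.values = [0.0] * n_actions
        self.counts = [0] * n_actions

    def act(self) -> int:
        if random.random() < 0.1:  # occasional exploration
            return random.randrange(len(self.values))
        return max(range(len(self.values)), key=lambda a: self.values[a])

    def update(self, action: int, reward: float) -> None:
        self.counts[action] += 1
        self.values[action] += (reward - self.values[action]) / self.counts[action]

def environment(action: int) -> float:
    """Hypothetical world: action 2 tends to produce outcomes the steering system likes."""
    return random.gauss(0.7 if action == 2 else 0.3, 0.1)

steering, learner = SteeringSystem(), LearningSystem(n_actions=4)
for _ in range(2000):
    a = learner.act()
    learner.update(a, steering.reward(environment(a)))  # steering shapes what gets learned

print("learned action values:", [round(v, 2) for v in learner.values])
```

Swapping in a different SteeringSystem changes what the same learner ends up valuing, which is the sense in which the reward circuitry “steers” an otherwise generic learning algorithm, and why getting an AGI with human-like motivations would require understanding that circuitry, not just the learner.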
Yes, I concur. The cortex seems to get 90% or more of the attention in neuroscience, but those smaller, more ancient central brain structures probably have more of the innate complexity relevant for the learning machinery. That’s on my reading list (along with some of your brain articles a friend recommended).
> I understand this exchange as Ryan saying “the goals of AGI must be a perfect match to what we want”, and Jacob as replying “you can’t literally mean perfect, as in not even off by one part per googol, e.g. we bequeath the universe to the next generation despite knowing that they won’t share our values”, and then Ryan is doubling down “Yes I mean perfect”.
Oh, no, this wasn’t what I meant. I just meant that the usage of children as an example was poor because individual children don’t have the potential to successfully seek vast power.
There certainly is a level of sufficient alignment for a purely consequentialist utility function which looks like 1−ϵ as opposed to 1. I think this ϵ is pretty low, but I reiterate that this is for ‘purely long-run consequentialists’. Note that ϵ must be exceptionally low for this sort of AI not to seek power (assuming that avoiding power seeking is desired; perhaps we are fine with power seeking if we have the desired consequentialist values, whatever those may be, locked in).
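(One toy way to formalize the 1−ϵ claim; the decomposition and the symbols below are an illustrative assumption, not Ryan’s own formalism. Suppose the trained agent ends up maximizing
\[ U = (1-\epsilon)\,U_{\text{intended}} + \epsilon\,V \]
for some error term $V$. Write $P = U_{\text{intended}}(b) - U_{\text{intended}}(a)$ for the intended-utility cost of a power-seeking plan $a$ relative to a benign plan $b$, and $G = V(a) - V(b)$ for the error-term gain from the extra power. A purely long-run consequentialist then prefers the benign plan only when
\[ \epsilon\,G \le (1-\epsilon)\,P \quad\Longleftrightarrow\quad \epsilon \le \frac{P}{P+G}, \]
and if instrumental convergence makes $G$ astronomically larger than $P$, that bound on ϵ is indeed exceptionally small.)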
> If so, I’m with Jacob. For one thing, if we perfectly nail the AGI’s motivation in regards to transparency, honesty, corrigibility, helpfulness, keeping humans in the loop, etc., but we mess up other aspects of the AGI’s motivation, then the AGI should help us identify and fix the problem
Agreed, but these aren’t consequentialist properties. At least that isn’t how I model them.
I shouldn’t have given such a vague response to the child metaphor.