I agree that MIRI people never thought the problem was about getting AI to merely understand human values, and that they have always said there was extra difficulty in getting an AI to care about human values. But I distinctly recall MIRI people making a big deal about how the value identification problem would be hard. The value identification problem is the problem of creating a function that correctly distinguishes valuable from non-valuable outcomes.
If the AI’s “understanding of human values” is a specific set of 4000 unlabeled nodes out of a trillion-node unlabeled world-model, and we can never find them, then the existence of those nodes isn’t directly helpful. You need a “hook” into it, to connect those nodes to motivation, presumably. I think that’s what you’re missing. No “hook”, no alignment. So how do we make the “hook”?
One possible approach to constructing the “hook” would be (presumably) solving the value identification problem and then we have an explicit function in the source code and then … I dunno, but that seems like a plausibly helpful first step. Like maybe you can have code which searches through the unlabeled world-model for sets of nodes that line up perfectly with the explicit function, or whatever.
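To make that “search through the world-model” step a bit more concrete, here’s a toy sketch in Python/numpy of the kind of thing I mean: sample a batch of outcomes, score them with the explicit value function, and look for the nodes whose activations track those scores. This is purely illustrative, not anything MIRI (or anyone else) has actually proposed, and every name in it is a hypothetical placeholder.

```python
import numpy as np

def find_value_nodes(world_model_activations, explicit_value_fn, states, top_k=4000):
    """Toy sketch: score each world-model node by how well its activation
    tracks an explicit value function across sampled states, and return
    the best-matching nodes. All names here are hypothetical placeholders.

    world_model_activations: callable mapping a batch of states to an
        (n_states, n_nodes) array of node activations.
    explicit_value_fn: callable mapping a batch of states to an
        (n_states,) array of value scores.
    """
    acts = world_model_activations(states)    # (n_states, n_nodes)
    values = explicit_value_fn(states)        # (n_states,)

    # Correlate each node's activation with the explicit value scores.
    acts_centered = acts - acts.mean(axis=0)
    vals_centered = values - values.mean()
    denom = np.linalg.norm(acts_centered, axis=0) * np.linalg.norm(vals_centered) + 1e-8
    corr = (acts_centered.T @ vals_centered) / denom   # (n_nodes,)

    # Candidate "value nodes": the ones that line up best (by |correlation|).
    return np.argsort(-np.abs(corr))[:top_k]
```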
Another possible approach to constructing the “hook” would be to invoke the magic words “human values” or “what a human would like” or whatever, while pressing a magic button that connects the associated nodes to motivation. That was basically my proposal here, and is also what you’d get with AutoGPT, I guess. However…
GPT-4 is pretty good at distinguishing valuable from non-valuable outcomes
I think this is true in-distribution. I think MIRI people would be very interested in questions like “what transhumanist utopia will the AI be motivated to build?”, and it’s very unclear to me that GPT-4 would come to the same conclusions that CEV or whatever would come to. See the FAQ item on “concept extrapolation” here.
If the AI’s “understanding of human values” is a specific set of 4000 unlabeled nodes out of a trillion-node unlabeled world-model, and we can never find them, then the existence of those nodes isn’t directly helpful. You need a “hook” into it, to connect those nodes to motivation, presumably. I think that’s what you’re missing. No “hook”, no alignment. So how do we make the “hook”?
I’m claiming that the value identification function is obtained by literally just asking GPT-4 what to do in the situation you’re in. That doesn’t involve any internal search over the human utility function embedded in GPT-4’s weights. I think GPT-4 can simply be queried in natural language for ethical advice, and it’s pretty good at offering such advice in most situations you’re ever going to realistically encounter. GPT-4 is probably not human-level yet on this task, although I expect it won’t be long before GPT-N is about as good at knowing what’s ethical as your average human; maybe it’ll even be a bit more ethical.
(But yes, this isn’t the same as motivating GPT-4 to act on human values. I addressed this in my original comment though.)
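For concreteness, the kind of “value identification function” I have in mind is basically just a thin wrapper around a chat query. Here’s a minimal sketch assuming the current OpenAI Python client; the prompt wording, helper name, and model string are placeholders I’m making up for illustration, not a recommendation.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ethical_advice(situation: str) -> str:
    """Ask the model what would be most ethical to do in the given situation.

    A toy illustration of "just ask GPT-4"; it says nothing about getting an
    AI to be *motivated* by the answer.
    """
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder model name
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a careful advisor. Say which action would be most "
                    "ethical in the user's situation, and briefly explain why."
                ),
            },
            {"role": "user", "content": situation},
        ],
    )
    return response.choices[0].message.content

# Hypothetical usage:
# print(ethical_advice("I found a wallet with $200 and an ID card inside. What should I do?"))
```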
I think [GPT-4 is pretty good at distinguishing valuable from non-valuable outcomes] in-distribution. I think MIRI people would be very interested in questions like “what transhumanist utopia will the AI be motivated to build?”, and it’s very unclear to me that GPT-4 would come to the same conclusions that CEV or whatever would come to. See the FAQ item on “concept extrapolation” here.
I agree that MIRI people are interested in things like “what transhumanist utopia will the AI be motivated to build” but I think saying that this is the hard part of the value identification problem is pretty much just moving the goalposts from what I thought the original claim was. Very few, if any, humans can tell you exactly how to build the transhumanist utopia either. If the original thesis was “human values are hard to identify because it’s hard to extract all the nuances of value embedded in human brains”, now the thesis is becoming “human values are hard to identify because literally no one knows how to build the transhumanist utopia”.
But we don’t need AIs to build a utopia immediately! If we actually got AI to follow common-sense morality, it would follow from common-sense morality that you shouldn’t do anything crazy and irreversible right away, like killing all the humans. Instead, you’d probably want to try to figure out, with the humans, what type of utopia we ought to build.
(This is a weird conversation for me because I’m half-defending a position I partly disagree with and might be misremembering anyway.)
moving the goalposts from what I thought the original claim was
I’m going off things like the “Value is Fragile” example: “You can imagine a mind that contained almost the whole specification of human value, almost all the morals and metamorals, but left out just this one thing - [boredom] - and so it spent until the end of time, and until the farthest reaches of its light cone, replaying a single highly optimized experience, over and over and over again.”
That’s why I think they’ve always had extreme-out-of-distribution-extrapolation on their mind (in this context).
Very few, if any, humans can tell you exactly how to build the transhumanist utopia either.
Y’know, I think this is one of the many differences between Eliezer and some other people. My model of Eliezer thinks that there’s kinda a “right answer” to what-is-valuable-according-to-CEV / fun theory / etc., and hence there’s an optimal utopia, and insofar as we fall short of that, we’re leaving value on the table. Whereas my model of (say) Paul Christiano thinks that we humans are on an unprincipled journey forward into the future, doing whatever we do, and that’s the status quo, and we’d really just like for that process to continue and go well. (I don’t think this is an important difference, because Eliezer is in practice talking about extinction versus not, but it is a difference.) (For my part, I’m not really sure what I think. I find it confusing and stressful to think about.)
But we don’t need AIs to build a utopia immediately! If we actually got AI to follow common-sense morality, it would follow from common-sense morality that you shouldn’t do anything crazy and irreversible right away, like killing all the humans. Instead, you’d probably want to try to figure out, with the humans, what type of utopia we ought to build.
I’m mostly with you on that one, in the sense that I think it’s at least plausible (50%?) that we could make a powerful AGI that’s trying to be helpful and follow norms, but also doing superhuman innovative science, at least if alignment research progress continues. (I don’t think AGI will look like GPT-4, so reaching that destination is kinda different on my models compared to yours.) (Here’s my disagreeing-with-MIRI post on that.) (My overall pessimism is much higher than that though, mainly for reasons here.)
I’m claiming that the value identification function is obtained by literally just asking GPT-4 what to do in the situation you’re in.
AFAIK, GPT-4 is a mix of “extrapolating text-continuation patterns learned from the internet” + “RLHF based on labeled examples”.
For the former, I note that Eliezer commented in 2018 that “The central interesting-to-me idea in capability amplification is that by exactly imitating humans, we can bypass the usual dooms of reinforcement learning.” It kinda sounds like Eliezer is most comfortable thinking in terms of RL, and sees supervised learning (SL) as kinda different, maybe? (I could talk about my models here, but that’s a different topic… Anyway, I’m not really sure what Eliezer thinks.)
For the latter, again I think it’s a question of whether we care about our ability to extrapolate the labeled examples way out of distribution.
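To spell out the contrast I’m gesturing at between those two ingredients, here’s a schematic sketch in PyTorch of the two kinds of training signal. These are toy losses for illustration only, obviously not OpenAI’s actual training setup.

```python
import torch
import torch.nn.functional as F

# Schematic contrast between the two training signals mentioned above.
# Toy illustration only; not OpenAI's actual training code.

def next_token_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """'Extrapolating text-continuation patterns': plain cross-entropy on the
    next token, i.e. supervised imitation of the training distribution."""
    return F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))

def rlhf_style_loss(logprobs: torch.Tensor, reward: torch.Tensor) -> torch.Tensor:
    """'RLHF based on labeled examples': a REINFORCE-style update that raises
    the probability of completions a reward model (fit to human labels) scores
    highly. Real RLHF (e.g. PPO with a KL penalty) is more involved."""
    return -(reward.detach() * logprobs.sum(dim=-1)).mean()
```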