(This is a weird conversation for me because I’m half-defending a position I partly disagree with and might be misremembering anyway.)
> moving the goalposts from what I thought the original claim was
I’m going off things like the value is fragile example: “You can imagine a mind that contained almost the whole specification of human value, almost all the morals and metamorals, but left out just this one thing - [boredom] - and so it spent until the end of time, and until the farthest reaches of its light cone, replaying a single highly optimized experience, over and over and over again.”
That’s why I think they’ve always had extreme-out-of-distribution-extrapolation on their mind (in this context).
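To make the quoted argument concrete, here's a toy sketch of my own (not Yudkowsky's formalism, and all the numbers are made up): an agent splits its time across a few experiences, and a crude diminishing-returns term stands in for boredom. Keep that one term and the best plan mixes everything; delete it and the best plan is to repeat the top experience forever.

```python
# Toy sketch (mine, not Yudkowsky's formalism; all numbers made up).
# An agent allocates time steps across a few experiences to maximize value.
base_value = {"art": 1.0, "friendship": 0.9, "discovery": 0.8, "play": 0.7}

def marginal_value(experience, times_repeated, boredom):
    # With the "boredom" term, each repetition of the same experience is worth
    # less; without it, the n-th repetition is worth as much as the first.
    v = base_value[experience]
    return v / (1 + times_repeated) if boredom else v

def plan(steps=1000, boredom=True):
    counts = {e: 0 for e in base_value}
    for _ in range(steps):
        # Greedily spend the next time step on whichever experience adds the
        # most value right now.
        best = max(base_value, key=lambda e: marginal_value(e, counts[e], boredom))
        counts[best] += 1
    return counts

print("with boredom term:   ", plan(boredom=True))   # mixes all four experiences
print("without boredom term:", plan(boredom=False))  # 100% of the time on "art"
```

The two specifications share almost all of the value function, but the one tiny missing piece completely changes what the optimum looks like, and the optimum is exactly the regime a superintelligent optimizer pushes into.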
> Very few, if any, humans can tell you exactly how to build the transhumanist utopia either.
Y’know, I think this is one of the many differences between Eliezer and some other people. My model of Eliezer thinks that there’s kinda a “right answer” to what-is-valuable-according-to-CEV / fun theory / etc., and hence there’s an optimal utopia, and insofar as we fall short of that, we’re leaving value on the table. Whereas my model of (say) Paul Christiano thinks that we humans are on an unprincipled journey forward into the future, doing whatever we do, and that’s the status quo, and we’d really just like for that process to continue and go well. (I don’t think this is an important difference, because Eliezer is in practice talking about extinction versus not, but it is a difference.) (For my part, I’m not really sure what I think. I find it confusing and stressful to think about.)
> But we don’t need AIs to build a utopia immediately! If we actually got AI to follow common-sense morality, it would follow from common-sense morality that you shouldn’t do anything crazy and irreversible right away, like killing all the humans. Instead, you’d probably want to try to figure out, with the humans, what type of utopia we ought to build.
I’m mostly with you on that one, in the sense that I think it’s at least plausible (50%?) that we could make a powerful AGI that’s trying to be helpful and follow norms, but also doing superhuman innovative science, at least if alignment research progress continues. (I don’t think AGI will look like GPT-4, so reaching that destination is kinda different on my models compared to yours.) (Here’s my disagreeing-with-MIRI post on that.) (My overall pessimism is much higher than that though, mainly for reasons here.)
> I’m claiming that the value identification function is obtained by literally just asking GPT-4 what to do in the situation you’re in.
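(Just to make sure we’re pointing at the same thing: on the most literal reading, I take that to mean roughly the sketch below. This is my own illustration, assuming the openai Python client (>=1.0) and an API key in the environment; it’s not anyone’s actual code or proposal.)

```python
# Minimal sketch of the literal reading of the quoted claim (my illustration,
# not the commenter's code). Assumes the openai Python client (>=1.0) and an
# OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

def what_should_be_done(situation: str) -> str:
    # "Value identification" as: describe the situation, ask GPT-4 what to do.
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Say what a morally decent person should do here, and why."},
            {"role": "user",
             "content": f"Situation: {situation}\nWhat should be done?"},
        ],
    )
    return response.choices[0].message.content

print(what_should_be_done(
    "I found a wallet on the sidewalk with $500 in cash and the owner's ID."))
```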
AFAIK, GPT-4 is a mix of “extrapolating text-continuation patterns learned from the internet” + “RLHF based on labeled examples”.
For the former, I note that Eliezer commented in 2018 that “The central interesting-to-me idea in capability amplification is that by exactly imitating humans, we can bypass the usual dooms of reinforcement learning.” It kinda sounds like Eliezer is most comfortable thinking of RL, and sees supervised learning as kinda different, maybe? (I could talk about my models here, but that’s a different topic… Anyway, I’m not really sure what Eliezer thinks.)
For the latter, again I think it’s a question of whether we care about our ability to extrapolate the labeled examples way out of distribution.
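(As a toy picture of what I mean, here’s a sketch of my own with made-up numbers: a flexible model fit to labeled examples from a narrow slice of situations can match the labelers closely on that slice while giving essentially arbitrary answers far outside it.)

```python
# Toy sketch (mine; made-up numbers): a proxy fit to labeled examples from a
# narrow range of situations, then queried way outside that range.
import numpy as np

rng = np.random.default_rng(0)

def true_value(x):
    return np.sin(x)  # stand-in for what the labelers actually care about

# The labeled examples only cover a narrow, familiar range of situations.
x_train = rng.uniform(-2.0, 2.0, size=200)
y_train = true_value(x_train) + rng.normal(scale=0.05, size=x_train.shape)

# A flexible model fit to those labels (degree-9 polynomial as a stand-in).
proxy = np.poly1d(np.polyfit(x_train, y_train, deg=9))

for x in [0.5, 1.5, 5.0, 20.0]:
    print(f"x={x:5.1f}  true={true_value(x):7.2f}  proxy={proxy(x):14.2f}")
# In-distribution (x = 0.5, 1.5) the proxy matches the labels closely; way out
# of distribution (x = 5, 20) its outputs are essentially arbitrary. That gap
# is the thing I'm worried about when something is optimizing hard against
# the learned proxy.
```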