Here’s the thing, though. I think the specifically relevant reference class here is “what happens when an agent interacts with another (set of) agents with disparate values for the first time in its life?”. And instances of that in human history are… not pleasant. Wars, genocide, xenophobia. Over time, we’ve managed to select for cultural memes that sanded off the edges of the instinctive hostility – liberal egalitarian values, et cetera. But there was a painfully bloody process in between.
I probably agree with this, with the caveat that this reference class could be horribly biased towards the negative, especially if we are specifically selecting for the cases where things turned out badly.
And I would actually agree that if we could genuinely raise the AGI like a child – pluck it out of the training loop while it’s still human-level, get as much insight into its cognition as we have into the human cognition, then figure out how to intervene on its conscious-value-reflection process directly – then we’d be able to align it. The problem is that we currently have no tools for that at all.
I think I have 2 cruxes here, actually.
My main crux is that I think there will be large incentives, independent of LW, to create those tools to the extent that they don’t already exist, so I generally assume they will be created whether LW exists or not – primarily because of the massive value that can be captured from AI control, plus social incentives, plus the fact that the costs of failure are much more internalized.
My other crux probably has to do with AI alignment being easier than human alignment. One big reason is that I expect AIs to always be much more transparent than humans, because of the white-box thing; the black-box framing that AI safety people push is just false and will give wildly misleading intuitions about AI and its safety.
But I think equating “the generator of human values” with “the brain’s learning algorithms” is a mistake.
I think this is another crux. While I agree that values and capabilities are different, and that the difference can matter, I do think a lot of the generator of human values borrows from the brain’s learning algorithms, and that the distinction between values and capabilities is looser than a lot of LWers think.
My main crux is that I think that there will be large incentives independent of LW to create those tools, to the extent that they don’t actually exist
Mind expanding on that? Which scenarios are you envisioning?
the black-box framing that AI safety people push is just false and will give wildly misleading intuitions for AI and its safety
They are “white-box” in the fairly esoteric sense used in “AI is easy to control”, yes: “white-box” relative to SGD. But that’s really quite an esoteric sense, as in I’ve never seen the term used that way before.
They are very much not white-box in the usual sense, where we can look at a system and immediately understand what computations it’s executing. Any more than looking at a homomorphically-encrypted computation without knowing the key makes it “white-box”; any more than looking at the neuroimaging of a human brain makes the brain “white-box”.
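To make that distinction concrete, here’s a minimal illustrative sketch (pure Python; the toy network, its weights, and the data point are all made up for illustration). It shows the esoteric sense of “white-box” – we can read and differentiate every single parameter, which is all SGD needs – while illustrating that this access by itself tells us nothing about *what* function the weights compute:

```python
import math
import random

random.seed(0)

# A tiny one-hidden-layer network: every parameter is directly inspectable.
W1 = [[random.gauss(0, 1) for _ in range(2)] for _ in range(4)]
W2 = [random.gauss(0, 1) for _ in range(4)]

def forward(x):
    """Full read access to every weight and intermediate activation."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    return sum(w * hi for w, hi in zip(W2, h))

def loss(x, y):
    return 0.5 * (forward(x) - y) ** 2

def grad_entry(i, j, x, y, eps=1e-6):
    """Central-difference gradient of a single weight of W1.

    This read/write/differentiate access to every parameter is the sense
    in which the system is 'white-box' to SGD.
    """
    old = W1[i][j]
    W1[i][j] = old + eps
    up = loss(x, y)
    W1[i][j] = old - eps
    down = loss(x, y)
    W1[i][j] = old
    return (up - down) / (2 * eps)

x, y = [0.3, -0.7], 1.0
g = grad_entry(0, 0, x, y)

# SGD can steer every weight via gradients like `g` -- yet nothing in the
# raw numbers of W1 and W2 says *what* function the network computes.
# Recovering an interpretable algorithm from the weights is a separate,
# largely unsolved problem, and that is the usual sense of "white-box"
# that these systems do not satisfy.
```

Gradient access lets us push the weights around; it gives essentially no understanding of the computation they encode.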
Mind expanding on that? Which scenarios are you envisioning?
My general scenario is that as AI progresses and society reacts to that progress, there will be growing incentives to increase the amount of control we have over AI, because the consequences of not aligning AIs will be very high for the developer, both commercially and legally.
Essentially, the scenario is one where misbehavior like Bing’s gets trained away via RLHF, DPO, or whatever the alignment method du jour is, and AIs become more aligned because of the profit incentives for controlling them.
The entire Bing debacle, and its eventual resolution in GPT-4, is an interesting test case: Microsoft essentially managed to turn a misaligned chatbot into a far more aligned one. I also partially disagree with the claim that RLHF is a mere mask over some true underlying behavior; it’s quite a lot more effective than that.
More generally, my point here is that in the AI case there are strong incentives to make AI controllable and weak incentives to make it uncontrollable, which is why I was optimistic about companies making aligned AIs.
When we get to scenarios that don’t involve AI control issues, things get worse.