My main crux is that I think there will be large incentives, independent of LW, to create those tools, to the extent that they don’t already exist.
Mind expanding on that? Which scenarios are you envisioning?
the black-box framing that AI safety people push is just false and will give wildly misleading intuitions for AI and its safety
They are “white-box” in the fairly esoteric sense used in “AI is easy to control”, yes: “white-box” relative to SGD. But that really is quite an esoteric sense; I’ve never seen the term used this way before.
They are very much not white-box in the usual sense, where we can look at a system and immediately understand what computations it’s executing. Looking at the weights makes them no more “white-box” than looking at a homomorphically-encrypted computation without knowing the key makes that computation “white-box”, or than looking at neuroimaging of a human brain makes the brain “white-box”.
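To make the distinction concrete, here’s a toy PyTorch sketch (purely illustrative; the tiny model and data are made up). It shows the only sense in which training is “white-box”: SGD sees every weight and exact gradient, yet nothing in that access says what function the network implements.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A tiny made-up network: we have total access to its internals.
model = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
x, y = torch.randn(16, 8), torch.randn(16, 1)

# One training step's worth of "white-box" access: exact loss and gradients.
loss = F.mse_loss(model(x), y)
loss.backward()

for name, p in model.named_parameters():
    # Full transparency at the level SGD needs: shapes, values, gradients...
    print(name, tuple(p.shape), p.grad.norm().item())
# ...but none of this tells us, in human-legible terms, what the network computes.
```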
Mind expanding on that? Which scenarios are you envisioning?
My general scenario is that, as AI progresses and society reacts more to that progress, there will be incentives to increase the amount of control we have over AI, because the consequences of not aligning AIs will be very high for the developer, both commercially and legally.
Essentially, the scenario is one where unaligned AIs like Bing get trained away via RLHF, DPO, or whatever the alignment method du jour is, and the AIs become more aligned because of the profit incentives for controlling AIs.
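For concreteness, DPO here is Direct Preference Optimization (Rafailov et al. 2023). A minimal sketch of its loss, assuming per-sequence log-probabilities have already been computed for the policy and a frozen reference model (the function and argument names are mine, not from any library):

```python
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Push the policy to prefer 'chosen' over 'rejected' completions,
    staying close to the reference model (beta controls the tradeoff)."""
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    # Maximize the margin between chosen and rejected log-ratios.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```

The point being that this is an ordinary, well-understood training objective a lab can apply under commercial pressure, not an exotic intervention.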
The entire Bing debacle, and how the misalignment was ultimately resolved in GPT-4, is an interesting test case: Microsoft essentially managed to take it from a misaligned chatbot to a far more aligned one. I also partially dispute the claim that RLHF is a mere mask over some “true” behavior, because it’s quite a lot more effective than that.
More generally, my point here is that in the AI case there are strong incentives to make AI controllable and weak incentives to make it uncontrollable, which is why I was optimistic about companies making aligned AIs.
When we get to scenarios that don’t involve AI control issues, things get worse.