The old paradox: for an AI to care it must first understand, but understanding requires high capability, and that capability is lethal if the AI doesn’t care.
But it turns out we get understanding before lethal levels of capability. So that understanding can now be a target of optimization. There is still significant risk, since there are multiple possible internal mechanisms/strategies the AI could be deploying to hit that same target: deception, actual caring, something I’ve been calling detachment, and possibly others.
This is what the discourse should be focusing on, IMO. This is the update/direction I want to see you make. The sequence in which things get learned/internalized/chiseled matters.
My imagined Eliezer has many replies to this, with numerous branches in the dialogue/argument tree that I don’t want to get into now. But this *first step*, recognizing the new place we are in, specifically wrt the ability to target human values (whether for deceptive, disinterested, detached, or actually-caring reasons!), needs to be taken imo, rather than repeating the line of “of course I understood that a superint would understand human values; this isn’t an update for me”.
(edit: My comments here are regarding the larger discourse, not just this specific post or reply-chain)