I see your point as warning against approaches that are like “get the AI entangled with stuff about humans and hope that helps”.
There are other approaches with a goal more like “make it possible for the humans to steer the thing and have scalable oversight over what’s happening”.
So my alternative take is: a solution to AI alignment should include the ability for the developers to notice if the utility function is borked by a minus sign!
And if you wouldn’t notice something as blatant as a flipped minus sign, you’re probably also going to have trouble noticing subtler misalignment.
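(To make the minus-sign failure concrete, here’s a minimal toy sketch of the kind of cheap sanity check I have in mind; the names `toy_reward`, `KNOWN_GOOD`, and `KNOWN_BAD` are made up for illustration, not from any real system. If your oversight process can’t even flag a reward that ranks a known-bad outcome above a known-good one, it’s not going to catch subtler problems.)

```python
# Hypothetical sketch: a sanity check that would catch a sign-flipped reward.
# All names here are invented for the example.

def toy_reward(outcome: float) -> float:
    # Imagine this was meant to return +outcome, but someone typed -outcome.
    return -outcome  # <-- the "borked by a minus sign" bug

KNOWN_GOOD = 1.0   # an outcome everyone agrees is good
KNOWN_BAD = -1.0   # an outcome everyone agrees is bad

def check_reward_sign() -> None:
    """Fail loudly if the reward ranks an obviously bad outcome above a good one."""
    assert toy_reward(KNOWN_GOOD) > toy_reward(KNOWN_BAD), (
        "Reward ranks a known-bad outcome above a known-good one -- "
        "possible sign flip in the utility/reward function."
    )

if __name__ == "__main__":
    check_reward_sign()  # raises AssertionError because of the buggy minus sign above
```

The point isn’t that this particular assert solves anything; it’s that “developers can steer and oversee the system” should at minimum include checks coarse enough to catch this class of bug.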