Galaxy-brained reason not to work on AI alignment: anti-aligned ASI is orders of magnitude worse than aligned ASI is good, so it’s better to ensure that the values of the Singularity are more or less orthogonal to CEV (which happens by default).
I see your point as warning against approaches that are like “get the AI entangled with stuff about humans and hope that helps”.
There are other approaches with a goal more like “make it possible for the humans to steer the thing and have scalable oversight over what’s happening”.
So my alternative take is: a solution to AI alignment should include the ability for the developers to notice if the utility function is borked by a minus sign!
And if you wouldn’t notice something as wrong as a minus sign, you’re probably in trouble about noticing other misalignment.
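To make “notice the minus sign” concrete, here’s a minimal sketch of the kind of check I have in mind (the names `check_reward_sign`, `reward_fn`, and the probe pairs are made up for illustration, not any lab’s actual tooling): probe the learned reward/utility function with reference pairs whose ordering humans already agree on. A minus-sign bug inverts every comparison, so even a handful of probes catches it before any optimization pressure is applied.

```python
# Hypothetical sanity check: a flipped reward sign inverts every human-agreed
# comparison, so a few reference probes catch it before training starts.

def check_reward_sign(reward_fn, probe_pairs):
    """reward_fn: callable mapping a sample to a scalar reward.
    probe_pairs: list of (clearly_better, clearly_worse) reference samples."""
    inverted = sum(
        reward_fn(better) <= reward_fn(worse)
        for better, worse in probe_pairs
    )
    if inverted == len(probe_pairs):
        raise ValueError("Every probe pair is inverted -- the reward sign is likely flipped.")
    if inverted > 0:
        print(f"Warning: {inverted}/{len(probe_pairs)} probes disagree with human judgment.")

# e.g. check_reward_sign(reward_model.score, [(helpful_reply, threatening_reply), ...])
```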
I had a long back-and-forth about that topic here. Among other things, I disagree that “more or less orthogonal to CEV” is the default in the absence of alignment research, because people will presumably be trying to align their AIs, and I think there will be obvious techniques which work well enough to get out of the “random goal” regime, but not well enough for reliability.
Worth remembering that sign flips of the reward function do happen in practice: https://openai.com/blog/fine-tuning-gpt-2/#bugscanoptimizeforbadbehavior — the GPT-2 fine-tuning bug that flipped the reward sign and made the model optimize for the outputs humans rated worst (“Was this a loss to minimize or a reward to maximize...”).
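To illustrate the failure mode, here’s a toy sketch (not the code from that incident): in a policy-gradient update, the same scalar used with the wrong sign convention trains the policy toward the lowest-reward behavior, and nothing crashes or looks obviously broken.

```python
import torch

# Toy illustration of the loss-vs-reward sign convention trap (not the code
# from the incident above). The correct policy-gradient step maximizes reward,
# i.e. minimizes -reward * logprob; dropping the minus sign silently trains
# the policy toward whatever the reward model scores lowest.

logprob = torch.tensor(-1.2, requires_grad=True)  # log-prob of the sampled action
reward = torch.tensor(3.0)                        # scalar from the reward model

correct_loss = -(reward * logprob)  # gradient descent pushes logprob up: reinforce the action
buggy_loss = reward * logprob       # gradient descent pushes logprob down: suppress the action

correct_loss.backward()
grad_correct = logprob.grad.clone()   # tensor(-3.)
logprob.grad = None
buggy_loss.backward()
grad_buggy = logprob.grad.clone()     # tensor(3.)

print(grad_correct, grad_buggy)  # same magnitude, opposite direction; training still "works"
```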
people trying to align their AIs == alignment research