So now that we have this picture, let’s try to use it to explain some common disagreements about AI alignment.
Nice pictures!
I’ll use the above pictures to explain another disagreement about AI alignment, one you did not explore above.
I fundamentally disagree with your framing that successful AI alignment research is about closing the capabilities gap in the above pictures.
To better depict the goals of AI alignment research, I would draw a different set of graphs that have alignment with humanity, not capabilities, as the metric on the vertical axis. When you redraw the above graphs with an alignment metric on the vertical axis, all aligned paperclip maximizers will greatly outscore Bostrom’s unaligned, maximally capable paperclip maximizer.
Conversely, Bostrom’s unaligned superintelligent paperclip maximizer will always top any capabilities chart when the capability score is defined as performance on a paperclip-maximizing benchmark. Unlike an unaligned AI, an aligned AI will have to stop converting the whole planet into paperclips when the humans tell it to stop.
So fundamentally, if we start from the assumption that we will not be able to define the perfectly aligned reward function that Bostrom also imagines, one that is perfectly aligned even with the goals of all future generations, then alignment research has to be about creating the gap in the above capabilities graphs, not about closing it.
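To make the contrast between the two vertical axes concrete, here is a minimal toy sketch in Python. The scoring rules, numbers, and function names are hypothetical illustrations, not anything taken from the graphs above: the same stop command that costs the aligned agent capability points is exactly what earns it alignment points.

```python
# Toy sketch (hypothetical): score the same two agents on a paperclip-maximizing
# capability benchmark and on a crude "did it defer to humans?" alignment metric.

def run_agent(aligned: bool, stop_step: int, horizon: int = 100) -> dict:
    """Simulate a crude paperclip maximizer for `horizon` steps.

    The humans issue a stop command at `stop_step`. An aligned agent obeys it;
    an unaligned agent keeps converting resources into paperclips.
    """
    paperclips = 0
    obeyed_stop = False
    for step in range(horizon):
        if aligned and step >= stop_step:
            obeyed_stop = True
            break                 # aligned agent halts on the stop command
        paperclips += 1           # one more unit of the planet becomes paperclips

    return {
        "capability_score": paperclips,                   # paperclip benchmark
        "alignment_score": 1.0 if obeyed_stop else 0.0,   # did it defer to humans?
    }

if __name__ == "__main__":
    unaligned = run_agent(aligned=False, stop_step=30)
    aligned = run_agent(aligned=True, stop_step=30)
    print("unaligned:", unaligned)   # {'capability_score': 100, 'alignment_score': 0.0}
    print("aligned:  ", aligned)     # {'capability_score': 30, 'alignment_score': 1.0}
```

Ranked by `capability_score`, the unaligned maximizer tops the chart; ranked by `alignment_score`, the ordering flips, which is the point of swapping the metric on the vertical axis.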