There is a case to be made that aligned AI doesn’t have to be competitive with unaligned AI; it just has to be much better than humans at alignment research. If this holds, then we can delegate the rest of the problem to the AI.
Where this might fail: solving the alignment problem may take so much work that even that superhuman aligned AI will not do it in time to build the “next stage” aligned AI (i.e. before the even-more-superhuman unaligned AI is deployed). In this case, it might be advantageous to have mere humans making extra alignment progress between the point at which the “first stage” solution is available and the point at which the first-stage aligned AI is deployed.
The bigger the capability gap between the first-stage aligned AI and humans, the less valuable this extra progress becomes (because the AI would be able to do it on its own much faster). On the other hand, the smaller the time difference between the deployment of the first-stage aligned AI and the deployment of unaligned AI, the more valuable this extra progress becomes.
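A rough toy model of this tradeoff (the variables W, k, T and ΔW below are illustrative assumptions, not quantities specified in the discussion): let W be the alignment work remaining when the first-stage aligned AI is deployed, k its speedup factor over human researchers, T the window before unaligned AI is deployed, and ΔW the extra work humans finish before deployment. Then, roughly:

\[
\text{success without pre-work: } \frac{W}{k} \le T,
\qquad
\text{success with pre-work: } \frac{W - \Delta W}{k} \le T.
\]

The human pre-work buys ΔW/k of calendar time, which shrinks as the capability gap k grows, and it changes the outcome only when W/k exceeds T by less than ΔW/k, i.e. only when the deployment margin is already tight. That matches both directions of the claim above.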
I don’t find this at all reassuring, because by the same construction your aligned-alignment-researcher-AI may now be racing an unaligned-capabilities-researcher-AI. My intuition is that the latter is the easier problem, because you don’t have to worry about complexity of value or anything; you just have to make the loss/Tflop go down.
If an equally advanced unaligned AI is deployed earlier than the aligned AI, then we might be screwed anyway. My point is that if the aligned AI is deployed earlier by a sufficient margin, it can bootstrap itself into an effective anti-unaligned-AI shield in time.
Yeah, I talk about this in the first bullet point here (which I linked from the “How useful is it...” section).