The issue arises specifically in the situation of recursive self-improvement: a consistent mathematical framework of “sufficient complexity” (that is, one strong enough to encode basic arithmetic) cannot prove its own consistency. This is Gödel’s second incompleteness theorem.
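For concreteness, one standard way to state the underlying result (the notation here is mine; the post keeps things informal): if $T$ is a consistent, recursively axiomatizable theory that interprets enough arithmetic, then

\[ T \nvdash \mathrm{Con}(T), \]

where $\mathrm{Con}(T)$ is the arithmetized sentence asserting that no contradiction is derivable from $T$.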
What this cashes out to is the following: treat the AI as one mathematical framework, and the next generation of AI (designed by the first) as a second one. If both are of “sufficient complexity”, the first framework cannot prove that the umbrella framework comprising both of them is free of contradictions. Which means an AI cannot -prove- that a successor AI has not experienced value drift, where “no value drift” is understood here as the combined mathematical framework containing no contradictions.
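To spell out the step being relied on here (again in my notation, and under the assumption that the creator’s framework can verify that its own axioms are among those of the umbrella framework): write $T$ for the first AI’s framework and $T \cup S$ for the umbrella framework comprising it and its successor’s. Every $T$-proof is also a $(T \cup S)$-proof, and $T$ can verify this, so $T \vdash \mathrm{Con}(T \cup S) \rightarrow \mathrm{Con}(T)$. If $T$ could prove the consistency of the umbrella framework, it would therefore prove its own, which the second incompleteness theorem forbids; hence

\[ T \nvdash \mathrm{Con}(T \cup S). \]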
To illustrate the issue, suppose a powerful creator AI is designing its successor. The successor is presumably more powerful than the creator in some fashion, so there are regions of the combinatorially large search space that the successor can explore (in a reasonable timeframe) but that the creator cannot. If the creator could prove there are no contradictions in the combined mathematical framework, then, supposing its values are embedded in that framework in a provable manner, it could be assured that the successor has not experienced value drift; but that consistency proof is exactly what the incompleteness theorem denies it.
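A possible formal gloss on the assurance the creator wants (my rendering; the sentence $V$ and the provability-predicate reading of “values embedded in a provable manner” are assumptions, not something the post commits to):

\[ T \vdash \mathrm{Con}(T \cup S) \quad\text{and}\quad T \vdash \mathrm{Prov}_{T \cup S}(\ulcorner V \urcorner), \]

where $V$ expresses the creator’s values and $\mathrm{Prov}_{T \cup S}$ is the arithmetized provability predicate for the umbrella framework. Whatever the status of the second conjunct, the first is exactly the consistency claim ruled out above.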
Mind, I don’t particularly think the above scenario is terribly likely; I have strong doubts about basically everything in there, in particular the idea of provable values. I created the post for the five or six people who might still be interested in ideas I haven’t seen kicked around on Less Wrong for over a decade.