I, too, think that AIs that don’t optimize a function over the universe (but might optimize one over a domain) are more likely to be safe. This is quite related to the idea of tool AI, proposed by Holden and criticized by Eliezer.
The key here seems to be evaluating and searching for self-improvements in a way that won’t cause optimization over universe states. In theory, the evaluation of a self-improvement could be restricted to a domain: does this modification help me play chess better, according to a model of the situation in which a Cartesian boundary exists and I am running on an abstract virtual machine rather than on physical computers embedded in the universe?
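To make this concrete, here is a minimal, hypothetical sketch of what a purely domain-restricted evaluation might look like, using a toy Nim game as a stand-in for chess. All the names here (`evaluate_in_domain`, `candidate_policy`, and so on) are my own illustrative inventions, not an existing API, and a real system would of course be far more complicated.

```python
import random

# Toy domain standing in for chess: single-pile Nim, take 1-3 stones per turn,
# taking the last stone wins. A "self-improvement" here is just a new
# move-selection policy. Evaluation happens entirely inside this abstract
# model: the policy only ever sees game states, never the physical machine.

def legal_moves(pile):
    return [n for n in (1, 2, 3) if n <= pile]

def random_policy(pile):
    return random.choice(legal_moves(pile))

def candidate_policy(pile):
    # Proposed self-improvement: play the standard Nim strategy when possible.
    m = pile % 4
    return m if m in legal_moves(pile) else random.choice(legal_moves(pile))

def play(first, second, pile=21):
    """Simulate one game; return True if the first player takes the last stone."""
    players = (first, second)
    turn = 0
    while pile > 0:
        pile -= players[turn % 2](pile)
        turn += 1
    return (turn - 1) % 2 == 0

def evaluate_in_domain(candidate, baseline=random_policy, n_games=1000):
    """Score a candidate policy by its win rate against the baseline,
    measured only inside the abstract game model."""
    return sum(play(candidate, baseline) for _ in range(n_games)) / n_games

print(evaluate_in_domain(candidate_policy))  # well above 0.5 for the Nim strategy
```

Nothing in `evaluate_in_domain` refers to anything outside the game model, which is the property the Cartesian-boundary framing is supposed to capture.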
We could separate this task into generation of possible self-improvements and tests of self-improvements (a suggester/verifier architecture; a minimal sketch of the resulting loop follows the list below). If the suggester is superintelligent and dangerous (e.g. as a consequence of being universe-optimizing), then we cannot safely test these self-improvements, for a few reasons:
- Testing a self-improvement involves executing code. If the virtual machine has security flaws, this could let the suggester cause the AI to rewrite itself into a universe-optimizer or otherwise do something dangerous.
- It is possible to create self-improvements that look good to the verifier but contain information that psychologically manipulates the humans involved (see the AI-box experiment).
- A self-modification could score fairly well according to the abstract evaluation function while also behaving in a universe-optimizing manner some of the time. The verifier would consider such a modification suboptimal, but might still rate it well enough to adopt.
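Here is a hypothetical sketch of the suggester/verifier loop itself; `suggest_candidates` and `evaluate_in_domain` are stand-ins for whatever those components would actually be. Note what the verifier does not check: it only scores domain performance, which is exactly why the third failure mode above can slip through.

```python
def self_improvement_step(current_policy, suggest_candidates, evaluate_in_domain,
                          margin=0.02):
    """One round of the suggester/verifier loop: adopt the first candidate
    whose domain score beats the current policy's by some margin.

    The verifier only looks at the domain score. It says nothing about what
    else the candidate does, so a dangerous suggester cannot simply be
    caught at this stage.
    """
    baseline = evaluate_in_domain(current_policy)
    for candidate in suggest_candidates(current_policy):
        if evaluate_in_domain(candidate) > baseline + margin:
            return candidate  # adopt the improvement
    return current_policy  # no acceptable candidate; keep the old policy
```

With the earlier Nim sketch, `evaluate_in_domain` and a generator of candidate policies could be plugged straight into this loop.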
So we can’t just use any old suggester and rely on the verifier to do the heavy lifting. Luckily, it may be possible to construct a suggester that acts as if it is running on an abstract virtual machine. If the suggester does not (directly or indirectly) use universe-optimization as a way to generate self-improvements, then it shouldn’t generate universe-optimizing (and therefore dangerous) self-improvements. The key gotcha is the “indirectly” part: how do we know the suggester isn’t (e.g.) using many different heuristics to come up with improvements, where some combination of those heuristics ends up expressing something like “try creating improvements by optimizing the universe”? In other words, is universe-optimization a useful enough strategy for finding improvements that good general abstract-mathematical-function-optimizers will pick up on it? I don’t know the answer to this question. But if we can design suggesters that don’t directly or indirectly optimize a function over the universe, then maybe this approach will work.
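One hypothetical way to cash out “a suggester that only searches within the domain” is to restrict its entire search space to whitelisted, domain-level heuristics; the sketch below does this for the toy Nim domain from before. Everything here is illustrative, and, as argued above, this construction only rules out *direct* universe-optimization, not combinations of heuristics that indirectly amount to it.

```python
import itertools
import random

# Toy Nim domain again: a suggester whose whole search space is a whitelist
# of domain-level heuristics plus simple mixtures of them. By construction it
# never directly optimizes over universe states; whether some combination of
# heuristics indirectly amounts to that is exactly the open question above.

def legal_moves(pile):
    return [n for n in (1, 2, 3) if n <= pile]

def take_min(pile):
    return min(legal_moves(pile))

def take_max(pile):
    return max(legal_moves(pile))

def take_mod4(pile):
    m = pile % 4
    return m if m in legal_moves(pile) else random.choice(legal_moves(pile))

DOMAIN_HEURISTICS = [take_min, take_max, take_mod4]

def suggest_candidates(current_policy=None):
    """Yield candidate policies: each whitelisted heuristic, then 50/50
    mixtures of each pair. No candidate is produced by reasoning about
    anything outside the game model."""
    for h in DOMAIN_HEURISTICS:
        yield h
    for a, b in itertools.combinations(DOMAIN_HEURISTICS, 2):
        yield (lambda pile, a=a, b=b: random.choice((a, b))(pile))
```

Plugging this `suggest_candidates` into the loop sketched earlier gives a complete (if toy) pipeline whose search never leaves the game model.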