The main way complexity of this sort would be addressable is if the intellectual artifact that you tried to prove things about were simpler than the process that you meant the artifact to unfold into. For example, the mathematical specification of AIXI is pretty simple, even though the hypotheses that AIXI would (in principle) invent upon exposure to any given environment would mostly be complex. Or for a more concrete example, the Gallina kernel of the Coq proof engine is small and was verified to be correct using other proof tools, while most of the complexity of Coq is in built-up layers of proof search strategies which don’t themselves need to be verified, since the proofs they generate are checked by Gallina.
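To make that division of labor concrete, here is a toy sketch in Python (entirely hypothetical; nothing here is Coq’s actual architecture or API): a small, auditable checker validates whatever an arbitrarily complicated, untrusted search procedure produces, so only the checker needs to be verified.

```python
from itertools import product

# Toy Hilbert-style system; axioms and formulas are plain strings.
AXIOMS = {"p", "p -> q", "q -> r"}

def modus_ponens(a, b):
    """If b has the form 'a -> c', return c; otherwise None."""
    prefix = a + " -> "
    return b[len(prefix):] if b.startswith(prefix) else None

def check(proof, goal):
    """Trusted kernel (the analogue of Gallina): verify that every step is
    an axiom or follows from earlier steps by modus ponens, and that the
    final step is the goal."""
    if not proof:
        return False
    derived = []
    for step in proof:
        ok = step in AXIOMS or any(
            modus_ponens(a, b) == step for a, b in product(derived, repeat=2))
        if not ok:
            return False
        derived.append(step)
    return derived[-1] == goal

def search(goal, depth=4):
    """Untrusted strategy layer: blind forward chaining.  It could be made
    far cleverer without growing the trusted kernel, because its output is
    re-checked by `check` anyway."""
    known = set(AXIOMS)
    trace = sorted(AXIOMS)
    for _ in range(depth):
        new = {modus_ponens(a, b) for a, b in product(known, repeat=2)}
        new.discard(None)
        trace += sorted(new - known)
        known |= new
        if goal in known:
            return trace
    return None

proof = search("r")
print(check(proof, "r"))  # True: the small kernel accepts the big search's output
```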
Isn’t that as unbelievable as the idea that you can prove that a particular zygote will never grow up to be an evil dictator? Surely this violates some principles of complexity, chaos [...]
Yes, any physical system could be subverted with a sufficiently unfavorable environment. You wouldn’t want to prove perfection. The thing you would want to prove would be more along the lines of, “will this system become at least roughly as capable of recovering from any disturbance, and of going on to achieve a good result, as it would have been if its designers had thought specifically about what to do in case of each possible disturbance?”. (Ideally, this category of “designers” would also bleed over, in a principled way, into the category of “moral constituency”, as in CEV.) Which, in turn, would require a proof of something along the lines of “the process is highly likely to make it to the point where it knows enough about its designers to be able to mostly duplicate their hypothetical reasoning about what it should do, without anything going terribly wrong”.
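As a purely illustrative sketch (the states, dynamics, and property below are invented, not a proposal), the kind of statement gestured at here—“from any state an unfavorable environment can force, the system gets back to a good state”—can at least be checked exhaustively in a tiny toy model:

```python
GOOD = {"nominal"}

# The system's own (deterministic, toy) recovery dynamics.
STEP = {
    "nominal":    "nominal",
    "perturbed":  "recovering",
    "corrupted":  "recovering",
    "recovering": "nominal",
}

# States an unfavorable environment is assumed able to force the system into.
DISTURBANCES = ["perturbed", "corrupted"]

def recovers(state, horizon=5):
    """Does the system reach a good state within `horizon` of its own steps?"""
    for _ in range(horizon + 1):
        if state in GOOD:
            return True
        state = STEP[state]
    return False

# Exhaustive checking only works because the model is tiny; the hoped-for
# formalization would have to say something about systems too big for this.
assert all(recovers(d) for d in DISTURBANCES)
print("recovery property holds for every modeled disturbance")
```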
We don’t know what an appropriate formalization of something like that would look like. But there is reason for considerable hope that such a formalization could be found, and that this formalization would be sufficiently simple that an implementation of it could be checked. This is because a few other aspects of decision-making which were previously mysterious, and which could only be discussed qualitatively, have had powerful and simple core mathematical descriptions discovered for cases where simplifying modeling assumptions perfectly apply. Shannon information was discovered for the informal notion of surprise (with the assumption of independent identically distributed symbols from a known distribution). Bayesian decision theory was discovered for the informal notion of rationality (with assumptions like perfect deliberation and side-effect-free cognition). And Solomonoff induction was discovered for the informal notion of Occam’s razor (with assumptions like a halting oracle and a taken-for-granted choice of universal machine). These simple conceptual cores can then be used to motivate and evaluate less-simple approximations for situations where the assumptions about the decision-maker don’t perfectly apply. For the AI safety problem, the informal notions (for which the mathematical core descriptions would need to be discovered) would be a bit more complex—like the “how to figure out what my designers would want to do in this case” idea above. Also, you’d have to formalize something like our informal notion of how to generate and evaluate approximations, because approximations are more complex than the ideals they approximate, and you wouldn’t want to have to directly verify the safety of any more approximations than you had to. (But note that, for reasons related to Rice’s theorem, you can’t (and therefore shouldn’t want to) lay down universally perfect rules for approximation in any finite system.)
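Two of those formalizations are simple enough to show in a few lines each (Solomonoff induction is uncomputable, so it is left out); the numbers and labels below are made up purely for illustration:

```python
from math import log2

# Shannon: the "surprise" of a symbol is -log2 of its probability under the
# (assumed known, i.i.d.) source distribution.
source = {"a": 0.5, "b": 0.25, "c": 0.25}
surprise = {sym: -log2(p) for sym, p in source.items()}
print(surprise)                      # {'a': 1.0, 'b': 2.0, 'c': 2.0}

# Bayesian decision theory: "rationality" as choosing the action with the
# highest expected utility under one's beliefs.
beliefs = {"world_1": 0.7, "world_2": 0.3}
utility = {("act_A", "world_1"): 10, ("act_A", "world_2"): 0,
           ("act_B", "world_1"): 4,  ("act_B", "world_2"): 8}

def expected_utility(act):
    return sum(p * utility[(act, w)] for w, p in beliefs.items())

best = max(["act_A", "act_B"], key=expected_utility)
print(best, expected_utility(best))  # act_A 7.0  (vs. act_B at 5.2)
```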
Two other related points are discussed in this presentation: the idea that a digital computer is a nearly deterministic environment, which makes safety engineering easier for the stages before the AI is trying to influence the environment outside the computer, and the idea that you can design an AI in such a way that you can tell what goal it will at least try to achieve even if you don’t know what it will do to achieve that goal. Presumably, the better your formal understanding of what it would mean to “at least try to achieve a goal”, the better you would be at spotting, and designing an AI to handle, situations that might make it start trying to do something else.
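A minimal sketch of that second idea, with everything below invented for illustration: the goal is an explicit object a designer can read and reason about, while the planner that pursues it stands in for something whose behavior you can’t predict in detail.

```python
import random

def goal(state):
    """The explicit objective: reach a state whose value is at least 10.
    This is the part a designer can inspect and reason about directly."""
    return state >= 10

def opaque_planner(state, objective, rng=random.Random(0)):
    """Stand-in for an arbitrarily complicated search: we don't try to
    predict which actions it picks, only which condition it is steering
    toward, because `objective` is the only success test it consults."""
    trace = []
    while not objective(state):
        action = rng.choice([1, 2, 3])   # opaque action choice
        state += action
        trace.append(action)
    return state, trace

final_state, actions = opaque_planner(0, goal)
assert goal(final_state)      # what it ended up achieving is legible,
print(final_state, actions)   # even though the action sequence wasn't predictable
```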
(Also: Can you offer some feedback as to what features of the site would have helped you become aware sooner that there were arguments behind the positions that you felt were being asserted blindly in a vacuum? The “things can be surprisingly formalizable, here are some examples” argument can be found in lukeprog’s “Open Problems Related to the Singularity” draft and the later “So You Want to Save the World”, though the argument is very short, and its significance is hard to recognize if you don’t already know most of the mathematical formalisms mentioned. A backup “you shouldn’t just assume that there’s no way to make this work” argument is in “Artificial Intelligence as a Positive and Negative Factor in Global Risk”, pp. 12–13.)
what will prevent them from becoming “bad guys” when they wield this much power
That’s a problem where successful/practically applicable formalizations are harder to hope for, so it’s been harder for people to find things to say about it that pass the threshold of being plausible conceptual progress instead of being noisy verbal flailing. See the related “How can we ensure that a Friendly AI team will be sane enough?”. But it’s not like people aren’t thinking about the problem.
This is actually one of the best comments I’ve seen on Less Wrong, especially this part:
Shannon information was discovered for the informal notion of surprise (with the assumption of independent identically distributed symbols from a known distribution). Bayesian decision theory was discovered for the informal notion of rationality (with assumptions like perfect deliberation and side-effect-free cognition). And Solomonoff induction was discovered for the informal notion of Occam’s razor (with assumptions like a halting oracle and a taken-for-granted choice of universal machine). These simple conceptual cores can then be used to motivate and evaluate less-simple approximations for situations where the assumptions about the decision-maker don’t perfectly apply. For the AI safety problem, the informal notions (for which the mathematical core descriptions would need to be discovered) would be a bit more complex—like the “how to figure out what my designers would want to do in this case” idea above.
Thanks for the clear explanation.