Just a gut reaction, but this whole scenario sounds preposterous. Do you guys seriously believe that you can create something as complex as a superhuman AI, and prove that it is completely safe before turning it on? Isn’t that as unbelievable as the idea that you can prove that a particular zygote will never grow up to be an evil dictator? Surely this violates some principles of complexity, chaos, quantum mechanics, etc.? And I would also like to know who these “good guys” are, and what will prevent them from becoming “bad guys” when they wield this much power. This all sounds incredibly naive and lacking in common sense!
The main way complexity of this sort could be addressed is if the intellectual artifact that you tried to prove things about were simpler than the process that you meant the artifact to unfold into. For example, the mathematical specification of AIXI is quite simple, even though the hypotheses that AIXI would (in principle) invent upon exposure to any given environment would mostly be complex. Or, for a more concrete example, the kernel of the Coq proof assistant, which checks proof terms written in its Gallina language, is small and was verified to be correct using other proof tools, while most of the complexity of Coq is in built-up layers of proof-search strategies which don’t themselves need to be verified, because the proofs they generate are re-checked by the kernel.
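To make the small-kernel point concrete, here is a minimal Python sketch. It is only a toy analogy, not Coq’s architecture: the factorization “certificates” stand in for proof terms, kernel_check stands in for the trusted kernel, and untrusted_search for the tactic layers; all the names and numbers are invented for illustration.

```python
# Toy sketch of the trusted-kernel / untrusted-search split (not Coq internals).
# The "kernel" checks factorization certificates; the "search" layer can be as
# complicated or as buggy as you like, because its output is only accepted
# after the kernel re-checks it.

def kernel_check(n, factors):
    """Trusted core: accept `factors` only if every factor is prime and the
    product really is n. Small enough to audit line by line."""
    if not factors or any(f < 2 for f in factors):
        return False
    product = 1
    for f in factors:
        # Trial-division primality test: slow, but obviously correct.
        if any(f % d == 0 for d in range(2, int(f ** 0.5) + 1)):
            return False
        product *= f
    return product == n

def untrusted_search(n):
    """Untrusted layer: any heuristic will do; a bug here can only cause a
    rejected certificate, never a wrongly accepted one."""
    factors, d = [], 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors

n = 2 ** 32 + 1  # 4294967297 = 641 * 6700417
certificate = untrusted_search(n)
assert kernel_check(n, certificate)
print(n, "=", " * ".join(map(str, certificate)))
```

The division of labor is the same one described above: only the few lines of the checker need to be trusted, no matter how elaborate the search that produced the certificate.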
Isn’t that as unbelievable as the idea that you can prove that a particular zygote will never grow up to be an evil dictator? Surely this violates some principles of complexity, chaos [...]
Yes, any physical system could be subverted by a sufficiently unfavorable environment. You wouldn’t want to prove perfection. What you would want to prove would be more along the lines of: “will this system become at least roughly as capable of recovering from disturbances, and of going on to achieve a good result, as it would be if its designers had thought specifically about what to do in case of each possible disturbance?” (Ideally, the category of “designers” would also bleed over, in a principled way, into the category of “moral constituency”, as in CEV.) That, in turn, would require a proof of something along the lines of “the process is highly likely to reach the point where it knows enough about its designers to mostly duplicate their hypothetical reasoning about what it should do, without anything going terribly wrong”.
We don’t know what an appropriate formalization of something like that would look like. But there is reason for considerable hope that such a formalization could be found, and that it would be simple enough for an implementation of it to be checked. This is because a few other aspects of decision-making which were previously mysterious, and which could only be discussed qualitatively, have turned out to have powerful and simple core mathematical descriptions, discovered for cases where simplifying modeling assumptions perfectly apply. Shannon information was discovered for the informal notion of surprise (with the assumption of independent identically distributed symbols from a known distribution). Bayesian decision theory was discovered for the informal notion of rationality (with assumptions like perfect deliberation and side-effect-free cognition). And Solomonoff induction was discovered for the informal notion of Occam’s razor (with assumptions like a halting oracle and a taken-for-granted choice of universal machine). These simple conceptual cores can then be used to motivate and evaluate less-simple approximations for situations where the assumptions about the decision-maker don’t perfectly apply. For the AI safety problem, the informal notions (for which the mathematical core descriptions would need to be discovered) would be a bit more complex—like the “how to figure out what my designers would want to do in this case” idea above. Also, you’d have to formalize something like our informal notion of how to generate and evaluate approximations, because approximations are more complex than the ideals they approximate, and you wouldn’t want to have to directly verify the safety of more approximations than necessary. (Note, though, that for reasons related to Rice’s theorem, you can’t, and therefore shouldn’t want to, lay down universally perfect rules for approximation in any finite system.)
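As a reminder of how compact those conceptual cores are once their modeling assumptions are granted, here is a small Python sketch of two of them, Shannon surprisal and expected-utility choice; the distribution, beliefs, and utilities are invented purely for illustration.

```python
import math

# Shannon's core of "surprise", assuming i.i.d. symbols from a known distribution:
# surprisal(x) = -log2 p(x); entropy is the expected surprisal.
dist = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}   # assumed, for illustration

def surprisal(symbol):
    return -math.log2(dist[symbol])

entropy = sum(-p * math.log2(p) for p in dist.values())
print(surprisal("a"), surprisal("c"), entropy)          # 1.0, 3.0, 1.75 bits

# Bayesian decision theory's core of "rationality", assuming perfect deliberation:
# choose the action with the highest probability-weighted utility.
posterior = {"rain": 0.3, "sun": 0.7}                   # assumed beliefs
utility = {("umbrella", "rain"): 1, ("umbrella", "sun"): 0,
           ("no umbrella", "rain"): -5, ("no umbrella", "sun"): 2}
best = max(("umbrella", "no umbrella"),
           key=lambda a: sum(p * utility[a, w] for w, p in posterior.items()))
print(best)   # "umbrella": expected utility 0.3 beats "no umbrella" at -0.1
```

The approximations needed in practice are much messier than this, which is exactly the point made above: the simple cores are what you use to motivate and evaluate them.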
Two other related points are discussed in this presentation: first, that a digital computer is a nearly deterministic environment, which makes safety engineering easier for the stages before the AI starts trying to influence the world outside the computer; and second, that you can design an AI in such a way that you can tell what goal it will at least try to achieve, even if you don’t know what it will do to achieve that goal. Presumably, the better your formal understanding of what it would mean to “at least try to achieve a goal”, the better you would be at spotting, and designing around, situations that might make a given AI start trying to do something else.
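Here is a minimal sketch, with made-up state variables and actions, of what that separation might look like: the goal is an explicit, inspectable function, while the planner stands in for search machinery whose output you may not be able to predict. It is an illustration of the idea, not a proposal for a real architecture.

```python
# Toy separation of goal and planner: you may not be able to predict which
# plan comes out of the search, but you can read off what the system is
# scoring plans by.

def goal(state):
    """Explicit objective: more resources is better, but states flagged as
    violating a constraint are heavily penalized."""
    return state["resources"] - 1000 * state["constraint_violated"]

def opaque_planner(start, actions, objective):
    """Stand-in for arbitrarily complicated search; only `objective` tells
    you what it is trying to achieve."""
    return max(actions, key=lambda act: objective(act(start)))

# Hypothetical actions, invented for the example.
def mine(state):
    return {**state, "resources": state["resources"] + 10}

def cheat(state):
    return {**state, "resources": state["resources"] + 50,
            "constraint_violated": True}

start = {"resources": 0, "constraint_violated": False}
print(opaque_planner(start, [mine, cheat], goal).__name__)   # "mine"
```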
(Also: can you offer some feedback on which features of the site would have helped you realize sooner that there were arguments behind the positions you felt were being asserted blindly in a vacuum? The “things can be surprisingly formalizable; here are some examples” argument can be found in lukeprog’s “Open Problems Related to the Singularity” draft and the later “So You Want to Save the World”, though the argument there is very short, and its significance is hard to recognize if you don’t already know most of the mathematical formalisms mentioned. A backup “you shouldn’t just assume there’s no way to make this work” argument is in “Artificial Intelligence as a Positive and Negative Factor in Global Risk”, pp. 12-13.)
what will prevent them from becoming “bad guys” when they wield this much power
That’s a problem for which successful, practically applicable formalizations are harder to hope for, so it’s been harder for people to find things to say about it that pass the threshold of plausible conceptual progress rather than noisy verbal flailing. See the related “How can we ensure that a Friendly AI team will be sane enough?”. But it’s not as though people aren’t thinking about the problem.
This is actually one of the best comments I’ve seen on Less Wrong, especially this part:
Shannon information was discovered for the informal notion of surprise (with the assumption of independent identically distributed symbols from a known distribution). Bayesian decision theory was discovered for the informal notion of rationality (with assumptions like perfect deliberation and side-effect-free cognition). And Solomonoff induction was discovered for the informal notion of Occam’s razor (with assumptions like a halting oracle and a taken-for-granted choice of universal machine). These simple conceptual cores can then be used to motivate and evaluate less-simple approximations for situations where the assumptions about the decision-maker don’t perfectly apply. For the AI safety problem, the informal notions (for which the mathematical core descriptions would need to be discovered) would be a bit more complex—like the “how to figure out what my designers would want to do in this case” idea above.
Thanks for the clear explanation.

The idea is not “take an arbitrary superhuman AI and then verify it’s destined to be well behaved” but rather “develop a mathematical framework that allows you, from the ground up, to design a specific AI that will remain (provably) well behaved, even though you can’t, for arbitrary AIs, determine whether or not they’ll be well behaved.”
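One loose analogy for that distinction, sketched in Python with toy “programs” (the combinator names are invented for illustration): Rice’s theorem rules out a checker that decides a non-trivial semantic property for arbitrary programs, but a program assembled only from property-preserving building blocks carries the guarantee by construction.

```python
# Toy analogy (not a claim about real AI designs): we can't decide, for an
# arbitrary function, whether it always returns a non-negative number. But if
# we only build functions out of combinators that each preserve non-negativity,
# the property holds for the finished program by construction.

def const(c):
    assert c >= 0                      # non-negative by fiat
    return lambda x: c

def ident():
    return lambda x: x                 # non-negative whenever the input is

def add(f, g):
    return lambda x: f(x) + g(x)       # sum of non-negatives is non-negative

def scale(k, f):
    assert k >= 0
    return lambda x: k * f(x)          # non-negative scaling preserves the property

# Any composition of the above is non-negative on non-negative inputs --
# established by how it was built, not by inspecting it afterwards.
program = add(ident(), scale(3, const(2)))
print(program(5))   # 11, guaranteed >= 0 for any input x >= 0
```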
I think this comment is disingenuous, given your statements that the extinction of humanity is inevitable, and given that your website uses evil AI imagery. http://lesswrong.com/lw/b5i/a_primer_on_risks_from_ai/64dq

Whether the individual in question has other motivations doesn’t by itself make the questions raised any less valid.

It could be evidence that the questioner isn’t worth engaging, because the conversation is unlikely to be productive. The questioner might have significantly motivated cognition or have written the bottom line.
On the contrary, adversarial questioners are often highly productive. I’ve already incited one of the best comments you’ve seen on LessWrong, haven’t I?
Yes, my cognition is significantly motivated along these lines. Doesn’t Hitler deserve some of the credit for the rapid development of computers and nuclear bombs? Perhaps I or someone like me will play a similar role in the development of AI?
On the contrary, adversarial questioners are often highly productive. I’ve already incited one of the best comments you’ve seen on LessWrong, haven’t I?
Don’t take too much credit. Steve_Rayhawk generated the comment by actively trying to help. But if his goal was to engage you in thoughtful and productive discussion, he probably failed, and it was probably a waste of his time to try. There happened to be this positive externality of an excellent comment—but that’s the kind of thing that’s generated as a result of doing your best to understand a complex issue, not adversarially mucking up the conversation about it.
Yes, my cognition is significantly motivated along these lines. Doesn’t Hitler deserve some of the credit for the rapid development of computers and nuclear bombs? Perhaps I or someone like me will play a similar role in the development of AI?
Somehow I doubt that’s the true cause of your behavior, but I’d be delighted to find out that I’m wrong.

Yup. And there’s an established pattern on Less Wrong of consistently downvoted users that just get more and more trolly over time and end up getting banned, but waste a lot of everyone’s time until it happens. Good thing AlphaOmega already Godwin’d themselves.