You don’t need to let the superintelligence give you an entire mind; you can split off subproblems for which there is a secure proof procedure, and give it those.
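To make “secure proof procedure” concrete: a minimal sketch, assuming the subproblem is posed so that the answer comes with a short, machine-checkable certificate, and the only code ever run on the trusted side is the checker. The SAT example, the names, and the formula below are purely illustrative, not anything proposed in this thread.

```python
# Illustrative sketch only: the untrusted, boxed AI is asked a question whose
# answer is a short certificate, and the trusted side runs nothing but a dumb
# verifier over that certificate.

from typing import Dict, List

Clause = List[int]            # e.g. [1, -2, 3] means (x1 OR NOT x2 OR x3)
Assignment = Dict[int, bool]  # variable index -> truth value

def verify_certificate(clauses: List[Clause], assignment: Assignment) -> bool:
    """Accept the AI's answer only if every clause is satisfied by the assignment."""
    return all(
        any(assignment.get(abs(lit), False) == (lit > 0) for lit in clause)
        for clause in clauses
    )

# The formula is chosen by us; the assignment is the boxed AI's (untrusted) output.
formula = [[1, -2], [-1, 2], [2, 3]]
claimed = {1: True, 2: True, 3: False}   # stand-in for whatever the AI returns
print(verify_certificate(formula, claimed))  # True -> accept the answer, else discard
```

The point of the sketch is only that the trusted code never has to understand how the answer was found, just check it, which is why a channel of this shape leaves so little room for influence.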
The most general form is that of a textbook. Have the AI generate a textbook that teaches you Friendliness theory, with exercises and all, so that you’ll be able to construct a FAI on your own (with a team, etc.), fully understanding why and how it’s supposed to work. We all know how secure human psychology is and how it’s totally impossible to be fooled, especially for multiple people simultaneously, even by superintelligences that have a motive to deceive you.
The point of this exercise was to allow the unfriendly AI access not even to the most limited communication it might be able to accomplish using byproducts of computation. Having it produce an unconstrained textbook which a human actually reads is clearly absurd, but I’m not sure what you are trying to mock.
Do you think it is impossible to formulate any question which the AI can answer without exercising undue influence? It seems hard, but I don’t see why you would ridicule the possibility.
In the first part of the grandparent, I attempted to strengthen the method. I don’t think automatic checking of a generated AI is anywhere near feasible, so there is little point in discussing that. But a textbook is clearly feasible as a means of communicating a solution to a philosophical problem. And then there are all the obvious, old-hat failure modes resulting from the textbook, if only it could be constructed without the possibility of causing psychological damage. Maybe we can even improve on that by having several “generations” of researchers write a textbook for the next one, so that any brittle manipulation patterns contained in the original break.
I’m sorry, I misunderstood your intention completely, probably because of the italics :)
I personally am paranoid enough about ambivalent transhumans that I would be very afraid of giving them such a powerful channel to the outside world, even if they had to pass through many levels of iterative improvement (if I could make a textbook that hijacked your mind, then I could just have you write a similarly destructive textbook to the next researcher, etc.).
I think it is possible to exploit an unfriendly boxed AI to bootstrap to friendliness in a theoretically safe way. Minimally, the problem of doing it safely is very interesting and difficult, and I think I have a solid first step to a solution.
It’s an attractive idea for science fiction, but I think no matter how super-intelligent and unfriendly, an AI would be unable to produce some kind of mind-destroying grimoire. I just don’t think text and diagrams on a printed page, read slowly in the usual way, would have the bandwidth to rapidly and reliably “hack” any human. Needless to say, you would proceed with caution just in case.
If you don’t mind hurting a few volunteers to defend humanity from a much bigger threat, it should be fairly easy to detect, quarantine, and possibly treat the damaged ones. They, after all, would only be ordinarily intelligent. Super-cunning existential-risk malevolence isn’t transitive.
I think a textbook full of sensory exploit hacks would be pretty valuable data in itself, but maybe I’m not a completely friendly natural intelligence ;-)
edit: Oh, I may have missed the point that of course you couldn’t trust the methods in the textbook for constructing FAI even if it itself posed no direct danger. Agreed.
no matter how super-intelligent and unfriendly, an AI would be unable to produce some kind of mind-destroying grimoire.
Consider that humans can and have made such grimoires; they’re called bibles. All it takes is a nonrational but sufficiently appealing idea, and an imperfect rationalist falls to it. If there’s a real hole in the textbook’s material, such that following it produces an unfriendly AI instead of a friendly one, and the AI who wrote the textbook handwaved that hole away, how confident are you that you would spot the best hand-waving ever written?
Not confident at all. In fact I have seen no evidence for the possibility, even in principle, of provably friendly AI. And if there were such evidence, I wouldn’t be able to understand it well enough to evaluate it.
In fact I wouldn’t trust such a textbook even written by human experts whose motives I trusted. The problem isn’t proving the theorems, it’s choosing the axioms.
You don’t need to let the superintelligence give you an entire mind; you can split off subproblems for which there is a secure proof procedure, and give it those.
Can you think of any suitable candidates—apart from perhaps an oracle?
Subproblems like that seem likely to be rare.