a certifiably friendly AI: a class of optimization processes whose behavior we can automatically verify will be friendly
The probability I assign to achieving a capability state where it is (1) possible to prove a mind Friendly even if it has been constructed by a hostile superintelligence, (2) possible to build a hostile superintelligence, and (3) not possible to build a Friendly AI directly, is very low.
In particular, the sort of proof techniques I currently have in mind (what they prove, and what it means) for ensuring Friendliness through a billion self-modifications of something that started out relatively stupid and was built by relatively trusted humans, would not work for verifying the Friendliness of a finished AI that was handed to you by a hostile superintelligence, and it seems to me that the required proof techniques for that would have to be considerably stronger.
To paraphrase Mayor Daley, the proof techniques are there to preserve the Friendly intent of the programmers through the process of constructing the AI and through the AI’s self-modification, not to create Friendliness. People hear the word “prove” and assume that this is because you don’t trust the programmers, or because you have a psychological need for unobtainable absolute certainty. No, it’s because if you don’t prove certain things (and have the AI prove certain things before each self-modification) then you can’t build a Friendly AI no matter how good your intentions are. The good intentions of the programmers are still necessary, and assumed, beyond the parts that are proved; and doing the proof work doesn’t make the whole process absolutely certain, but if you don’t strengthen certain parts of the process using logical proof then you are guaranteed to fail. (This failure is knowable to a competent AGI scientist—not with absolute certainty, but with high probability—and therefore it is something of which a number of would-be dabblers in AGI maintain a careful ignorance regardless of how you try to explain it to them, because the techniques that make them enthusiastic don’t support that sort of proof. “It is hard to explain something to someone whose job depends on not understanding it.”)
The probability I assign to being able to build a friendly AI directly before being able to build a hostile AI is very low. You have thought more about the problem, but I’m not really convinced. I guess we can both be right concurrently, and then we are in trouble.
I will say that I think you underestimate how powerful allowing a superintelligence to write a proof for you is. The question is not really whether you have proof techniques to verify friendliness. It is whether you have a formal language expressive enough to describe friendliness in which a transhuman can find a proof. Maybe that is just as hard as the original problem, because even formally articulating friendliness is incredibly difficult.
Usually, verifying a proof is considerably easier than finding one—and it doesn’t seem at all unreasonable to use a machine to find a proof—if you are looking for one.
But what if the hostile superintelligence handed you a finished AI together with a purported proof of its Friendliness? Would you have enough trust in the soundness of your proof system to check the purported proof and act on the results of that check?
That would then be something you’d have to read and likely show to dozens of other people to verify reliably, leaving opportunities for all kinds of mind-hacks. The OP’s proposal requires us to have an automatic verifier ready to run, one that can return a verdict reliably without human intervention.
Actually, computers can mechanically check proofs in any formal system.
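For concreteness, here is a minimal sketch (in Python) of what mechanical checking means here, using a toy Hilbert-style propositional system: axiom schemas K and S plus modus ponens. All of the names and encodings below are illustrative, not anyone’s actual proposal; the point is only that the checker verifies one justified line at a time and never searches for a proof itself.

```python
# Toy mechanical proof checker: formulas are strings (atoms) or ("->", A, B)
# tuples; a proof is a list of (formula, justification) lines. The checker
# only verifies the certificate it is handed -- it never does proof search.

def implies(a, b):
    return ("->", a, b)

# Axiom schemas, instantiated from the metavariables given in the justification.
SCHEMAS = {
    "K": lambda a, b: implies(a, implies(b, a)),
    "S": lambda a, b, c: implies(implies(a, implies(b, c)),
                                 implies(implies(a, b), implies(a, c))),
}

def check_proof(lines, goal, hypotheses=()):
    """Justifications:
       ("hyp",)              -- formula is one of the hypotheses
       ("axiom", name, args) -- formula is that instance of the named schema
       ("mp", i, j)          -- earlier line j is (line i -> formula)
       Returns True iff every line checks and the last line is the goal."""
    proved = []
    for formula, just in lines:
        kind = just[0]
        if kind == "hyp":
            ok = formula in hypotheses
        elif kind == "axiom":
            _, name, args = just
            ok = SCHEMAS[name](*args) == formula
        elif kind == "mp":
            _, i, j = just
            ok = (0 <= i < len(proved) and 0 <= j < len(proved)
                  and proved[j] == implies(proved[i], formula))
        else:
            ok = False
        if not ok:
            return False
        proved.append(formula)
    return bool(proved) and proved[-1] == goal

# Example: from hypothesis p, derive q -> p.
proof = [
    ("p", ("hyp",)),
    (implies("p", implies("q", "p")), ("axiom", "K", ("p", "q"))),
    (implies("q", "p"), ("mp", 0, 1)),
]
assert check_proof(proof, implies("q", "p"), hypotheses=("p",))
```

Checking is linear in the length of the certificate, which is the asymmetry the surrounding comments rely on: producing the proof is the hard part; verifying it is cheap.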
Is there something missing from the parent? It does not seem to parse.
Yes, edited. Thanks.
And upvoted. :)
Yes, but the point is that the automatic verifier gets to verify a proof that the AI-in-the-box produced—it doesn’t have to examine an arbitrary program and try to prove friendliness from scratch.
In a comment below, paulfchristiano makes the point that any theory of friendliness at all should give us such a proof system, for some restricted class of programs. For example, Eliezer envisions a theory about how to let programs evolve without losing friendliness. The corresponding class of proofs has the form “the program under consideration can be derived from the known-friendly program X by the sequence Y of friendliness-preserving transformations”.
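A hedged sketch of that restricted proof form, under the assumption that such a whitelist of friendliness-preserving transformations existed: the certificate is the known-friendly seed X plus the step sequence Y, and the verifier simply replays the derivation and compares. The rule names, the string representation of programs, and the transformations themselves are placeholders, not a real Friendliness theory.

```python
# Replay-style verifier: accept a candidate program only if it is reachable
# from a trusted seed via whitelisted transformation steps.

from typing import Callable, Dict, List

Program = str  # stand-in for whatever representation programs actually use

# Whitelist of transformations assumed (not shown) to preserve friendliness.
PRESERVING: Dict[str, Callable[[Program, dict], Program]] = {
    "inline_constant": lambda p, args: p.replace(args["name"], args["value"]),
    "append_comment":  lambda p, args: p + "\n# " + args["text"],
}

def verify_derivation(seed: Program,
                      steps: List[dict],
                      candidate: Program) -> bool:
    """Replay each whitelisted step from the seed; accept only if the
    replayed result is byte-for-byte the candidate program."""
    current = seed
    for step in steps:
        rule = PRESERVING.get(step["rule"])
        if rule is None:          # unknown rule: reject outright
            return False
        current = rule(current, step["args"])
    return current == candidate

# Usage: the boxed AI hands back (steps, candidate); `seed` is already trusted.
seed = "LIMIT = N\nrun(LIMIT)"
steps = [{"rule": "inline_constant", "args": {"name": "N", "value": "1000"}}]
candidate = "LIMIT = 1000\nrun(LIMIT)"
assert verify_derivation(seed, steps, candidate)
```

Note that the verifier rejects anything it cannot replay exactly, so its failure mode is false negatives rather than false positives, which is the direction of error the rest of the thread treats as acceptable.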
A general theory of quarantines would nevertheless be useful.
Moral value can manipulate your concerns, even as you prevent causal influence. Maybe the AI will create extraordinary people in its mind, and use that as leverage to work on weak points of your defense. It’s just too difficult, you are bound to miss something. The winning move is not to play.
I disagree. The weak point of the scheme is the friendliness test, not the quarantine. If I prove the quarantine scheme will work, then it will work unless my computational assumptions are incorrect. If I prove it will work without assumptions, it will work without assumptions.
If you think that an AI can manipulate our moral values without ever getting to say anything to us, then that is a different story. This danger occurs even before putting an AI in a box though, and in fact even before the design of AI becomes possible. This scheme does nothing to exacerbate that danger.
With a few seconds of thought it is easy to see how this is possible even without caring about imaginary people. This is a question of cooperation among humans.
This is a good point too, although I may not go as far as saying it does nothing to exacerbate the danger. The increased tangibility matters.
I think that running an AI in this way is no worse than simply having the code of an AGI exist. I agree that just having the code sitting around is probably dangerous.
Nod, in terms of direct danger the two cases aren’t much different. The difference in risk is only due to the psychological impact on our fellow humans. The Pascal’s Commons becomes that much more salient to them. (Yes, I did just make that term up. The implications of the combination are clear, I hope.)
Separate “let’s develop a theory of quarantines” from “let’s implement some quarantines.”
Christiano should take it as a compliment that his idea is formal enough that one could imagine proving that it doesn’t work. Other than that, I don’t see why your remark should go for “quarantining an AI using cryptography” and not “creating a friendly AI.”
Prove it. Prove it by developing a theory of quarantines.
I agree.
Sociopathic guardians would solve that one particular problem (and bring others, of course, though perhaps more easily countered ones).
You are parrying my example, but not the pattern it exemplifies (not speaking of the larger pattern of the point I’m arguing for). If certain people are insensitive to this particular kind of moral argument, they are still bound to be sensitive to some moral arguments. Maybe the AI will generate recipes for extraordinarily tasty foods for your sociopaths, or get-rich-quick schemes that actually work, or magically beautiful music.
Indeed. The more thorough solution would seem to be “find a guardian possessing such a utility function that the AI has nothing to offer them that you can’t trump with a counter-offer”. The existence of such guardians would depend on the upper estimates of the AI’s capabilities and on their employer’s means, and would be subject to your ability to correctly assess a candidate’s utility function.
Very rarely is the winning move not to play.
It seems especially unlikely to be the case if you are trying to build a prison.
For what?
The OP framed the scenario in terms of directing the AI to design a FAI, but the technique is more general. It’s possibly safe for all problems with a verifiable solution.
People I don’t trust but don’t want to kill (or modify to cripple). A non-compliant transhuman with self-modification ability may not be able to out-compete an FAI, but if it is not quarantined it could force the FAI to burn resources to maintain dominance.
But it is something we can let the FAI build for us.
At what point does a transhuman become posthuman?
Shrug. For the purposes here they could be called froogles for all I care. The quarantine could occur in either stage depending on the preferences being implemented.
You mean posthuman?
Could you elaborate on this? Your mere assertion is enough to make me much less confident than I was when I posted this comment. But I would be interested in a more object-level argument. (The fact that your own approach to building an FAI wouldn’t pass through such a stage doesn’t seem like enough to drive the probability “very low”.)
The FAI theory required to build a proof for (1) would have to be very versatile; you have to understand friendliness very well to do it. (2) requires some understanding of the nature of intelligence (especially if we know well enough that we’re building a superintelligence to put it in a box). If you understand friendliness that well, and intelligence that well, then friendly intelligence should be easy.
We never built space suits for horses, because by the time we figured out how to get to the moon, we also figured out electric rovers.
Eliezer has spent years making the case that FAI is far, far, far more specific than AI. A theory of intelligence adequate to building an AI could still be very far from a theory of Friendliness adequate to building an FAI, couldn’t it?
So, suppose that we know how to build an AI, but we’re smart enough not to build one until we have a theory of Friendliness. You seem to be saying that, at this point, we should consider the problem of constructing a certifier of Friendliness to be essentially no easier than constructing FAI source code. Why? What is the argument for thinking that FAI is very likely to be one of those problems where certifying a solution is no easier than solving from scratch?
This doesn’t seem analogous at all. It’s hard to imagine how we could have developed the technology to get to the moon without having built electric land vehicles along the way. I hope that I’m not indulging in too much hindsight bias when I say that, conditioned on our getting to the moon, our getting to electric-rover technology first was very highly probable. No one had to take special care to make sure that the technologies were developed in that order.
But, if I understand Eliezer’s position correctly, we could easily solve the problem of AGI while still being very far from a theory of Friendliness. That is the scenario that he has dedicated his life to avoiding, isn’t it?
You don’t need to let the superintelligence give you an entire mind; you can split off subproblems for which there is a secure proof procedure, and give it those.
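A toy illustration of the verifiable-subproblem idea, assuming the subproblem can be phrased so that its answer is mechanically checkable. Factoring is the standard example: finding the factors may be hard, but checking a claimed factorization is just primality tests plus a multiplication, and only a single accept/reject bit needs to leave the box. The numbers and function names are illustrative; nothing about Friendliness itself is being claimed.

```python
# Verify a claimed factorization without trusting the party that produced it.

def is_prime(m: int) -> bool:
    """Trial division; adequate for a toy example."""
    if m < 2:
        return False
    d = 2
    while d * d <= m:
        if m % d == 0:
            return False
        d += 1
    return True

def verify_factorization(n: int, claimed_factors: list[int]) -> bool:
    """Accept only if every claimed factor is prime and the product equals n."""
    product = 1
    for f in claimed_factors:
        if not is_prime(f):
            return False
        product *= f
    return product == n

# The untrusted answer is reduced to a single accept/reject bit.
print(verify_factorization(3 * 5 * 7 * 11, [3, 5, 7, 11]))  # True
print(verify_factorization(3 * 5 * 7 * 11, [3, 5, 77]))     # False: 77 is not prime
```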
The most general form is that of a textbook. Have the AI generate a textbook that teaches you Friendliness theory, with exercises and all, so that you’ll be able to construct a FAI on your own (with a team, etc.), fully understanding why and how it’s supposed to work. We all know how secure human psychology is and how it’s totally impossible to be fooled, especially for multiple people simultaneously, even by superintelligences that have a motive to deceive you.
The point of this exercise was to deny the unfriendly AI even the most limited communication it might be able to accomplish using byproducts of computation. Having it produce an unconstrained textbook which a human actually reads is clearly absurd, but I’m not sure what you are trying to mock.
Do you think it is impossible to formulate any question which the AI can answer without exercising undue influence? It seems hard, but I don’t see why you would ridicule the possibility.
In the first part of the grandparent, I attempted to strengthen the method. I don’t think automatic checking of a generated AI is anywhere near feasible, and so there is little point in discussing that. But a textbook is clearly feasible as a means of communicating a solution to a philosophical problem. And then there are all those obvious old-hat failure modes resulting from the textbook, if only it could be constructed without the possibility of causing psychological damage. Maybe we can even improve on that, by having several “generations” of researchers write a textbook for the next one, so that any brittle manipulation patterns contained in the original break.
I’m sorry, I misunderstood your intention completely, probably because of the italics :)
I personally am paranoid enough about ambivalent transhumans that I would be very afraid of giving them such a powerful channel to the outside world, even if they had to pass through many levels of iterative improvement (if I could make a textbook that hijacked your mind, then I could just have you write a similarly destructive textbook to the next researcher, etc.).
I think it is possible to exploit an unfriendly boxed AI to bootstrap to friendliness in a theoretically safe way. Minimally, the problem of doing it safely is very interesting and difficult, and I think I have a solid first step to a solution.
It’s an attractive idea for science fiction, but I think no matter how super-intelligent and unfriendly, an AI would be unable to produce some kind of mind-destroying grimoire. I just don’t think text and diagrams on a printed page, read slowly in the usual way, would have the bandwidth to rapidly and reliably “hack” any human. Needless to say, you would proceed with caution just in case.
If you don’t mind hurting a few volunteers to defend humanity from a much bigger threat, it should be fairly easy to detect, quarantine, and possibly treat the damaged ones. They, after all, would only be ordinarily intelligent. Super-cunning existential-risk malevolence isn’t transitive.
I think a textbook full of sensory exploit hacks would be pretty valuable data in itself, but maybe I’m not a completely friendly natural intelligence ;-)
edit: Oh, I may have missed the point that of course you couldn’t trust the methods in the textbook for constructing FAI even if it itself posed no direct danger. Agreed.
Consider that humans can and have made such grimoires; they call them bibles. All it takes is a nonrational but sufficiently appealing idea and an imperfect rationalist falls to it. If there’s a true hole in the textbook’s information, such that it produces unfriendly AI instead of friendly, and the AI who wrote the textbook handwaved that hole away, how confident are you that you would spot the best hand-waving ever written?
Not confident at all. In fact I have seen no evidence for the possibility, even in principle, of provably friendly AI. And if there were such evidence, I wouldn’t be able to understand it well enough to evaluate it.
In fact I wouldn’t trust such a textbook even written by human experts whose motives I trusted. The problem isn’t proving the theorems, it’s choosing the axioms.
Can you think of any suitable candidates, apart from perhaps an oracle?
Subproblems like that seem likely to be rare.
Does this still hold if you remove the word “hostile”, i.e., if the “friendliness” of the superintelligence you construct first is simply not known?
This quote applies to you and your approach to AI boxing ; )
AI Boxing is a potentially useful approach, if one accepts that:
a) AGI is easier than FAI
b) Verification of “proof of friendliness” is easier than its production
c) AI Boxing is possible
As far as I can tell, you agree with a) and b). Please take care that your views on c) are not clouded by the status you have invested in the AI Box Experiment … of 8 years ago.
“Human understanding progresses through small problems solved conclusively, once and forever”—cousin_it, on LessWrong.
b) is false.
Heh. I’m afraid AIs of “unknown” motivations are known to be hostile from a human perspective. See Omohundro on the Basic AI Drives, and the Fragility of Value supersequence on LW.
This is… unexpected. Mathematical theory, not to mention history, seems to be on my side here. How did you arrive at this conclusion?
Point taken. I was thinking of “unknown” in the sense of “designed with aim of friendliness, but not yet proven to be friendly”, but worded it badly.
The point is that it may be possible to design a heuristically friendly AI which, if friendly, will remain just as friendly after changing itself, without having any infallible way to recognize a friendly AI (in particular, it’s bad if your screening has any false positives at all, since you have a transhuman looking for pathologically bad cases).
To recap, (b) was: ‘Verification of “proof of friendliness” is easier than its production’.
For that to work as a plan in context, the verification doesn’t have to be infallible. It just needs not to have false positives. False negatives are fine—i.e. if a good machine is rejected, that isn’t the end of the world.
You don’t seem to want to say anything about how you are so confident. Can you say something about why you don’t want to give an argument for your confidence? Is it just too obvious to bother explaining? Or is there too large an inferential distance even with LW readers?
Tried writing a paragraph or two of explanation, gave it up as too large a chunk. It also feels to me like I’ve explained this three or four times previously, but I can’t remember exactly where.
If anyone can find it, please post! It seems to me to be contrary to Einstein’s Arrogance, so I’m interested to see why it’s not.
I think I understand basically your objections, at least in outline. I think there is some significant epistemic probability that they are wrong, but even if they are correct I don’t think it at all rules out the possibility that a boxed unfriendly AI can give you a really friendly AI. My most recent post takes the first steps towards doing this in a way that you might believe.
I’m not sure this is what D_Alex meant; however, a generous interpretation of ‘unknown friendliness’ could be that confidence in the friendliness of the AI is less than considered necessary. For example, if there is an 80% chance that the unknown AI is friendly, and b) and c) are both counter-factually assumed to be true....
(Obviously the ‘80%’ necessary could be different depending on how much of the uncertainty is due to pessimism regarding possible limits of your own judgement, and also on your level of desperation at the time...)
Eliezer’s comment does not seem to mention the obvious difficulties with c) at all. In fact, in the very part you chose to quote...
… it is b) that is implicitly the weakest link, with some potential deprecation of a) as well. c) is outright excluded from hypothetical consideration.
I expect that if there were an actual demonstration of any such thing, more people would pay attention.
As things stand, people are likely to look at your comment, concoct a contrary scenario—where they don’t have a proof, but their superintelligent offspring subsequently creates one at their request—and conclude that you were being overconfident.