The probability I assign to achieving a capability state where it is (1) possible to prove a mind Friendly even if it has been constructed by a hostile superintelligence, (2) possible to build a hostile superintelligence, and (3) not possible to build a Friendly AI directly, is very low.
Does this still hold if you remove the word “hostile”, i.e., if the “friendliness” of the superintelligence you construct first is simply not known?
“It is hard to explain something to someone whose job depends on not understanding it.”
This quote applies to you and your approach to AI boxing ;)
AI Boxing is a potentially useful approach, if one accepts that:
a) AGI is easier than FAI
b) Verification of “proof of friendliness” is easier than its production
c) AI Boxing is possible
As far as I can tell, you agree with a) and b). Please take care that your views on c) are not clouded by the status you have invested in the AI Box Experiment … of 8 years ago.
“Human understanding progresses through small problems solved conclusively, once and forever”—cousin_it, on LessWrong.
Does this still hold if you remove the word “hostile”, i.e., if the “friendliness” of the superintelligence you construct first is simply not known?
Heh. I’m afraid AIs of “unknown” motivations are known to be hostile from a human perspective. See Omohundro on the Basic AI Drives, and the Fragility of Value supersequence on LW.
The point is that it may be possible to design a heuristically friendly AI which, if friendly, will remain just as friendly after changing itself, without having any infallible way to recognize a friendly AI (in particular, it’s bad if your screening has any false positives at all, since you have a transhuman looking for pathologically bad cases).
To recap, (b) was: ‘Verification of “proof of friendliness” is easier than its production’.
For that to work as a plan in context, the verification doesn’t have to be infallible. It just needs not to have false positives. False negatives are fine—i.e. if a good machine is rejected, that isn’t the end of the world.
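One toy way to see how demanding “no false positives” is in this context: if the proposer is adversarial, any exploitable false-positive rate at all gets multiplied across as many candidate designs as the adversary cares to submit, while a false negative only costs it one candidate. The sketch below is a minimal illustration, not anyone’s actual proposal; the error rate, the random-error model, and the brute-force adversary are all made-up assumptions.

```python
# Toy model of the "no false positives" requirement (illustrative assumptions only).
import random

random.seed(0)

FALSE_POSITIVE_RATE = 1e-4  # assumed chance the screen wrongly accepts an unfriendly design


def verifier_accepts(design_is_friendly: bool) -> bool:
    """Toy screen: always accepts friendly designs, and wrongly accepts
    unfriendly ones with probability FALSE_POSITIVE_RATE (a false positive)."""
    if design_is_friendly:
        return True
    return random.random() < FALSE_POSITIVE_RATE


def attempts_until_unfriendly_accepted(max_attempts: int = 1_000_000) -> int:
    """Model the hostile proposer as brute-force search: it keeps submitting
    unfriendly designs until one slips past the screen."""
    for attempt in range(1, max_attempts + 1):
        if verifier_accepts(design_is_friendly=False):
            return attempt
    return -1  # none accepted within the budget


print(attempts_until_unfriendly_accepted())
# With a 1e-4 false-positive rate, roughly 10,000 submissions suffice on average,
# whereas a false negative merely discards one candidate.
```

The asymmetry is the whole point of the recap above: false negatives only waste the gatekeeper’s time, while false positives get actively hunted for.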
You don’t seem to want to say anything about how you are so confident. Can you say something about why you don’t want to give an argument for your confidence? Is it just too obvious to bother explaining? Or is there too large an inferential distance even with LW readers?
Tried writing a paragraph or two of explanation, gave it up as too large a chunk. It also feels to me like I’ve explained this three or four times previously, but I can’t remember exactly where.
I think I basically understand your objections, at least in outline. I think there is some significant epistemic probability that they are wrong, but even if they are correct, I don’t think it at all rules out the possibility that a boxed unfriendly AI can give you a really friendly AI. My most recent post takes the first steps towards doing this in a way that you might believe.
Heh. I’m afraid AIs of “unknown” motivations are known to be hostile from a human perspective.
I’m not sure this is what D_Alex meant; however, a generous interpretation of ‘unknown friendliness’ could be that confidence in the friendliness of the AI is lower than what is considered necessary. For example, if there is an 80% chance that the unknown AI is friendly, and b) and c) are both counterfactually assumed to be true...
(Obviously the ‘80%’ necessary could be different depending on how much of the uncertainty is due to pessimism regarding the possible limits of your own judgement, and also on your level of desperation at the time...)
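To make the arithmetic behind that ‘80%’ concrete: under the purely illustrative assumption that the screen’s errors are random (which, per the exchange above, they are not against an adversarial designer), the confidence you end up with after a design is accepted is just the Bayesian posterior. The prior and error rates below are invented numbers for the sake of the example.

```python
def confidence_after_acceptance(prior_friendly: float,
                                false_positive_rate: float,
                                false_negative_rate: float) -> float:
    """Posterior P(friendly | screen accepted), by Bayes' rule."""
    p_accept_given_friendly = 1.0 - false_negative_rate
    p_accept_given_unfriendly = false_positive_rate
    p_accept = (prior_friendly * p_accept_given_friendly
                + (1.0 - prior_friendly) * p_accept_given_unfriendly)
    return prior_friendly * p_accept_given_friendly / p_accept


# An 80% prior plus a fairly weak screen (1% false positives, 20% false negatives)
# already yields ~99.7% confidence in an accepted design...
print(confidence_after_acceptance(0.8, 0.01, 0.2))  # ≈ 0.9969
# ...but only if the false positives are random. If a transhuman designer is
# searching for the screen's blind spots, the effective false-positive rate is
# whatever it can engineer, and this calculation no longer applies.
```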
a) AGI is easier than FAI b) Verification of “proof of friendliness” is easier than its production c) AI Boxing is possible
As far as I can tell, you agree with a) and b).
Eliezer’s comment does not seem to mention the obvious difficulties with c) at all. In fact, in the very part you choose to quote...
The probability I assign to achieving a capability state where it is (1) possible to prove a mind Friendly even if it has been constructed by a hostile superintelligence, (2) possible to build a hostile superintelligence, and (3) not possible to build a Friendly AI directly, is very low.
… it is b) that is implicitly the weakest link, with some potential deprecation of a) as well. c) is outright excluded from hypothetical consideration.
B is false.
This is… unexpected. Mathematical theory, not to mention history, seems to be on my side here. How did you arrive at this conclusion?
Heh. I’m afraid AIs of “unknown” motivations are known to be hostile from a human perspective. See Omohundro on the Basic AI Drives, and the Fragility of Value supersequence on LW.
Point taken. I was thinking of “unknown” in the sense of “designed with the aim of friendliness, but not yet proven to be friendly”, but worded it badly.
Tried writing a paragraph or two of explanation, gave it up as too large a chunk. It also feels to me like I’ve explained this three or four times previously, but I can’t remember exactly where.
If anyone can find it, please post! It seems to me to be contrary to Einstein’s Arrogance, so I’m interested to see why it’s not.