Playing around with taboos, I think I might have come up with a short yet unambiguous definition of friendliness.
“A machine whose historical consequences, if compiled into a countable number of single-subject paragraphs and communicated, one paragraph at a time, to any human randomly selected from those alive at any time prior to the machine’s activation, would cause that human’s response (on a numerical scale representing approval or disapproval of the described events) to approach complete approval (as a limit) as the number of paragraphs thus communicated increases.”
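To pin down the limit clause (my notation; a sketch of the intended reading rather than part of the definition): write H for the set of all humans alive at any time before the machine’s activation, p_1, p_2, … for the single-subject paragraphs describing its historical consequences, and a_h(p_1, …, p_n) for human h’s approval, on a scale from 0 to 1, after hearing the first n paragraphs. The requirement is then

    \forall h \in H: \quad \lim_{n \to \infty} a_h(p_1, \dots, p_n) = 1

with 1 standing for complete approval.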
Not a particularly practical definition, since testing it for an actual, implemented AGI would require at least one perfectly unbiased causality-violating journalist, but as far as I can tell it makes no reference to totally mysterious cognitive processes. Compiling actual events into a text narrative is still a black box, but strikes me as more tractable than something like ‘wisdom,’ since the work of historical scholars is open to analysis.
I’m probably missing something important. Could someone please point it out?
Human nature is more complicated by far than anyone’s conscious understanding of it. We might not know that the future was missing something essential, if the omission were subtle enough. Your journalist ex machina might not even be able to communicate to us exactly what was missing, in a way that we could understand at our current level of intelligence.
You roll a 16...
A clarification: if even one human is ever found, out of the approx. 10^11 who have ever lived (to say nothing of multiple samples from the same human’s life), who would persist in disapproval of the future-history, the machine does not qualify.
You roll a 19 :-)
I don’t think any machine could qualify. You’re requiring every human’s response to approach complete approval, and people’s preferences are too different.
Even without needing a unanimous verdict, I don’t think Everyone Who’s Ever Lived would make a good jury for this case.
Granting that it’s possible, would you agree that any machine capable of satisfying such a rigorous standard would necessarily be Friendly?
It would be persuasive, and thus more likely to be friendly than an AI that doesn’t even concern itself with humans enough to bother persuading, but less likely than an AI that strove for genuine understanding of the truth in humans, which in this particular test (only an approximation) would mean certain failure.
I’m fairly certain that creating a future which would persuade everyone just by being reported honestly requires genuine understanding, or something functionally indistinguishable therefrom.
The machine in question doesn’t actually need to be able to persuade, or, for that matter, communicate with humans in any capacity. The historical summary is compiled, and the pass/fail evaluation conducted, by an impartial observer outside the relevant timeline—which, as I said, makes literal application of this test at the very least hopelessly impractical, maybe physically impossible.
Your definition didn’t include “honestly”, and it didn’t even vaguely imply neutral or unbiased reporting.
You never mentioned that in your definition. And defining an impartial observer seems to be a problem of comparable magnitude to defining friendliness in the first place. With a genuinely impartial observer who does not attempt to persuade, there is no possibility of any future passing the test.
I referred to a compilation of all the machine’s historical consequences—in short, a map of its entire future light cone—in text form, possibly involving a countably infinite number of paragraphs. Did you assume that I was referring to a progress report compiled by the machine itself, or some other entity motivated to distort, obfuscate, and/or falsify?
I think you’re assuming people are harder to satisfy than they really are. A lot of people would be satisfied with (strictly truthful) statements along the lines of “While The Machine is active, neither you nor any of your allies or descendants suffer due to malnutrition, disease, injury, overwork, or torment by supernatural beings in the afterlife.” Someone like David Icke? “Shortly after The Machine’s activation, no malevolent reptilians capable of humanoid disguise are alive on or near the Earth, nor do any arrive thereafter.”
I don’t mean to imply that the ‘approval survey’ process even involves cherrypicking the facts that would please a particular audience. An ideal Friendly AI would set up a situation that has something for everyone, without deal-breakers for anyone, and that looks impossible to us for the same reason a skyscraper looks impossible to termites.
Then again, some kinds of skyscrapers actually are impossible. If it turns out that satisfying everyone ever, or even pleasing half of them without enraging or horrifying the other half, is a literal, logical impossibility, degrees and percentages of satisfaction could still be a basis for comparison. It’s easier to shut up and multiply when actual numbers are involved.
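If it does come down to comparing degrees of satisfaction, a ranking as crude as the following sketch would suffice; approval(person, proposal) is a hypothetical oracle returning a number between 0 and 1, and every name here is invented for illustration:

    def compare_proposals(proposals, interviewees, approval):
        """Rank candidate futures: maximize the worst-off person's approval,
        breaking ties on the average. `approval(person, proposal)` is assumed
        to return a number in [0, 1]."""
        def score(proposal):
            ratings = [approval(person, proposal) for person in interviewees]
            return (min(ratings), sum(ratings) / len(ratings))
        return sorted(proposals, key=score, reverse=True)

Whether the worst-off rating, the average, or something else entirely is the right thing to maximize is, of course, where most of the real argument would be.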
No; rather, that the AI would necessarily end up distorting, obfuscating, or falsifying the report itself, if friendliness were its super-goal and your paragraph the definition of friendliness.
What would a future that a genuine racist would be satisfied with look like? Would there be gay marriage in that future? Would sinners burn in hell? Remember, no attempts at persuasion, so the racist won’t stop being a racist, the homophobe won’t stop being a homophobe, and the religious fanatic won’t stop being a religious fanatic, no matter how long the report.
The only time a person of {preferred ethnicity} fails to fulfill the potential of their heritage, or even comes within spitting range of a member of the {disfavored ethnicity}, is when they choose to do so.
Probably not. The gay people I’ve known who wanted to get married in the eyes of the law seemed to be motivated primarily by economic and medical issues, like taxation and visitation rights during hospitalization, which would be irrelevant in a post-scarcity environment.
Some of them would, anyway. There are a lot of underexplored intermediate options that the ‘sinful’ would consider amusing, or silly but harmless, and the ‘faithful’ could come to accept as consistent with their own limited understanding of God’s will.
Then I would not approve of that future. And I don’t even care that much about gay rights, compared to other issues, or compared to how much some other people care.
(leaving aside your mischaracterizations of the incompatibilities caused by racists and fanatics)
I freely concede that I’ve mischaracterized the issues in question. There are a number of reasons why I’m not a professional diplomat. A real negotiator, let alone a real superintelligence, would have better solutions.
Would you disapprove as strongly of a future with complex and distasteful political compromises as you would of one in which humanity as we know it is utterly destroyed? Remember, it’s a numerical scale, and the criterion isn’t unconditional approval but rather the direction you tend to move in as more information is revealed.
Of course not. But that’s not what your definition asks.
In fact you specified “approach[ing] complete approval (as a limit)”, which is a much stronger claim than a mere tendency: it implies getting arbitrarily close to total approval, which effectively means unconditional approval once you know as much as you can remember.
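To spell that out (my symbols, only to pin down the reading): if a_h(n) is the interviewee’s approval, on a scale from 0 to 1, after hearing n paragraphs, then “approaching complete approval as a limit” means

    \forall \varepsilon > 0 \;\, \exists N \;\, \forall n \ge N: \quad 1 - a_h(n) < \varepsilon

i.e. the shortfall from total approval eventually drops below any threshold, however small.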
You’re right, I was moving the goalposts there. I stand by my original statement, on the grounds that an AGI with a brain the size of Jupiter would be considerably smarter than all modern human politicians and policymakers put together.
If an intransigent bigot fills up his and/or her memory capacity with easy-to-approve facts before anything controversial gets randomly doled out (which seems quite possible, since the set of facts that any given person will take offense at seems to be a minuscule subset of the set of facts which can be known), wouldn’t that count?
I don’t think that, e.g., a Klan member would ever come close to complete approval of a world without knowing whether miscegenation had been eliminated; people more easily remember what they feel strongly about, so the “memory capacity” wouldn’t be filled with irrelevant details anyway; and if the hypothetical unbiased observer doesn’t select for relevant and interesting facts, no one would listen long enough to get anywhere close to approval. Also, for any AI to actually use the definition as written (plus the later amendments you made), it can’t just assume a particular order of paragraphs for a particular interviewee (or, if it can, we are back at persuasion skills: a sufficiently intelligent AI should be able to persuade anyone it models of anything by selecting the right paragraphs in the right order out of an infinitely long list). Either all possible sequences would have to have complete approval as a limit for all possible interviewees, or the same list would have to be used for all interviewees.
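A toy sketch of the quantifier structure in question, with a finite paragraph list standing in for the countably infinite one and a hypothetical approval_after oracle (every name here is invented for illustration):

    from itertools import permutations

    def passes_definition(paragraphs, interviewees, approval_after, tolerance=0.01):
        """Strictest reading: every interviewee, under every possible ordering of
        the paragraphs, must end up within `tolerance` of complete approval (1.0).
        With a finite list, 'limit' degrades to 'approval after the last paragraph'."""
        for person in interviewees:
            for order in permutations(paragraphs):  # every possible sequence
                if 1.0 - approval_after(person, list(order)) > tolerance:
                    return False  # one persistent dissenter disqualifies the machine
        return True

Letting the AI choose the ordering per interviewee, instead of quantifying over all orderings, is exactly the point where this collapses back into a persuasion contest.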
I agree that it would be extremely difficult to find a world that, when completely and accurately described, would meet with effectively unconditional approval from both Rev. Dr. Martin Luther King, Jr. and a typical high-ranking member of the Ku Klux Klan. It’s almost certainly beyond the ability of any single human to do so directly…
Why, we’d need some sort of self-improving superintelligence just to map out the solution space in sufficient detail! Furthermore, it would need to have an extraordinarily deep understanding of, and willingness to pursue, those values which all humans share.
If it turns out to be impossible, well, that sucks. Time to look for the next-best option.
If the superintelligence makes some mistake or misinterpretation so subtle that a hundred billion humans studying the timeline for their entire lives (and then some) couldn’t spot it, how is that really a problem? I’m still not seeing how any machine could pass this test (100% approval from the entire human race to date) without being Friendly.
Straight-up impossible if their (apparent) values are still the same as before and they haven’t been misled. If one agent prefers the absence of A to its presence, and another agent prefers the presence of A to its absence, you cannot possibly satisfy both agents completely (without deliberately misleading at least one about A). The solution can always be trivially improved for at least one agent by adding or removing A.
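In symbols (my formalization of the point, not the commenter’s): let u_1 and u_2 be the two agents’ preferences over worlds, and write w ⊕ A and w ⊖ A for the world w with the feature A added or removed. If, for every w,

    u_1(w \ominus A) > u_1(w \oplus A) \qquad \text{and} \qquad u_2(w \oplus A) > u_2(w \ominus A)

then no single world leaves both agents completely satisfied: whichever world is chosen, toggling A strictly improves it for one of them.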
Actually, now that you invoke the unknowability of the far-reaching capabilities of a superintelligence, I thought of a very slight possibility of a world meeting your definition even though people have mutually contradictory values:
The world could be deliberately set up in such a way that even a neutral third-party description contained a fully general mind hack for human minds, so that the AI could adjust the values of the hypothetical people tested through the test itself. That’s almost certainly still impossible, but far more plausible than a world meeting the definition without any changing values, which would require all apparent value disagreements to be illusions and the world not to work in the way it appears to.
I think we can generalize that: dissolving an apparent impossibility through the creative power of a superintelligence should be far easier to do in an unfriendly way than in a friendly way, so a friendliness definition had better not contain any apparent impossibilities.
I did not say, or deliberately imply, that nobody’s values would be changed by hearing an infallibly factual description of future events presented by a transcendent entity. In fact, that kind of experience is so powerful that unverified third-hand reports of it happening thousands of years ago retain enough impact to act as a recruiting tactic for several major religions.
Maybe not all, but certainly a lot of apparent value differences really are illusory. In third-world countries, genocide tends to flare up only after a drought leads to crop failures, suggesting that the real motivation is economic and racism is only used as an excuse, or a guide for who to kill without disrupting the social order more than absolutely necessary.
I think this is a lot less impossible than you’re trying to make it sound.
The stuff that people tend to get really passionate about, unwilling to compromise on, isn’t, in my experience, the global stuff. When someone says “I want less A” or “more A” they seem to mean “within range of my senses,” “in the environment where I’m likely to encounter it in the future” or “in my tribe’s territory or the territory of those we communicate with.” An arachnophobe wouldn’t panic upon hearing about a camel-spider three thousand miles away; if anything, the idea that none were on the same continent would be reassuring. An AI capable of terraforming galaxies might satisfy conflicting preferences by simply constructing an ideal environment for each, and somehow ensuring that everyone finds what they’re looking for.
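The assumption being leaned on here, roughly formalized (my notation, and a strong assumption about how preferences factor): suppose each agent i cares only about their own local environment E_i, with the E_i disjoint and independently shapeable by the AI. Then

    u_i(w) = f_i\left(w|_{E_i}\right) \text{ for all } i \;\Longrightarrow\; \text{each } w|_{E_i} \text{ can be chosen to maximize } f_i \text{ independently,}

and the combined world is simultaneously optimal for every agent. Conflicts only reappear where some f_i ranges over other agents’ environments as well.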
The accurate description of such seemingly-impossible perfection would, in a sense, constitute a ‘fully general mind hack,’ in that it would convince anyone who can be convinced by the truth and satisfy anyone who can be satisfied within the laws of physics. If you know of a better standard, I’d like to hear it.
I’m not sure there is any point in continuing this. Once you allow the AI to optimize the human values it’s supposed to be tested against for test compatibility, it’s over.
If, as you assert, pleasing everyone is impossible, and persuading anyone to accept something they wouldn’t otherwise be pleased by (even through a method as benign as giving them unlimited, factual knowledge of the consequences and allowing them to decide for themselves) is unFriendly, do you categorically reject the possibility of friendly AI?
If you think friendly AI is possible, but I’m going about it all wrong, what evidence would convince you that a given proposal was not equivalently flawed?
I’m having some doubts, too. If you decide not to reply, I won’t press the issue.
No, only if you allow acceptance to define friendliness. Leaving changes to the definition of friendliness open as an avenue for fulfilling the goal defined as friendliness will almost certainly result in unfriendliness. Persuasion is not inherently unfriendly, provided it’s not used to short-circuit friendliness.
At an absolute minimum it would need to be possible and not obviously exploitable. It should also not look like a hack. Ideally it should be understandable, give me an idea of what an implementation might look like, be simple and elegant in design, and seem rigorous enough to make me confident that the lack of visible holes is not merely a fact about the creativity of the looker.
Well, I’ll certainly concede that my suggestion fails the feasibility criterion, since a literal implementation might involve compiling a multiple-choice opinion poll with a countably infinite number of questions, translating it into every language and numbering system in history, and then presenting it to a number of subjects equal to the number of people who’ve ever lived multiplied by the average pre-singularity human lifespan in Planck-seconds multiplied by the number of possible orders in which those questions could be presented multiplied by the number of AI proposals under consideration.
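For a sense of scale, here are the finite factors of that product, using rough figures I am assuming for illustration (about 10^11 people ever born, a 70-year average lifespan, Planck time of roughly 5.4e-44 s); the remaining factors, the orderings of a countably infinite question list and the number of proposals, are not even finite:

    # Rough order-of-magnitude check of the finite factors only.
    people_ever = 1e11                     # assumed: humans ever born
    lifespan_s = 70 * 365.25 * 24 * 3600   # assumed: ~70-year lifespan, in seconds
    planck_time_s = 5.39e-44               # Planck time, in seconds

    lifespan_in_planck_times = lifespan_s / planck_time_s   # roughly 4e52
    subject_moments = people_ever * lifespan_in_planck_times

    print(f"lifespan in Planck times: {lifespan_in_planck_times:.2e}")
    print(f"people x Planck-moments:  {subject_moments:.2e}")   # roughly 4e63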
I don’t mind. I was thinking about some more traditional flawed proposals, like the smile-maximizer, and about how they cast the net broadly enough to catch deeply Unfriendly outcomes, and I decided to deliberately err in the other direction: design a test that would be too strict, one that even a genuinely Friendly AI might not be able to pass, but that would definitely exclude any Unfriendly outcome.
Please taboo the word ‘hack.’
What you’re missing is that most people, historically, have been morons.
Basically the same question: Why are you limited to humans? Even supposing you could make a clean evolutionary cutoff (no one before Adam gets to vote), is possessing a particular set of DNA really an objective criterion for having a single vote on the fate of the universe?
There is no truly objective criterion for such decisionmaking, or at least none that you would consider fair or interesting in the least. The criterion is going to have to depend on human values, for the obvious reason that humans are the agents who get to decide what happens now (and yes, they could well decide that other agents get a vote too).
It’s not a matter of votes so much as veto power. CEV is the one where everybody, or at least an idealized version of themselves, gets a vote. In my plan, not everybody gets everything they want. The AI just says “I’ve thought it through, and this is how things are going to go,” then provides complete and truthful answers to any legitimate question you care to ask. Anything you don’t like about the plan, when investigated further, turns out to be either a misunderstanding on your part or a necessary consequence of some other feature that, once you think about it, is really more important.
Yes, most people historically have been morons. Are you saying that morons should have no rights, no opportunity for personal satisfaction or relevance to the larger world? Would you be happy with any AI that had an equivalent degree of contempt for lesser beings?
There’s no particular need to limit it to humans; it’s just that humans have the most complicated requirements. If you want to add a few more orders of magnitude to the processing time and set aside a few planets just to make sure that everything macrobiotic has its own little happy hunting ground, go ahead.
Your scheme requires that the morons can be convinced of the correctness of the AI’s view by argumentation. If your scheme requires all humans to be perfect reasoners, you should mention that up front.
See the posts linked from:
http://wiki.lesswrong.com/wiki/Complexity_of_value
http://wiki.lesswrong.com/wiki/Fake_simplicity
http://wiki.lesswrong.com/wiki/Magical_categories
You might also try my restatement.