My point is that specific behaviors are not the kind of thing we can make decisions about in programming a FAI, so I don’t see how “iff we program it to” applies to a question about the plausibility of a specific behavior. Rather, we can talk about which behaviors seem more or less plausible given the abstract properties the idea of “FAI” assumes, and depending on other parameters that influence a particular variant of its implementation (such as whether it optimizes human or chimp values). So on that level, it’s not plausible that FAI would start torturing people or maximizing paperclips, and these properties are not within the range of variation that the concept includes.
There is a class of artificial intelligence algorithms which can be considered ‘friendly’, and within that class there are algorithms that would reward their creators and others that would not.
“Things that are mostly Friendly” is a huge class in which humanly constructible FAIs are a tiny dot (I expect we can either do a perfect job or none at all, while it’s theoretically, but not humanly, possible to create almost-perfect-but-not-quite FAIs). I’m talking about that dot, and I expect that within that dot the answer to this question is determined one way or the other, and we don’t know which. Is it actually the correct decision to “reward FAI’s creators”? If it is, FAI does it; if it’s not, FAI doesn’t. Whether programmers want it to be done doesn’t plausibly influence whether it’s the correct thing to do; FAI does the correct thing, or it’s not a FAI.
(More carefully, it’s not even clear what the question means, since it compares counterfactuals, and there is still no reliable theory of counterfactual reasoning. Like, “What do you mean, if we did that other thing? Look at what actually happened.” More usefully, the question is probably wrong in the sense that it poses a false dilemma: it assumes things, some of which will likely break.)
My point is that specific behaviors are not the kind of thing we can make decisions about in programming a FAI, so I don’t see how “iff we program it to” applies to a question about the plausibility of a specific behavior.
There is more than one way to program an FAI; see, for example, CEV, which is currently ambiguous. There are also different individuals or groups of individuals that an AI can be friendly to and still qualify as “Friendly Enough” to warrant the label. It is likely that the actual (and coherently extrapolatable) preferences of humans differ with respect to whether rewarding AI-encouragers is a good thing.
“Things that are mostly Friendly” is a huge class in which humanly constructible FAIs are a tiny dot (I expect we can either do a perfect job or none at all, while it’s theoretically, but not humanly, possible to create almost-perfect-but-not-quite FAIs). I’m talking about that dot, and I expect that within that dot the answer to this question is determined one way or the other, and we don’t know which.
I’m pleasantly surprised. It seems that we disagree with respect to actual predictions about the universe, rather than the expected, and more common, “just miscommunication/responding to a straw man”. Within that dot the answer is not determined!
Whether programmers want it to be done doesn’t plausibly influence whether it’s the correct thing to do; FAI does the correct thing, or it’s not a FAI.
I’m familiar with the point—and make it myself rather frequently. It does not apply here—due to the aforementioned rejection of the “determined within the dot” premise.
How likely do you think it is that all humanly-buildable AGIs converge on whatever FAI converges on in less time than it takes for a typical black hole to evaporate? (Eghggh. Time breaks down around singularities (at least from a human perspective) so I can’t phrase this right, but maybe you get my gist.)