Alternate scenario 1: AI wants to find out something that only human beings from a particular era would know, brings them back as simulations as a side-effect of the process it uses to extract their memories, and then doesn’t particularly care about giving them a pleasant environment to exist in.
Alternate scenario 2: failed Friendly AI brings people back and tortures them because some human programmed it with a concept of “heaven” that has a hideously unfortunate implication.
Good news: this one’s remarkably unlikely, since almost all existing Friendly AI approaches are indirect (“look at some samples of real humans and optimize for the output of some formally-specified epistemic procedure for determining their values”) rather than direct (“choirs of angels sing to the Throne of God”).
Not sure how that helps. Would you prefer scenario 2b, with “[..] because its formally-specified epistemic procedure for determining the values of its samples of real humans results in a concept of value-maximization that has a hideously unfortunate implication.”?
You’re saying that enacting the endorsed values of real people taken at reflective equilibrium has an unfortunate implication? To whom? Surely not to the people whose values you’re enacting. Which does leave population-ethics a biiiiig open question for FAI development, but it at least means the people whose values you feed to the Seed AI get what they want.
No, I’m saying that (in scenario 2b) enacting the result of a formally-specified epistemic procedure has an unfortunate implication. Unfortunate to everyone, including the people who were used as the sample against which that procedure ran.
Why? The whole point of a formally-specified epistemic procedure is that, with respect to the people taken as samples, it is right by definition.
Wonderful. Then the unfortunate implication will be right, by definition.
So what?
I’m not sure what the communication failure here is. The whole point is to construct algorithms that extrapolate the value-set of the input people. By doing so, you thus extrapolate a moral code that the input people can definitely endorse, hence the phrase “right by definition”. So where is the unfortunate implication coming from?
A third-party guess: It’s coming from a flaw in the formal specification of the epistemic procedure. That it is formally specified is not a guarantee that it is the specification we would want. It could rest on a faulty assumption, or take a step that appears justified but in actuality is slightly wrong.
Basically, formal specification is a good idea, but not a get-out-of-trouble-free card.
Replying elsewhere. Suffice it to say, nobody would call it a “get out of trouble free” card. More like a “get out of trouble after decades of prerequisite hard work” card, which is precisely why various forms of that hard work are being done now, decades before any kind of AGI is invented, let alone foom-flavored ultra-AI.
I have no idea if this is the communication failure, but I certainly would agree with this comment.
Thanks!
I’m not sure either. Let me back up a little… from my perspective, the exchange looks something like this:
ialdabaoth: what if failed FAI is incorrectly implemented and fucks things up?
eli_sennesh: that won’t happen, because the way we produce FAI will involve an algorithm that looks at human brains and reverse-engineers their values, which then get implemented.
theOtherDave: just because the target specification is being produced by an algorithm doesn’t mean its results won’t fuck things up
e_s: yes it does, because the algorithm is a formally-specified epistemic procedure, which means its results are right by definition.
tOD: wtf?
So perhaps the problem is that I simply don’t understand why it is that a formally-specified epistemic procedure running on my brain to extract the target specification for a powerful optimization process should be guaranteed not to fuck things up.
Ah, ok. I’m going to have to double-reply here, and my answer should be taken as a personal perspective. This is actually an issue I’ve been thinking about and conversing over with an FHI guy, and I’d like to hear any thoughts others might have.
Basically, we want to extract a coherent set of terminal goals from human beings. So far, the problem is being approached from two angles:
1) Neuroscience/neuroethics/neuroeconomics: look at how the human brain actually makes choices, and attempt to describe where and how in the brain terminal values are rooted. See: Paul Christiano’s “indirect normativity” write-up.
2) Pure ethics: there are lots of impulses in the brain that feed into choice, so instead of just picking one of those, let’s sit down and do the moral philosophy on how to “think out” our terminal values. See: CEV, “reflective equilibrium”, “what we want to want”, concepts like that.
My personal opinion is that we also need to add:
3) Population ethics: given the ability to extract values from one human, we now need to sample lots of humans and come up with an ethically sound way of combining the resulting goal functions (“where our wishes cohere rather than interfere”, blah blah blah) into an optimization metric that works for everyone, even if it’s not quite maximally perfect for every single individual (that is, Shlomo might prefer that everyone be Jewish and Abed might prefer that everyone be Muslim, while John likes being secular just fine, but the combined and extrapolated goal function doesn’t perform mandatory religious conversions on anyone).
Now! Here’s where we get to the part where we avoid fucking things up! At least in my opinion, and as a proposal I’ve put forth myself: if we really have an accurate model of human morality, then we should be able to run the value-extraction process on some experimental subjects, predictively generate a course of action through our model behind closed doors, run an experiment on serious moral decision-making, and then find afterwards that (without having seen the generated proposals beforehand) our subjects’ real decisions either match the predicted ones or our subjects endorse the predicted ones.
That is, ideally, we should be able to test our notion of how to epistemically describe morality before we ever make that epistemic procedure or its outputs the goal metric for a Really Powerful Optimization Process. Short of things like bugs in the code or cosmic rays, we would thus (assuming we have time to carry out all the research before $YOUR_GEOPOLITICAL_ENEMY unleashes a paper-clipper For the Evulz) have a good idea what’s going to happen before we take a serious risk.
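To make that concrete, here’s a minimal sketch (in Python) of the kind of closed-door test I have in mind. The names extract_values and predict_decision, and the Subject/Scenario types, are hypothetical placeholders for procedures and data we don’t actually have yet:

```python
# Minimal sketch of the closed-door prediction test described above.
# extract_values and predict_decision are hypothetical placeholders.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Scenario:
    description: str


@dataclass
class Subject:
    name: str
    decide: Callable[[Scenario], str]           # what the person actually does
    endorses: Callable[[str, Scenario], bool]   # whether they endorse a proposed action


def closed_door_test(subjects: List[Subject],
                     scenarios: List[Scenario],
                     extract_values,        # hypothetical: Subject -> value model
                     predict_decision):     # hypothetical: (value model, Scenario) -> action
    """Fraction of (subject, scenario) pairs where the model's prediction either
    matches the subject's real decision or is endorsed by the subject afterwards."""
    hits, total = 0, 0
    for subject in subjects:
        model = extract_values(subject)                    # run the extraction procedure
        for scenario in scenarios:
            predicted = predict_decision(model, scenario)  # generated behind closed doors
            actual = subject.decide(scenario)              # subject decides without seeing the prediction
            if predicted == actual or subject.endorses(predicted, scenario):
                hits += 1
            total += 1
    return hits / total if total else 0.0
```

A high score here only tells us the model tracks what these particular subjects do and endorse today; it is evidence about the extraction procedure, not a proof of Friendliness.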
So, if I’ve understood your proposal, we could summarize it as:
Step 1: we run the value-extractor (seed AI, whatever) on group G and get V.
Step 2: we run a simulation of using V as the target for our optimizer.
Step 3: we show the detailed log of that simulation to G, and/or we ask G various questions about their preferences and see whether their answers match the simulation.
Step 4: based on the results of step 3, we decide whether to actually run our optimizer on V.
Have I basically understood you?
If so, I have two points, one simple and boring, one more complicated and interesting.
The simple one is that this process depends critically on our simulation mechanism being reliable. If there’s a design flaw in the simulator such that the simulation is wonderful but the actual results of running our optimizer are awful, the result of this process is that we endorse a wonderful world, create a completely different awful world, and say “oops.”
So I still don’t see how this avoids the possibility of unfortunate implications. More generally, I don’t think anything we can do will avoid that possibility. We simply have to accept that we might get it wrong, and do it anyway, because the probability of disaster if we don’t do it is even higher.
The more interesting one… well, let’s assume that we do steps 1-3.
Step 4 is where I get lost. I’ve been stuck on this point for years.
I see step 4 going like this:
Some members of G (G1) say “Hey, awesome, sign me up!”
Other members of G (G2) say “I guess? I mean, I kind of thought there would be more $currently_held_sacred_value, but if your computer says this is what I actually want, well, who am I to argue with a computer?”
G3 says “You know, that’s not bad, but what would make it even better is if the bikeshed were painted yellow.”
G4 says “Wait, what? You’re telling me that my values, extrapolated and integrated with everyone else’s and implemented in the actual world, look like that?!? But… but… that’s awful! I mean, that world doesn’t have any $currently_held_sacred_value! No, I can’t accept that.”
G5 says “Yeah, whatever. When’s lunch?” …and so on.
Then we stare at all that and pull out our hair. Is that a successful test? Who knows? What were we expecting, anyway… that all of G would be in G1? Why would we expect that? Even if V is perfectly correct… why would we expect mere humans to reliably endorse it?
Similarly, if we ask G a bunch of questions to elicit their revealed preferences/decisions and compare those to the results of the simulation, I expect that we’ll find conflicting answers. Some things match up, others don’t, some things depend on who we ask or how we ask them or whether they’ve eaten lunch recently.
Actually, I think the real situation is even more extreme than that. This whole enterprise depends on the idea that we have actual values… the so-called “terminal” ones, which we mostly aren’t aware of right now, but are what we would want if we “learned together and grew together and yadda yadda”… which are more mutually reconcilable than the surface values that we claim to want or think we want (e.g., “everyone in the world embraces Jesus Christ in their hearts,” “everybody suffers as I’ve suffered,” “I rule the world!”).
If that’s true, it seems to me we should expect the result of a superhuman intelligence optimizing the world for our terminal values and ignoring our surface values to seem alien and incomprehensible and probably kind of disgusting to the people we are right now.
And at that point we have to ask, what do we trust more? Our own brains, which say “BOO!”, or the value-extractor/optimizer/simulator we’ve built, which says “no, really, it turns out this is what you actually want; trust me.”?
If the answer to that is not immediately “we trust the software far more than our own fallible brains” we have clearly done something wrong.
But… in that case, why bother with the simulator at all? Just implement V, never mind what we think about it. Our thoughts are merely the reports of an obsolete platform; we have better tools now.
This is a special case of a more general principle: when we build tools that are more reliable than our own brains, and our brains disagree with the tools, we should ignore our own brains and obey the tools. Once a self-driving car is good enough, allowing human drivers to override it is at best unnecessary and at worst stupidly dangerous.
Similarly… this whole enterprise depends on building a machine that’s better at knowing what I really want than I am. Once we’ve built the machine, asking me what I want is at best unnecessary and at worst stupidly dangerous.
The idea is not to run a simulation of a tiny little universe, merely a simulation of a few people’s moral decision processes. Basically, run a program that prints out what our proposed FAI would have done given some situations, show that to our sample people, and check if they actually endorse the proposed course of action.
(There’s another related proposal for getting Friendly AI called value learning, which I’ve been scrawling notes on today. Basically, the idea is that the AI will keep a pool of possible utility functions (which are consistent, VNM-rational utility functions by construction), and we’ll use some evidence about humans to rate the probability that a given utility function is Friendly. Depending on the details of this whole process and the math actually working out, you would get a learning agent that steadily refines its utility function to be more and more one that humans can endorse.)
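A minimal sketch of that update loop, assuming we already have a finite pool of candidate (VNM-consistent) utility functions and a likelihood model for evidence about humans; both are placeholders, not anything that currently exists:

```python
# Minimal sketch of Bayesian value learning over a pool of candidate utility
# functions. The pool and the likelihood model are hypothetical placeholders.

from typing import Callable, Dict, Hashable, Iterable


def update_friendliness_beliefs(
        prior: Dict[Hashable, float],                     # candidate utility fn -> P(candidate is the Friendly one)
        evidence: Iterable,                               # observations about humans
        likelihood: Callable[[Hashable, object], float],  # P(observation | candidate), hypothetical
) -> Dict[Hashable, float]:
    """Bayes-update the probability that each candidate utility function is the
    one humans would endorse, given a stream of evidence about humans."""
    posterior = dict(prior)
    for observation in evidence:
        unnormalized = {c: p * likelihood(c, observation) for c, p in posterior.items()}
        z = sum(unnormalized.values())
        if z == 0:
            raise ValueError("Evidence impossible under every candidate; the pool is misspecified.")
        posterior = {c: w / z for c, w in unnormalized.items()}
    return posterior
```

Whether anything like this actually converges on a utility function humans endorse depends entirely on the candidate pool and the likelihood model, which is exactly the part that remains an open problem.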
This is why I did actually say that population ethics is a wide-open problem in machine ethics. Meaning, yes, the population has broken into political factions. Humans have a noted tendency to do that.
Now, the whole point of Coherent Extrapolated Volition on a population-ethics level was to employ a fairly simple population-ethical heuristic: “where our wishes cohere rather than interfere”. Which, it seems to me, means: if people’s wishes run against each other, do nothing at all; do something only if there is unanimous, near-unanimous, or supermajority agreement. It’s very democratic, in its way, but it will probably also end up implementing only the lowest common denominator. The result I expect to see from a naive all-humanity CEV with that population-ethic is something along the lines of, “People’s health is vastly improved, mortality becomes optional, food ripens more easily and is tastier, and everyone gets a house. You humans couldn’t actually agree on much more.”
Which is all well and good, but it’s not much more than we could have gotten without going to the trouble of building a Friendly Artificial Intelligence!
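To make the “cohere rather than interfere” heuristic concrete, here’s a minimal sketch that enacts only the proposed changes a supermajority of the sampled population endorses. The endorsement data and the 0.9 threshold are illustrative assumptions, not part of any actual CEV specification:

```python
# Minimal sketch of a "cohere rather than interfere" filter: enact a proposed
# change only if a supermajority of sampled people endorse it; otherwise do nothing.

from typing import Dict, List


def coherent_changes(endorsements: Dict[str, List[bool]],
                     threshold: float = 0.9) -> List[str]:
    """endorsements maps each proposed change to one yes/no verdict per sampled
    person; return only the changes that clear the supermajority threshold."""
    return [change for change, verdicts in endorsements.items()
            if verdicts and sum(verdicts) / len(verdicts) >= threshold]


# Illustrative: near-universal wishes clear the bar, contested ones don't.
votes = {
    "cure disease, make mortality optional": [True] * 95 + [False] * 5,
    "convert everyone to one religion":      [True] * 30 + [False] * 70,
}
print(coherent_changes(votes))  # -> ['cure disease, make mortality optional']
```

Which is exactly why I expect the lowest-common-denominator result described above.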
Population ethics will also be quite a problem, I would venture to guess, because our tiny little ape-brains don’t have any hardware or instincts for large-scale population ethics. Our default moral machinery, when we are born, is configured to be loyal to our tribe, to help our friends and family, and to kill the shit out of the other tribe.
It will take some solid new intellectual developments in ethics to come up with a real way of dealing with the problem you’ve stated. I mean, sure, you could just go to the other extreme and say an FAI should shoot for population-ethical Pareto optimality, but that requires defining population-ethical Pareto optimality. For instance, is a person with many friends and loved ones worth more to your notion of Pareto optimality than a hermit with no close associates, on the grounds that insufficiently catering to the first person’s wishes also hurts their friends and loved ones, while nobody else is hurt by the hermit’s Pareto-optimal misery?
What I can say, at the very least, is that prospective AI designers and an FAI itself should listen hard to groups G4 and G2 and figure out, ideally, how to actually fix the goal function so that those groups shift over into actively assenting (that is, into G1). We should treat endorsement of FAI like consent to sex: active enthusiasm should be the minimum standard of endorsement, with mere quiet acquiescence treated as a strong sign of “NO! STOP!”.
Reality is what it is. We certainly can’t build a real FAI by slicing a goat’s neck within the bounds of a pentagram at 1am on the Temple Mount in Jerusalem chanting ancient Canaanite prayers that translate as “Come forth and optimize for our values!”. A prospective FAI-maker would have to actually know and understand moral psychology and neuro-ethics on a scientific level, which would give them a damn good idea of what sort of preference function they’re extracting.
(By the way, the whole point of reflective equilibrium is to wash out really stupid ideas like “everyone suffers as I’ve suffered”, which has never actually done anything for anyone.)
Supposedly, yeah. I would certainly hope that a real FAI would understand that people prefer gradual transitions, and would not completely overthrow everything all at once to any degree greater than strictly necessary.
Well, why are you proposing to trust the program’s output? Check with the program’s creators, check the program’s construction, and check its input. It’s not an eldritch tome of lore, it’s a series of scientific procedures. The whole great, awesome thing about science is that results are checkable, in principle, by anyone with the resources to do so.
Run the extraction process over again! Do repeated experiments, possibly even on separate samples, yield similar answers? If so, well, it may well be that the procedure is trustworthy, in that it does what its creators have specified that it does.
In which case, check that the specification does definitely correspond, in some way, to “extract what we really want from what we supposedly want.”
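As a minimal sketch of what that replication check might look like, where extract_values and the agreement metric are placeholders for procedures that would have to come out of the research described above:

```python
# Minimal sketch of the replication check: run the (hypothetical) extraction
# procedure on independent samples and measure how much the results agree.

from itertools import combinations
from statistics import mean
from typing import Callable, Sequence


def replication_score(samples: Sequence[list],   # independent samples of people
                      extract_values: Callable,  # hypothetical: sample -> value model
                      agreement: Callable[[object, object], float]) -> float:
    """Average pairwise agreement (in [0, 1]) between value models extracted from
    independent samples. High scores suggest the procedure does the same thing
    each time; they do not show it does the right thing."""
    models = [extract_values(sample) for sample in samples]
    pairs = list(combinations(models, 2))
    if not pairs:
        return 1.0
    return mean(agreement(a, b) for a, b in pairs)
```

That last caveat is the point: reproducibility tells you the procedure is stable, and only then is it worth asking whether its specification matches “extract what we really want from what we supposedly want.”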
In conclusion, SCIENCE!
So, suppose we do this, and we conclude that our FAI is in fact capable of reliably proposing courses of action that, in general terms, people endorse.
It seems clear to me that’s not enough to show that it will not fuck things up when it comes time to actually implement changes in the real world. Do you disagree? Because back at the beginning of this conversation, it sounded like you were claiming you had in mind a process that was guaranteed not to fuck up, which is what I was skeptical about.
Well, I certainly expect that to work better than not using evidence. Beyond that, I’m really not sure what to say about it. Here again… suppose this procedure works wonderfully, and as a consequence of climbing that hill we end up with a consistent set of VNM-rational utility functions that humans reliably endorse when they read about them.
It seems clear to me that’s not enough to show that it will not fuck things up when it comes time to actually implement changes in the real world. Do you disagree?
Now you might reply “Well it’s the best we can do!” and I might agree. As I said earlier, we simply have to accept that we might get it wrong, and do it anyway, because the probability of disaster if we don’t do it is even higher. But let’s not pretend there’s no chance of failure.
I’m not sure I would describe those subgroups as political factions, necessarily… they’re just people expressing opinions at this stage. But sure, I could imagine analogous political factions.
Well, now, this is a different issue. I actually agree with you here, but I was assuming for the sake of argument that the CEV paradigm actually works, and gets a real, worthwhile converged result from G. That is, I’m assuming for the sake of comity that G actually would, if they were “more the people they wished to be” and so on and so forth in all the poetic language of the CEV paper, agree on V, and that our value-extractor somehow figures that out because it’s really well-designed.
My point was that it doesn’t follow from that that G as they actually are will agree on V.
Sure, I agree—both that that’s the point of RE, and that ESAIS is a really stupid (though popular) idea.
But reflective equilibrium is a method with an endpoint we approach asymptotically. The degree of reflective equilibrium humans can reliably achieve after being put in a quiet, air-conditioned room for twenty minutes, fed nutritious food and played soothing music for that time, and then asked questions is less than that which we can achieve after ten years or two hundred years.
At some point, we have to define a target of how much reflective equilibrium we expect from our input, and from our evaluators. The further we shift our target away from where we are right now, the more really stupid ideas we will wash out, and the less likely we are to endorse the result. The further we shift it towards where we are, the more stupid ideas we keep, and the more likely we are to endorse the result.
I feel like we’re just talking past each other at this point, actually. I’m not talking about how quickly the FAI optimizes the world, I’m talking about whether we are likely to endorse the result of extracting our actual values.
(sigh) Yeah, OK. Tapping out now.
OOOOOOOOOOOOOOOOOOOOH. Ah. Ok. That is actually an issue, yes! Sorry I didn’t get what you meant before!
My answer is: that is an open problem, in the sense that we kind of need to know much more about neuro-ethics to answer it. It’s certainly easy to imagine scenarios in which, for instance, the FAI proposes to make all humans total moral exemplars, and as a result all the real humans who secretly like being sinful, even if they don’t endorse it, reject the deal entirely.
Yes, we have several different motivational systems, and the field of machine ethics tends to brush this under the rug by referring to everything as “human values” simply because the machine-ethics folks tend to contrast humans with paper-clippers to make a point about why machine-ethics experts are necessary.
This kind of thing is an example of the sort of consideration that needs to happen to get anywhere. You are correct in saying that if FAI designers want their proposals to be accepted by the public (or even the general body of the educated elite), they need to cater not only to meta-level moral wishes but to the actual desires and affections real people feel today. I would certainly argue this is an important component of Friendliness design.
This assumes that people are unlikely to endorse smart ideas. I personally disagree: many ideas are difficult to locate in idea-space, but easy to evaluate. Life extension, for example, or marriage for romance.
No, I have not solved AI Friendliness all on my lonesome. That would be a ridiculous claim, a crackpot sort of claim. I just have a bunch of research notes that, even with their best possible outcome, leave lots of open questions and remaining issues.
Certainly there’s a chance of failure. I just think there’s a lot we can and should do to reduce that chance. The potential rewards are simply too great not to.
For scenario 1, it would almost certainly require less free energy just to get the information directly from the brain without ever bringing the person to consciousness.
For scenario 2, you would seriously consider suicide if you feared that a failed Friendly AI might soon be developed. Indeed, since there is a chance you will become incapacitated (say, by falling into a coma), you might want to destroy your brain long before such an AI could arise.