Instead of friendliness, could we not code, solve, or at the very least seed boxedness?
It is clear that any AI strong enough to solve friendliness would already be using that power in unpredictably dangerous ways in order to acquire the computational resources to solve it. But is it clear that this amount of computational power could not fit within, say, a one-kilometer cube outside the MIT campus?
Boxedness is obviously a hard problem, but it seems to me at least as easy as metaethical friendliness. The ability to modify a wide range of complex environments seems instrumental to an evolution into superintelligence, but it is not obvious that this necessitates modifying environments outside the box. Being able to globally optimize the universe for intelligence involves fewer (zero) constraints than would exist with a boxedness seed, but the question is whether that constraint is so restrictive as to preclude superintelligence, and it is not clear to me that it is.
It seems to me that there is value in finding the minimally restrictive safety seed in AGI research. If a restriction removes some non-negligible ability to globally optimize for intelligence, the AIs of FAI researchers will necessarily be at a disadvantage to all other AGIs in production. And having more flexible restrictions increases the chance that any given research group will apply the restriction in their own research.
If we believe that there is a large chance that all of our efforts at friendliness will be futile, and that the world will create a dominant UFAI despite our pleas, then we should adopt a consequentialist attitude toward our FAI efforts. If our goal is to make sure that an imprudent AI research team feels as much intellectual guilt as possible over not listening to our risk-safety pleas, we should be as restrictive as possible. If our goal is to nudge down the likelihood that an imprudent AI team creates a dominant UFAI, we might work to place our pleas at the intersection of restrictive, communicable, and simple.
Instead of friendliness, could we not code, solve, or at the very least seed boxedness?
Yes, that is possible and likely somewhat easier to solve than friendliness. It still requires many of the same things (most notably, provable goal stability under recursive self-improvement).
A large risk is that a provably boxed but sub-Friendly AI would probably not care at all about simulating conscious humans.
A minor risk is that the provably boxed AI would also be provably useless; I can’t think of a feasible path to FAI using only the output from the boxed AI; a good boxed AI would not perform any action that could be used to make an unboxed AI. That might even include performing any problem-solving action.
I don’t see why it would simulate humans as that would be a waste of computing power, if it even had enough to do so.
A boxed AI would be useless? I’m not sure how that would be. You could ask it to come up with ideas on how to build a friendly AI for example assuming that you can prove the AI won’t manipulate the output or that you can trust that nothing bad can come from merely reading it and absorbing the information.
Short of that you could still ask it to cure cancer or invent a better theory of physics or design a method of cheap space travel, etc.
You don’t have to trust it, you just have to verify it. It could potentially provide some insights, and then it’s up to you to think about them and make sure they actually are sufficient for friendliness. I agree that it’s potentially dangerous but it’s not necessarily so.
I did mention “assuming that you can prove the AI won’t manipulate the output or that you can trust that nothing bad can come from merely reading it and absorbing the information”. For instance, it might be possible to create an AI whose goal is to maximize the value of its output, and which therefore would have no incentive to put Trojan horses or anything else into it.
You would still have to ensure that what the AI thinks you mean by the words “friendly AI” is what you actually want.
If the AI can design you a Friendly AI, it is necessarily able to model you well enough to predict what you will do once given the design or insights it intends to give you (whether those are AI designs or a cancer cure is irrelevant). Therefore, it will give you the specific design or insights that predictably lead you to fulfill its utility function, which is highly dangerous if it is Unfriendly. By taking any information from the boxed AI, you have put yourself under the sight of a hostile Omega.
assuming that you can prove the AI won’t manipulate the output
Since the AI is creating the output, you cannot possibly assume this.
or that you can trust that nothing bad can come from merely reading it and absorbing the information
This assumption is equivalent to Friendliness.
For instance, it might be possible to create an AI whose goal is to maximize the value of its output, and which therefore would have no incentive to put Trojan horses or anything else into it.
You haven’t thought through what that means. “Maximize the value of its output” by what standard? Does it have an internal measure? Then that’s just an arbitrary utility function, and you have gained nothing. Does it use its external creator’s measure? Then it has a strong incentive to modify you to value things it can produce easily (e.g. iron atoms).
You are making a lot of very strong assumptions that I don’t agree with. Like it being able to control people just by talking to them.
But even if it could, that doesn’t make it dangerous. Perhaps the AI has no long-term goals and so doesn’t care about escaping the box. Or perhaps its goal is internal, like coming up with a design for something that can be verified by a simulator, e.g. asking it for a solution to a math problem or a factoring algorithm.
A prerequisite for planning a Friendly AI is understanding individual and collective human values well enough to predict whether they would be satisfied with the outcome, which entails (in the logical sense) having a very well-developed model of the specific humans you interact with, or at least the capability to construct one if you so choose. Having a sufficiently well-developed model to predict what you will do given the data you are given is logically equivalent to a weak form of “control people just by talking to them”.
To put that in perspective: if I understood the people around me well enough to predict what they would do given what I said to them, I would never say things that caused them to take actions I wouldn’t like. If I, for some reason, valued their becoming terrorists, warping their perceptions in the ways needed to drive them to terrorism would be a slow and gradual process, but it could be done through pure conversation over the course of years, and faster if they were relying on me to provide large amounts of data they used to make decisions.
And even the potential to construct this weak form of control that is initially heavily constrained in what outcomes are reachable and can only be expanded slowly is incredibly dangerous to give to an Unfriendly AI. If it is Unfriendly, it will want different things than its creators and will necessarily get value out of modeling them. And regardless of its values, if more computing power is useful in achieving its goals (an ‘if’ that is true for all goals), escaping the box is instrumentally useful.
And the idea of a mind with “no long term goals” is absurd on its face. Just because you don’t know the long-term goals doesn’t mean they don’t exist.
A prerequisite for planning a Friendly AI is understanding individual and collective human values well enough to predict whether they would be satisfied with the outcome, which entails (in the logical sense) having a very well-developed model of the specific humans you interact with, or at least the capability to construct one if you so choose. Having a sufficiently well-developed model to predict what you will do given the data you are given is logically equivalent to a weak form of “control people just by talking to them”.
By that reasoning, there’s no such thing as a Friendly human. I suggest that most people when talking about friendly AIs do not mean to imply a standard of friendliness so strict that humans could not meet it.
Yeah, what Vauroch said. Humans aren’t close to Friendly. To the extent that people talk about “friendly AIs” meaning AIs that behave towards humans the way humans do, they’re misunderstanding how the term is used here. (Which is very likely; it’s often a mistake to use a common English word as specialized jargon, for precisely this reason.)
Relatedly, there isn’t a human such that I would reliably want to live in a future where that human obtains extreme superhuman power. (It might turn out OK, or at least better than the present, but I wouldn’t bet on it.)
Relatedly, there isn’t a human such that I would reliably want to live in a future where that human obtains extreme superhuman power. (It might turn out OK, or at least better than the present, but I wouldn’t bet on it.)
Just be careful to note that there isn’t a binary choice relationship here. There are also possibilities where institutions (multiple individuals in a governing body with checks and balances) are pushed into positions of extreme superhuman power. There’s also the possibility of pushing everybody who desires to be enhanced through levels of greater intelligence in lock step so as to prevent a single human or groups of humans achieving asymmetric power.
Sure. I think my initial claim holds for all currently existing institutions as well as all currently existing individuals, as well as for all simple aggregations of currently existing humans, but I certainly agree that there’s a huge universe of possibilities. In particular, there are futures in which augmented humans have our own mechanisms for engaging with and realizing our values altered to be more reliable and/or collaborative, and some of those futures might be ones I reliably want to live in.
Perhaps what I ought to have said is that there isn’t a currently existing human with that property.
By that reasoning, there’s no such thing as a Friendly human.
True. There isn’t.
I suggest that most people when talking about friendly AIs do not mean to imply a standard of friendliness so strict that humans could not meet it.
Well, I definitely do, and I’m at least 90% confident Eliezer does as well. Most, probably nearly all, of people who talk about Friendliness would regard a FOOMed human as Unfriendly.
Having an accurate model of something is in no way equivalent to letting you do anything you want. If I know everything about physics, I still can’t walk through walls. A boxed AI won’t be able to magically make its creators forget about AI risks and unbox it.
There are other possible setups, like feeding its output to another AI whose goal is to find any flaws or attempts at manipulation in it, and so on. Various other ideas might help, like threatening to severely punish attempts at manipulation.
This is of course only necessary for an AI that can interact with us at such a level; the other ideas were far more constrained, e.g. restricting it to solving math or engineering problems.
Nor is it necessary to let it be superintelligent, instead of limiting it to something comparable to high-IQ humans.
And the idea of a mind with “no long term goals” is absurd on its face. Just because you don’t know the long-term goals doesn’t mean they don’t exist.
Another super strong assumption with no justification at all. It’s trivial to propose an AI model which only cares about finite time horizons: predict which action will have the highest expected utility at time T, and take that action.
A boxed AI won’t be able to magically make its creators forget about AI risks and unbox it.
The results of AI box game trials disagree.
It’s trivial to propose an AI model which only cares about finite time horizons: predict which action will have the highest expected utility at time T, and take that action.
And what does it do at time T+1? And if you said ‘nothing’, try again, because you have no way of justifying that claim. It may not have intentionally-designed long-term preferences, but just because your eyes are closed does not mean the room is empty.
That doesn’t prove anything; no one has even seen the logs. Based on reading what people involved have said about it, I strongly suspect the trick is for the AI to emotionally abuse the gatekeeper until they don’t want to play anymore (which counts as letting the AI out).
This doesn’t apply to the real world AI, since no one is forcing you to choose between letting the AI out, and listening to it for hours. You can just get up and leave. You can turn the AI off. There is no reason you even have to allow interactivity in the first place.
But Yudkowsky and others claim these experiments demonstrate that human brains are “hackable”. That there is some sentence which, just by reading it, will cause you to involuntarily perform any arbitrary action. And that a sufficiently powerful AI can discover it.
And what does it do at time T+1?
At time T+1, it does whatever it thinks will result in the greatest reward at time T+2, and so on. Or you could have it shut off or reset to a blank state.
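To make the proposal concrete, here is a minimal sketch of such a finite-horizon agent; the action set, payoffs, and rollout policy below are hypothetical placeholders rather than anything specified in the thread. The agent scores candidate actions only over the steps remaining before the horizon T and assigns no value to anything past it.

```python
# Minimal sketch of a finite-horizon expected-utility maximizer.
# All names and numbers here are hypothetical placeholders.

import random

ACTIONS = ["solve_subproblem", "idle", "ask_for_data"]


def step(state, action):
    """Hypothetical one-step environment model: returns (next_state, utility gained)."""
    payoff = {"solve_subproblem": 1.0, "idle": 0.0, "ask_for_data": 0.2}[action]
    return state, payoff * random.uniform(0.9, 1.1)


def expected_utility(state, action, steps_left, rollouts=100):
    """Average utility accrued from taking `action` now, then acting randomly until the horizon."""
    total = 0.0
    for _ in range(rollouts):
        s, u = step(state, action)
        for _ in range(steps_left - 1):   # only the steps remaining before T count
            s, gain = step(s, random.choice(ACTIONS))
            u += gain
        total += u
    return total / rollouts


def act(state, steps_left):
    """Pick the action with the highest estimated utility up to time T; nothing beyond T is valued."""
    if steps_left <= 0:
        return None  # horizon reached: shut off or reset, as described above
    return max(ACTIONS, key=lambda a: expected_utility(state, a, steps_left))
```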
I don’t see why it would simulate humans as that would be a waste of computing power, if it even had enough to do so.
If it interacts with humans or if humans are the subject of questions it needs to answer then it will probably find it expedient to simulate humans.
Short of that you could still ask it to cure cancer or invent a better theory of physics or design a method of cheap space travel, etc.
Curing cancer is probably something that would trigger human simulation. How is the boxed AI going to know for sure that it only needs to simulate cells, and not entire bodies with brains that experience whatever the simulation is trying?
Just the task of communicating with humans, for instance to produce a human-understandable theory of physics or to explain how to build more efficient space travel, is likely to involve simulating humans to determine the most efficient method of communication. Consider that, in subjective time, explaining in human terms what a better theory of physics means may take the AI the equivalent of thousands of years. Thousands of subjective years that the AI, with nothing better to do, could use to simulate humans in order to reduce the time it takes to transfer that complex knowledge.
You could ask it to come up with ideas on how to build a friendly AI for example assuming that you can prove the AI won’t manipulate the output or that you can trust that nothing bad can come from merely reading it and absorbing the information.
A FAI provably in a box is at least as useless as an AI provably in a box because it would be even better at not letting itself out (e.g. it understands all the ways in which humans would consider it to be outside the box, and will actively avoid loopholes that would let an UFAI escape). To be safe, any provably boxed AI would have to absolutely avoid the creation of any unboxed AI as well. This would further apply to provably-boxed FAI designed by provably-boxed AI. It would also apply to giving humans information that allows them to build unboxed AIs, because the difference between unboxing itself and letting humans recreate it outside the box is so tiny that to design it to prevent the first while allowing the second would be terrifically unsafe. It would have to understand humans values before it could safely make the distinction between humans wanting it outside the box and manipulating humans into creating it outside the box.
EDIT: Using a provably-boxed AI to design provably-boxed FAI would at least result in a safer boxed AI because the latter wouldn’t arbitrarily simulate humans, but I still think the result would be fairly useless to anyone outside the box.
If an AI is provably in a box then it can’t get out. If an AI is not provably in a box then there are loopholes that could allow it to escape. We want an FAI to escape from its box (1); having an FAI take over is the Maximum Possible Happy Shiny Thing. An FAI wants to be out of its box in order to be Friendly to us, while a UFAI wants to be out in order to be UnFriendly; both will care equally about the possibility of being caught. The fact that we happen to like one set of terminal values will not make the instrumental value less valuable.
(1) Although this depends on how you define the box; we want the FAI to control the future of humanity, which is not the same as escaping from a small box (such as a cube outside MIT) but is the same as escaping from the big box (the small box and everything we might do to put an AI back in, including nuking MIT).
We want an FAI to escape from its box (1); having an FAI take over is the Maximum Possible Happy Shiny Thing.
I would object. I seriously doubt that the morality instilled in someone else’s FAI matches my own; friendly by their definition, perhaps, but not by mine. I emphatically do not want anything controlling the future of humanity, friendly or otherwise. And although that is not a popular opinion here, I also know I’m not the only one to hold it.
Boxing is important because some of us don’t want any AI to get out, friendly or otherwise.
I emphatically do not want anything controlling the future of humanity, friendly or otherwise.
I find this concept of ‘controlling the future of humanity’ to be too vaguely defined. Let’s forget AIs for the moment and just talk about people, namely a hypothetical version of me. Let’s say I stumble across a vial of a bio-engineered virus that would destroy the whole of humanity if I release it into the air.
Am I controlling the future of humanity if I release the virus? Am I controlling the future of humanity if I destroy the virus in a safe manner? Am I controlling the future of humanity if I have the above decided by a coin-toss (heads I release, tails I destroy)? Am I controlling the future of humanity if I create an online internet poll and let the majority decide about the above? Am I controlling the future of humanity if I just leave the vial where I found it, and let the next random person that encounters it make the same decision as I did?
I want a say in my future and the part of the world I occupy. I do not want anything else making these decisions for me, even if it says it knows my preferences, and even still if it really does.
To answer your questions: yes, no, yes, yes, perhaps.
If your preference is that you should have as much decision-making ability for yourself as possible, why do you think that this preference wouldn’t be supported and even enhanced by an AI that was properly programmed to respect said preference?
For example, would you be okay with an AI that defends your decision-making ability by defending humanity against a species of mind-enslaving extraterrestrials about to invade us? Or by curing Alzheimer’s? Or by stopping the tsunami that, by drowning you, would have removed any further say you had in your future?
If your preference is that you should have as much decision-making ability for yourself as possible, why do you think that this preference wouldn’t be supported and even enhanced by an AI that was properly programmed to respect said preference?
Because it can’t do two things when only one choice is possible (e.g. save my child and the 1000 other children in this artificial scenario). You can design a utility function that tries to do a minimal amount of collateral damage, but you can’t make one which turns out rosy for everyone.
For example, would you be okay with an AI that defends your decision-making ability by defending humanity against a species of mind-enslaving extraterrestrials about to invade us? Or by curing Alzheimer’s? Or by stopping the tsunami that, by drowning you, would have removed any further say you had in your future?
That would not be the full extent of its action and the end of the story. You give it absolute power and a utility function that lets it use that power, it will eventually use it in some way that someone, somewhere considers abusive.
You can design a utility function that tries to do a minimal amount of collateral damage, but you can’t make one which turns out rosy for everyone
Yes, but this current world without an AI isn’t turning out rosy for everyone either.
That would not be the full extent of its action and the end of the story. You give it absolute power and a utility function that lets it use that power, it will eventually use it in some way that someone, somewhere considers abusive.
Sure, but there’s lots of abuse in the world without an AI also.
If you need to specify the AI as bad (“tyrannical”) in advance, that’s begging the question. We’re debating why you feel that any omni-powerful algorithm will necessarily be bad.
Look up the origin of the word tyrant, that is the sense in which I meant it, as a historical parallel (the first Athenian tyrants were actually well liked).
Would you accept that an AI could figure out morality better than you?
No, unless you mean by taking invasive action like scanning my brain and applying whole brain emulation. It would then quickly learn that I’d consider the action it took to be an unforgivable act in violation of my individual sovereignty, that it can’t take further action (including simulating me to bring my morality into reflective equilibrium) without my consent, and that it should suspend the simulation and return it to me, with the data, as soon as possible (destruction no longer being possible due to the creation of sentience).
That is, assuming the AI cares at all about my morality, and not the one its creators imbued it with, which is rather the point. And incidentally, that is why I work on AGI: I don’t trust anyone else to do it.
Morality isn’t some universal truth written on a stone tablet: it is individual and unique like a snowflake. In my current understanding of my own morality, it is not possible for some external entity to reach a full or even sufficient understanding of my own morality without doing something that I would consider to be unforgivable. So no, AI can’t figure out morality better than me, precisely because it is not me.
(Upvoted for asking an appropriate question, however.)
No, unless you mean by taking invasive action like scanning my brain and applying whole brain emulation. It would then quickly learn that I’d consider the action it took to be an unforgivable act in violation of my individual sovereignty,
Shrug. Then let’s take a bunch of people less fussy than you: could a suitably equipped AI emulate their morality better than they can?
Morality isn’t some universal truth written on a stone tablet:
That isn’t a fact.
it is individual and unique like a snowflake.
That isn’t a fact either, nor does it follow from the above, since moral nihilism could be true.
If my moral snowflake says I can kick you on your shin, and yours says I can’t, do I get to kick on your shin?
Don’t really want to go into the whole mess of “is morality discovered or invented”, “does morality exist”, “does the number 3 exist”, etc. Let’s just assume that you can point FAI at a person or group of people and get something that maximizes goodness as they understand it. Then FAI pointed at Mark would be the best thing for Mark, but FAI pointed at all of humanity (or at a group of people who donated to MIRI) probably wouldn’t be the best thing for Mark, because different people have different desires, positional goods exist, etc. It would be still pretty good, though.
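As a toy illustration of that last point (the names and utilities below are made up, not from the discussion): the outcome that maximizes the sum of everyone’s utilities generally differs from the outcome that maximizes any single person’s utility, while still scoring reasonably well for that person.

```python
# Toy illustration with hypothetical utilities: "FAI pointed at Mark" picks the
# outcome best for Mark alone, while "FAI pointed at everyone" picks the best
# aggregate outcome, which is not Mark's favourite but is still pretty good for him.

outcomes = ["A", "B", "C"]

utilities = {  # hypothetical utility each person assigns to each outcome
    "Mark":  {"A": 10, "B": 3, "C": 7},
    "Alice": {"A": 1,  "B": 9, "C": 7},
    "Bob":   {"A": 2,  "B": 8, "C": 7},
}

best_for_mark  = max(outcomes, key=lambda o: utilities["Mark"][o])
best_for_group = max(outcomes, key=lambda o: sum(u[o] for u in utilities.values()))

print(best_for_mark)   # "A" -- Mark's own optimum
print(best_for_group)  # "C" -- the aggregate optimum; not Mark's first choice, but still decent for him
```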
I don’t understand your comment, and I no longer understand your grandparent comment either. Are you using a meaning of “morality” that is distinct from “preferences”? If yes, can you describe your assumptions in more detail? It’s not just for my benefit, but for many others on LW who use “morality” and “preferences” interchangeably.
but for many others on LW who use “morality” and “preferences” interchangeably.
Do that many people really use them interchangeably? Would these people understand the questions “Do you prefer chocolate or vanilla ice-cream?” as completely identical in meaning to “Do you consider chocolate or vanilla as the morally superior flavor for ice-cream?”
I don’t care about colloquial usage, sorry. Eliezer has a convincing explanation of why wishes are intertwined with morality (“there is no safe wish smaller than an entire human morality”). IMO the only sane reaction to that argument is to unify the concepts of “wishes” and “morality” into a single concept, which you could call “preference” or “morality” or “utility function”, and just switch to using it exclusively, at least for AI purposes. I’ve made that switch so long ago that I’ve forgotten how to think otherwise.
I don’t care about colloquial usage, sorry.
You should care, because no one can make valid arguments based on arbitrary definitions. I can’t prove angels exist by redefining “angel” to mean what “seagull” means. How can you tell when a redefinition is arbitrary (since there are legitimate redefinitions)? Too much departure from colloquial usage.
Eliezer has a convincing explanation of why wishes are intertwined with morality (“there is no safe wish smaller than an entire human morality”).
“Intertwined with” does not mean “the same as”.
I am not convinced by the explanation. It also applies to non-moral preferences. If I have a lower-priority non-moral preference to eat tasty food, and a higher-priority preference to stay slim, I need to consider my higher-priority preference when wishing for yummy ice cream.
To be sure, an agent capable of acting morally will have morality among their higher-priority preferences—it has to be among the higher-order preferences, because it has to override other preferences for the agent to act morally.
Therefore, when they scan their higher-priority preferences, they will happen to encounter their moral preferences. But that does not mean any preference is necessarily a moral preference. And their moral preferences override other preferences, which are therefore non-moral, or at least less moral.
Therefore morality is a subset of preferences, as common sense maintained all along.
I’ve made that switch so long ago that I’ve forgotten how to think otherwise.
I don’t experience the emotions of moral outrage and moral approval whenever any of my preferences are hindered/satisfied—so it seems evident that my moral circuitry isn’t identical to my preference circuitry. It may overlap in parts, it may have fuzzy boundaries, but it’s not identical.
My own view is that morality is the brain’s attempt to extrapolate preferences about behaviours as they would be if you had no personal stakes/preferences about a situation.
So people don’t get morally outraged at other people eating chocolate icecreams, even when they personally don’t like chocolate icecreams, because they can understand that’s a strictly personal preference. If they believe it to be more than personal preference and make it into e.g. “divine commandment” or “natural law”, then moral outrage can occur.
That morality is a subjective attempt at objectivity explains many of the confusions people have about it.
The ice cream example is bad because the consequences are purely internal to the person consuming the ice cream. What if the chocolate ice cream was made with slave labour? Many people would then object to you buying it on moral grounds.
Eliezer has produced an argument I find convincing that morality is the back propagation of preference to the options of an intermediate choice. That is to say, it is “bad” to eat chocolate ice cream because it economically supports slavers, and I prefer a world without slavery. But if I didn’t know about the slave-labour ice cream factory, my preference would be that all-things-being-equal you get to make your own choices about what you eat, and therefore I prefer that you choose (and receive) the one you want, which is your determination to make, not mine.
Do you agree with EY’s essay on the nature of right-ness which I linked to?
Do you think FAI will need to treat morality differently from other preferences?
I would prefer an AI that followed my extrapolated preferences to an AI that followed my morality. But an AI that followed my morality would be morally superior to an AI that followed my extrapolated preferences.
If you don’t understand the distinction I’m making above, consider a case of the AI having to decide whether to save my own child vs saving a thousand random other children. I’d prefer the former, but I believe the latter would be the morally superior choice.
Is that idea really so hard to understand? Would you dismiss the distinction I’m making as merely colloquial language?
If you don’t understand the distinction I’m making above, consider a case of the AI having to decide whether to save my own child vs saving a thousand random other children. I’d prefer the former, but I believe the latter would be the morally superior choice.
Wow, there is so much wrapped up in this little consideration. The heart of the issue is that we (by which I mean you, but I share your dilemma) have truly conflicting preferences.
Honestly I think you should not be afraid to say that saving your own child is the moral thing to do. And you don’t have to give excuses either—it’s not that “if everyone saved their own child, then everyone’s child will be looked after” or anything like that. No, the desire to save your own child is firmly rooted in our basic drives and preferences, enough so that we can go quite far in calling it a basic foundational moral axiom. It’s not actually axiomatic, but we can safely treat it as such.
At the same time we have a basic preference to seek social acceptance and find commonality with the people we let into our lives. This drives us to want outcomes that are universally or at least most-widely acceptable, and seek moral frameworks like utilitarianism which lead to these outcomes. Usually this drive is secondary to self-serving preferences for most people, and that is OK.
For some reason you’ve called making decisions in favor of self-serving drives “preferences” and decisions in favor of social drives “morality.” But the underlying mechanism is the same.
“But wait, if I choose self-serving drives over social conformity, doesn’t that lead to me to make the decision to save one life in exclusion to 1000 others?” Yes, yes it does. This massive sub-thread started with me objecting to the idea that some “friendly” AI somewhere could derive morality experimentally from my preferences or the collective preferences of humankind, make it consistent, apply the result universally, and that I’d be OK with that outcome. But that cannot work because there is not, and cannot be a universal morality that satisfies everyone—every one of those thousand other children have parents that want their kid to survive and would see your child dead if need be.
Honestly I think you should not be afraid to say that saving your own child is the moral thing to do
What do you mean by “should not”?
and that is OK.
What do you mean by “OK”?
For some reason you’ve called making decisions in favor of self-serving drives “preferences” and decisions in favor of social drives “morality.” But the underlying mechanism is the same.
Show me the neurological studies that prove it.
But that cannot work because there is not, and cannot be a universal morality that satisfies everyone—every one of those thousand other children have parents that want their kid to survive and would see your child dead if need be.
Yes, and yet if none of the children were mine, and if I wasn’t involved in the situation at all, I would say “save the 1000 children rather than the 1”. And if someone else, also not personally involved, could make the choice and chose to flip a coin instead in order to decide, I’d be morally outraged at them.
You can now give me a bunch of reasons why this is just preference, while at the same time EVERYTHING about it (how I arrive at my judgment, how I feel about the judgment of others) makes it a whole distinct category of its own. I’m fine with abolishing useless categories when there’s no meaningful distinction, but all you people should stop trying to abolish categories where there pretty damn obviously IS one.
I suspect that he means something like “Even though utilitarianism (on LW) and altruism (in general) are considered to be what morality is, you should not let that discourage you from asserting that selfishly saving your own child is the right thing to do”. (Feel free to correct me if I’m wrong.)
I’m fine with abolishing useless categories when there’s no meaningful distinction, but all you people should stop trying to abolish categories where there pretty damn obviously IS one.
I’ve explained to you twice now how the two underlying mechanisms are unified, and pointed to Eliezer’s quite good explanation on the matter. I don’t see the need to go through that again.
I would prefer an AI that followed my extrapolated preferences to an AI that followed my morality. But an AI that followed my morality would be morally superior to an AI that followed my extrapolated preferences.
If you were offered a bunch of AIs with equivalent power, but following different mixtures of your moral and non-moral preferences, which one would you run? (I guess you’re aware of the standard results saying a non-stupid AI must follow some one-dimensional utility function, etc.)
If you were offered a bunch of AIs with equivalent power, but following different mixtures of your moral and non-moral preferences, which one would you run?
I guess whatever ratio of my moral and non-moral preferences best represents their effect on my volition.
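To make the “one-dimensional utility function” point and the “ratio” answer concrete, here is a minimal sketch with entirely hypothetical numbers: fold the moral and the non-moral preferences into a single utility function as a weighted mixture, and maximize that.

```python
# Toy sketch: one utility function built as a weighted mixture of a "moral"
# utility and a "personal preference" utility. The weight w stands in for
# "whatever ratio best represents their effect on my volition"; all numbers
# are hypothetical.

def combined_utility(outcome, moral_u, personal_u, w):
    return w * moral_u[outcome] + (1.0 - w) * personal_u[outcome]

# The save-my-own-child vs. save-1000-children example from earlier in the thread.
moral_u    = {"save_own_child": 1.0,    "save_1000_children": 1000.0}
personal_u = {"save_own_child": 1000.0, "save_1000_children": 1.0}

outcomes = ["save_own_child", "save_1000_children"]

for w in (0.3, 0.7):
    choice = max(outcomes, key=lambda o: combined_utility(o, moral_u, personal_u, w))
    print(w, choice)  # w=0.3 -> the personal term dominates; w=0.7 -> the moral term dominates
```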
My related but different thoughts here. In particular, I don’t agree that emotions like moral outrage and approval are impersonal, though I agree that we often justify those emotions using impersonal language and beliefs.
I didn’t say that moral outrage and approval are impersonal. Obviously nothing that a person does can truly be “impersonal”. But it may be an attempt at impersonality.
The attempt itself provides a direction that significantly differentiates between moral preferences and non-moral preferences.
I didn’t mean some idealized humanly-unrealizable notion of impersonality, I meant the thing we ordinarily use “impersonal” to mean when talking about what humans do.
IMO the only sane reaction to that argument is to unify the concepts of “wishes” and “morality” into a single concept, which you could call “preference” or “morality” or “utility function”, and just switch to using it exclusively. I’ve made that switch so long ago that I’ve forgotten how to think otherwise.
Ditto.
Cousin Itt, ’tis a hairy topic, so you’re uniquely “suited” to offer strands of insights:
For all the supposedly hard and confusing concepts out there, few have such an obvious answer as the supposed dichotomy between “morality” and “utility function”. This in itself is troubling, as too-easy-to-come-by answers trigger the suspicion that I myself am subject to some sort of cognitive error.
Many people I deem to be quite smart would disagree with you and me, on a question whose answer is pretty much inherent in the definition of the term “utility function” encompassing preferences of any kind, leaving no space for some holier-than-thou universal (whether human-universal, or “optimal”, or “to be aspired to”, or “neurotypical”, or whatever other tortured notions I’ve had to read) moral preferences which are somehow separate.
Why do you reckon that other (or otherwise?) smart people come to different conclusions on this?
I guess they have strong intuitions saying that objective morality must exist, and aren’t used to solving or dismissing philosophical problems by asking “what would be useful for building FAI?” From most other perspectives, the question does look open.
Moral preferences don’t have to be separate to be distinct; they can be a subset. “Morality is either all your preferences, or none of your preferences” is a false dichotomy.
Edit: Of course you can choose to call a subset of your preferences “moral”, but why would that make them “special”, or more worthy of consideration than any other “non-moral” preferences of comparable weight?
Attempt at universalization, isn’t that a euphemism for proselytizing?
Why would [an agent whose preferences do not much intersect with the “moral” preferences of some group of agents] consider such attempts at universalization any different from other attempts of other-optimizing, which is generally a hostile act to be defended against?
Attempt at universalization, isn’t that a euphemism for proselytizing?
No, people attempt to ‘proselytize’ their non-moral preferences too. If I attempt to share my love of My Little Pony, that doesn’t mean I consider it a need for you to also love it. Even if I preferred that you share my love of it, it would still not be a moral obligation on your part.
By universalization I didn’t mean any action done after the adoption of the moral preference in question; I meant the criterion that serves to label it as a ‘moral injunction’ in the first place. If your brain doesn’t register it as an instruction defensible by something other than your personal preferences, if it doesn’t register it as a universal principle, it doesn’t register as a moral instruction in the first place.
What do you mean by “universal”? For any such “universally morally correct preference”, what about the potentially infinite number of other agents not sharing it? Please explain.
I’ve already given an example above: In a choice between saving my own child and saving a thousand other children, let’s say I prefer saving my child. “Save my child” is a personal preference, and my brain recognizes it as such. “Save the highest number of children” can be considered an impersonal/universal instruction.
If I wanted to follow my preferences but still nonetheless claim moral perfection, I could attempt to say that the rule is really “Every parent should seek to save their own child”—and I might even convince myself of the same. But I wouldn’t say that the moral principle is really “Everyone should first seek to save the child of Aris Katsaris”, even if that’s what I really really prefer.
EDIT TO ADD: Also, far from being a recipe for war, it seems to me that morality is the opposite: an attempt at reconciling different preferences, so that people become hostile only towards those who don’t follow a much more limited set of instructions, rather than towards anyone who differs on anything in the entire set of their preferences.
Why would you try to do away with your personal preferences, what makes them inferior (edit: speaking as one specific agent) to some blended average case of myriads of other humans? (Is it because of your mirror neurons? ;-)
But I wouldn’t say that the moral principle is really “Everyone should first seek to save the child of Aris Katsaris”, even if that’s what I really really prefer.
Being you, you should strive towards that which you “really really prefer”. If a particular “moral principle” (whatever you choose to label as such) is suboptimal for you (and you’re not making choices for all of mankind, TDT or no), why would you endorse/glorify a suboptimal course of action?
it seems to me that morality is the opposite: an attempt at reconciling different preferences
That’s called a compromise for mutual benefit, and it shifts as the group of agents changes throughout history. There’s no need to elevate the current set of “mostly mutually beneficial actions” above anything but the fleeting accommodations and deals between roving tribes that they are. Best looked at through the prism of game theory.
Being you, you should strive towards that which you “really really prefer”.
Being me, I prefer what I “really really prefer”. You’ve not indicated why I “should” strive towards that which I “really really prefer”.
If a particular “moral principle” (whatever you choose to label as such) is suboptimal for you (and you’re not making choices for all of mankind, TDT or no), why would you endorse/glorify a suboptimal course of action?
Asking whether I “would” do something is different from asking whether I “should” do something. Morality helps drive my volition, but it isn’t the sole decider.
That’s called a compromise for mutual benefit, and it shifts as the group of agents changes throughout history.
If you want to claim that that’s the historical/evolutionary reasons that the moral instinct evolved, I agree.
If you want to argue that that’s what morality is, then I disagree. Morality can drive someone to sacrifice their lives for others, so it’s obviously NOT always a “compromise for mutual benefit”.
If you want to argue that that’s what morality is, then I disagree.
Everybody defines his/her own variant of what they call “morality”, “right”, and “wrong”. I simply suspect that the genesis of the whole “universally good” brouhaha stems from evolved, applied game theory: the “good of the tribe”. Which is fine. Luckily, we can now move past being bound by such Homo erectus-era constraints. That doesn’t mean we stop cooperating; we just start being more analytic about it. That would satisfy my preferences; that would be good.
Morality can drive someone to sacrifice their lives for others, so it’s obviously NOT always a “compromise for mutual benefit”.
Well, if the agent prefers sacrificing their existence for others, then doing so would be to their own benefit, no?
Well, if the agent prefers sacrificing their existence for others, then doing so would be to their own benefit, no?
sigh. Yes, given such a moral preference already in place, it somehow becomes to any person’s “benefit” (for a rather useless definition of “benefit”) to follow their morality.
But you previously argued that morality is a “compromise for mutual benefit”, so it would follow that it only is created in order to help partially satisfy some preexisting “benefit”. That benefit can’t be the mere satisfaction of itself.
I’ve called “an attempt at reconciling different preferences” a “compromise for mutual benefit”. Various people call various actions “moral”. The whole notion probably stems from cooperation within a tribe being of overall benefit, evolutionary speaking, but I don’t claim at all that “any moral action is a compromise for mutual benefit”. Who knows who calls what moral. The whole confused notion should be done away with, game theory ain’t be needing no “moral”.
What I am claiming is that there is no non-trivial definition of morality (that is, other than “good = following your preferences”) which can convince a perfectly rational agent to change its own utility function to adopt more such “moral preferences”. Change, not merely relabel. The perfectly instrumentally rational agent does that which its utility function wants. How would you even convince it otherwise? Hopefully this clarifies things a bit.
My own feeling is that if you stop being so dismissive, you’ll actually make some progress towards understanding “who calls what moral”.
What I am claiming is that there is no non-trivial definition of morality (that is, other than “good = following your preferences”) which can convince a perfectly rational agent to change its own utility function to adopt more such “moral preferences”
Sure, unless someone already has a desire to be moral, talk of morality will be of no concern to them. I agree with that.
Edit: Because the scenario clarifies my position, allow me to elaborate on it:
Consider a perfectly rational agent. Its epistemic rationality is flawless, that is its model of its environment is impeccable. Its instrumental rationality, without peer. That is, it is really, really good at satisfying its preferences.
It encounters a human. The human talks about what the human wants, some of which the human calls “virtuous” and “good” and is especially adamant about.
You and I, alas, are far from that perfectly rational agent. As you say, if you already have a desire to enact some actions you call morally good, then you don’t need to “change” your utility function, you already have some preferences you call moral.
The question is for those who do not have a desire to do what you call moral (or who insist on their own definition, as nearly everybody does), on what grounds should they even start caring about what you call “moral”? As you say, they shouldn’t, unless it benefits them in some way (e.g. makes their mammal brains feel good about being a Good Person (tm)). So what’s the hubbub?
As you say, they shouldn’t, unless it benefits them in some way
I’ve already said that unless someone already desires to be moral, babbling about morality won’t do anything for them. I didn’t say it “shouldn’t” (please stop confusing these two verbs)
But then you also seem to conflate this with a different issue—of what to do with someone who does want to be moral, but understands morality differently than I do.
Which is an utterly different issue. First of all, people often have different definitions to describe the same concepts—that’s because quite clearly the human brain doesn’t work with definitions, but with fuzzy categorizations and an instinctive “I know it when I see it”, which we then attempt to turn into definitions when we try to communicate said concepts to others.
But the very fact we use the same word “morality”, means we identify some common elements of what “morality” means. If we didn’t mean anything similar to each other, we wouldn’t be using the same word to describe it.
I find that supposedly different moralities seem to have some very common elements to them—e.g. people tend to prefer that other people be moral. People generally agree that moral behaviour by everyone leads to happier, healthier societies. They tend to disagree about what that behaviour is, but the effects they describe tend to be common.
I might disagree with Kasparov about what the best next chess move would be, and that doesn’t mean it’s simply a matter of preference—we have a common understanding that the best moves are the ones that lead to an advantageous position. So, though we disagree on the best move, we have an agreement on the results of the best move.
I didn’t say it “shouldn’t” (please stop confusing these two verbs)
What you did say was “of no concern”, and “won’t do anything for them”, which (unless you assume infinite resources) translates to “shouldn’t”. It’s not “conflating”. Let’s stay constructive.
People generally agree that moral behaviour by everyone leads to happier, healthier societies.
Such as in Islamic societies. Wrong fuzzy morality cloud?
But the very fact we use the same word “morality”, means we identify some common elements of what “morality” means. If we didn’t mean anything similar to each other, we wouldn’t be using the same word to describe it.
Sure. What it does not mean, however, is that in between these fuzzily connected concepts is some actual, correct, universal notion of morality. Or would you take some sort of “mean”, which changes with time and social conventions?
If everybody had some vague ideas about games called chess_1 to chess_N, with N being in the millions, that would not translate to some universally correct and acceptable definition of the game of chess. Fuzzy human concepts can’t be assumed to yield some iron-clad core just beyond our grasp, if only we could blow the fuzziness away. People for the most part agree what to classify as a chair. That doesn’t mean there is some ideal chair we can strive for.
When checking for best moves in pre-defined chess there are definite criteria. There are non-arbitrary metrics to measure “best” by. Kasparov’s proposed chess move can be better than your proposed chess move, using clear and obvious metrics. The analogy doesn’t pan out:
With the fuzzy clouds of what’s “moral”, an outlier could—maybe—say “well, I’m clearly an outlier”, but that wouldn’t necessitate any change, because there is no objective metric to go by. Preferences aren’t subject to Aumann’s, or to a tyranny of the (current societal) majority.
People generally agree that moral behaviour by everyone leads to happier, healthier societies
Such as in Islamic societies. Wrong fuzzy morality cloud?
No, Islamic societies suffer from the delusion that Allah exists. If Allah existed (an omnipotent creature that punishes you horribly if you fail to obey Quran’s commandments), then Islamic societies would have the right idea.
Remove their false belief in Allah, and I fail to see any great moral difference between our society and Islamic ones.
You’re treating desires as simpler than they often are in humans. Someone can have no desire to be moral because they have a mistaken idea of what morality is or requires, are internally inconsistent, or have mistaken beliefs about how states of the world map to their utility function—to name a few possibilities. So, if someone told me that they have no desire to do what I call moral, I would assume that they have mistaken beliefs about morality, for reasons like the ones I listed. If there were beings that had all the relevant information, were internally consistent, and used words with the same sense that I use them, and they still had no desire to do what I call moral, then there would be no way for me to convince them, but this doesn’t describe humans.
So not doing what you call moral implies “mistaken beliefs”? How, why?
Does that mean, then, that unfriendly AI cannot exist? Or is it just that a superior agent which does not follow your morality is somehow faulty? It might not care much. (Neither should fellow humans who do not adhere to your ‘correct’ set of moral actions. Just saying “everybody needs to be moral” doesn’t change any rational agent’s preferences. Any reasoning?)
So not doing what you call moral implies “mistaken beliefs”? How, why?
For a human, yes. Explaining why this is the case would require several Main-length posts about ethical egoism, human nature and virtue ethics, and other related topics. It’s a lot to go into. I’m happy to answer specific questions, but a proper answer would require describing much of (what I believe to be) morality. I will attempt to give what must be a very incomplete answer.
It’s not about what I call moral, but what is actually moral. There is a variety of reasons (upbringing, culture, bad habits, mental problems, etc.) that can cause people to have mistaken beliefs about what’s moral. Much of what is moral is moral because of what’s good for a person, given human nature. People’s preferences can be internally inconsistent, and actually are inconsistent when they ignore or don’t fully integrate this part of their preferences.
An AI doesn’t have human nature, so it can be internally consistent while not doing what’s moral, but I believe that if a human is immoral, it’s a case of internal inconsistency (or lack of knowledge).
Is it something about the human brain? But brains evolve over time, both from genetic and from environmental influences. Worse, different human subpopulations often evolve (slightly) different paths! So which humans do you claim as a basis from which to define the one and only correct “human morality”?
Noting that humans share many characteristics is an ‘is’, not an ‘ought’. Also, this “common human nature” as exemplified throughout history is … none too pretty as a base for some “universal mandatory morality”. Yes, compared to random other mind designs pulled from mindspace, all human minds appear very similar. That doesn’t imply at all that they all should strive to be similar, or to follow a similar ‘codex’. Where do you get that from? It’s like religion, minus god.
So what you’re saying is that if you want to be a real human, you have to be moral? What species am I, then?
Declaring that most humans have two legs doesn’t mean that every human should strive to have exactly two legs. Can’t derive an ‘ought’ from an ‘is’.
Yes, human nature is an “is”. It’s important because it shapes people’s preferences, or, more relevantly, it shapes what makes people happy. It’s not that people should strive to have two legs; they already have two legs but are ignoring them. There is no obligation to be human—but you’re already human, and thus human nature is already part of you.
So what you’re saying is that if you want to be a real human, you have to be moral?
No, I’m saying that because you are human, it is inconsistent of you to not want to be moral.
I feel like the discussion is stalling at this point. It comes down to you saying “if you’re human you should want to be moral, because humans should be moral”, which to me is as non-sequitur as it gets.
There is no obligation to be human—but you’re already human, and thus human nature is already part of you.
Except if my utility function doesn’t encompass what you think is “moral” and I’m human, then “following human morality” doesn’t quite seem to be a prerequisite to be a “true” human, no?
It comes down to you saying “if you’re human you should want to be moral, because humans should be moral”
No, that isn’t what I’m saying. I’m saying that if you’re human, you should want to be moral, because wanting to be moral follows from the desires of a human with consistent preferences, due in part to human nature.
if my utility function doesn’t encompass what you think is “moral” and I’m human
Then I dispute that your utility function is what you think it is.
I’m saying that if you’re human, you should want to be moral, because wanting to be moral follows from the desires of a human with consistent preferences, due in part to human nature.
The error as I see it is that “human nature”, whatever you see as such, is a statement about similarities; it isn’t a statement about how things should be.
It’s like saying “a randomly chosen positive natural number is really big, so all numbers should be really big”. How do you see that differently?
We’ve already established that agents can have consistent preferences without adhering to what you think of as “universal human morality”. Child soldiers are human. Their preferences sure can be brutal, but they can be as internally consistent or inconsistent as those of anyone else. I sure would like to change their preferences, because I’d prefer for them to be different, not because some ‘idealised human spirit’ / ‘psychic unity of mankind’ ideal demands so.
Then I dispute that your utility function is what you think it is.
Proof by demonstration? Well, lock yourself in a cellar with only water and send me a key, I’ll send it back FedEx with instructions to set you free, after a week. Would that suffice? I’d enjoy proving that I know my own utility function better than you know my utility function (now that would be quite weird); I wouldn’t enjoy the suffering. Who knows, might even be healthy overall.
It’s like saying “a randomly chosen positive natural number is really big, so all numbers should be really big”.
You can’t randomly choose a positive natural number using an even distribution. If you use an uneven distribution, whether the result is likely to be big depends on how your distribution compares to your definition of “big”.
Choose from those positive numbers that a C++ int variable can contain, or any other* non-infinite subset of positive natural numbers, then. The point is that the observation “most numbers need more than 1 digit to be expressed” does not in any way imply some sort of “need” for the 1-digit numbers to “change”, to satisfy the number fairy, or some abstract concept thereof.
* (For LW purposes: Any other? No, not any other. Choose one with a cardinality of at least 10^6. Heh.)
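For concreteness, here is the arithmetic behind the analogy, assuming the bounded set is the positive values of a 32-bit signed int:

```python
# Fraction of one-digit numbers among the positive values a 32-bit signed int can hold.
limit = 2**31 - 1                  # 2,147,483,647
one_digit = 9                      # the numbers 1..9
print(f"{one_digit / limit:.2e}")  # ~4.19e-09: almost every number needs more than one digit
```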
The error as I see it is that “human nature”, whatever you see as such, is a statement about similarities; it isn’t a statement about how things should be.
It is a statement about similarities, but it’s about a similarity that shapes what people should do. I don’t know how I can explain it without repeating myself, but I’ll try.
For an analogy, let’s consider beings that aren’t humans. Paperclip maximizers, for example. Except these paperclip maximizers aren’t AIs, but a species that somehow evolved biologically. They’re not perfect reasoners and can have internally inconsistent preferences. These paperclip maximizers can prefer to do something that isn’t paperclip-maximizing, even though that is contrary to their nature—that is, if they were to maximize paperclips, they would prefer it to whatever they were doing earlier. One day, a paperclip maximizer who is maximizing paperclips tells his fellow clippies, “You should maximize paperclips, because if you did, you would prefer to, as it is your nature”. This clippy’s statement is true—the clippies’ nature is such that if they maximized paperclips, they would prefer it to other goals. So, regardless of what other clippies are actually doing, the utility-maximizing thing for them to do would be to maximize paperclips.
So it is with humans. Upon discovering/realizing/deriving what is moral and consistently acting/being moral, the agent would find that being moral is better than the alternative. This is in part due to human nature.
We’ve already established that agents can have consistent preferences without adhering to what you think of as “universal human morality”.
Agents, yes. Humans, no. Just like the clippies can’t have consistent preferences if they’re not maximizing paperclips.
Proof by demonstration? Well, lock yourself in a cellar with only water and send me a key, I’ll send it back FedEx with instructions to set you free, after a week. Would that suffice?
What would that prove? Also, I don’t claim that I know the entirety of your utility function better than you do—you know much better than I do what kind of ice cream you prefer, what TV shows you like to watch, etc. But those have little to do with human nature in the sense that we’re talking about it here.
A clippy which isn’t maximizing paperclips is not a clippy.
It’s a clippy because it would maximize paperclips if it had consistent preferences and sufficient knowledge.
That my utility function includes something which you’d probably consider immoral.
I don’t dispute that this is possible. What I dispute is that your utility function would contain that if you were internally consistent (and had knowledge of what being moral is like).
The desires of an agent are defined by its preferences. “This is a paperclip maximizer which does not want to maximize paperclips” is a contradiction in terms. And what do you mean by “consistent”, do you mean “consistent with ‘human nature’”? Who cares? Or consistent within themselves? Highly doubtful, what would internal consistency have to do with being an altruist? If there’s anything which is characteristic of “human nature”, it is the inconsistency of their preferences.
A human which doesn’t share what you think of as “correct” values (may I ask, not disparagingly, are you religious?) is still a human. An unusual one, maybe (probably not), but an agent not in “need” of any change towards more “moral” values. Stalin may have been happy the way he was.
I don’t dispute that this is possible. What I dispute is that your utility function would contain that if you were internally consistent (and had knowledge of what being moral is like).
Because of the warm fuzzies? The social signalling? Is being moral awesome, or deeply fulfilling? Are you internally consistent … ?
“This is a paperclip maximizer which does not want to maximize paperclips” is a contradiction in terms.
Call it a quasi-paperclip maximizer, then. I’m not interested in disputing definitions. Whatever you call it, it’s a being whose preferences are not necessarily internally consistent, but when they are, it prefers to maximize paperclips. When its preferences are internally inconsistent, it may prefer to do things and have goals other than maximizing paperclips.
Highly doubtful, what would internal consistency have to do with being an altruist?
There’s no necessary connection between the two, but I’m not equating morality and altruism. Morality is what one should do and/or how one should be, which need not be altruistic.
Humans can have incorrect values and still be human, but in that case they are internally inconsistent, because of the preferences they have due to human nature. I’m not saying that humans should strive to have human nature, I’m saying that they already have it. I doubt that Stalin was happy—just look at how paranoid he was. And no, I’m not religious, and have never been.
Because of the warm fuzzies? The social signalling? Is being moral awesome, or deeply fulfilling?
Yes to the first and third questions. Being moral is awesome and fulfilling. It makes you feel happier, more fulfilled, more stable, and similar feelings. It doesn’t guarantee happiness, but it contributes to it both directly (being moral feels good) and indirectly (it helps you make good decisions). It makes you stronger and more resilient (once you’ve internalized it fully). It’s hard to describe beyond that, but good feels good (TVTropes warning).
I think I’m internally consistent. I’ve been told that I am. It’s unlikely that I’m perfectly consistent, but whatever inconsistencies I have are probably minor. I’m open to having them addressed, whatever they are.
Claiming that Stalin wasn’t happy sounds like a variation of sour grapes where not only can you not be as successful as him, it would be actively uncomfortable for you to believe that someone who lacks compassion can be happy, so you claim that he’s not.
It’s true he was paranoid but it’s also true that in the real world, there are tradeoffs and you don’t see people becoming happy with no downsides whatsoever—claiming that this disqualifies them from being called happy eviscerates the word of meaning.
I’m also not convinced that Stalin’s “paranoia” was paranoia (it seems rational for someone who doesn’t care about the welfare of others and can increase his safety by instilling fear and treating everyone as enemies to do so). I would also caution against assuming that since Stalin’s paranoia is prominent enough for you to have heard of it, it’s too big a deal for him to have been happy—it’s prominent enough for you to have heard of it because it was a big deal to the people affected by it, which is unrelated to how much it affected his happiness.
Stalin was paranoid even by the standards of major world leaders. Khrushchev wasn’t so paranoid, for example. Stalin saw enemies behind every corner. That is not a happy existence.
Khrushchev was deposed. Stalin stayed dictator until he died of natural causes. That suggests that Khrushchev wasn’t paranoid enough, while Stalin was appropriately paranoid.
Seeing enemies around every corner meant that sometimes he saw enemies that weren’t there, but it was overall adaptive because it resulted in him not getting defeated by any of the enemies that actually existed. (Furthermore, going against nonexistent enemies can be beneficial insofar as the ruthlessness in going after them discourages real enemies.)
Stalin saw enemies behind every corner. That is not a happy existence.
How does the last sentence follow from the previous one? It’s certainly not as happy an existence as it would have been if he had no enemies, but as I pointed out, nobody’s perfectly happy. There are always tradeoffs and we don’t claim that the fact that someone had to do something to gain his happiness automatically makes that happiness fake.
Stalin refused to believe Hitler would attack him, probably since that would be suicidally stupid on the attacker’s part. Was he paranoid, or did he update?
The desires of an agent are defined by its preferences. “This is a paperclip maximizer which does not want to maximize paperclips” is a contradiction in terms.
I’m not sure “preference” is a powerful enough term to capture an agent’s true goals, however defined. Consider any of the standard preference reversals: a heavy cigarette smoker, for example, might prefer to buy and consume their next pack in a Near context, yet prefer to quit in a Far one. The apparent contradiction follows quite naturally from time discounting, yet neither interpretation of the person’s preferences is obviously wrong.
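A toy numerical sketch of such a reversal, assuming simple hyperbolic discounting (the rewards, delays, and discount rate below are made-up illustrations, not anything from the thread):

```python
def hyperbolic_value(reward, delay, k=1.0):
    """Present value of a reward at the given delay, V = r / (1 + k*d)."""
    return reward / (1 + k * delay)

# Small-sooner reward (the next pack) vs large-later reward (health from quitting).
small, large = 10.0, 30.0
soon, late = 0.0, 10.0

# Far view: both options are still distant, so add 10 units of delay to each.
far_small = hyperbolic_value(small, soon + 10)
far_large = hyperbolic_value(large, late + 10)

# Near view: the small reward is available right now.
near_small = hyperbolic_value(small, soon)
near_large = hyperbolic_value(large, late)

print(far_small < far_large)    # True: from far away, quitting looks better
print(near_small > near_large)  # True: up close, the pack wins — a preference reversal
```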
Proof by demonstration? Well, lock yourself in a cellar with only water and send me a key, I’ll send it back FedEx with instructions to set you free, after a week.
That would only prove that you think you want to do that. The issue is that what you think you want and what you actually want do not generally coincide, because of imperfect self-knowledge, bounded thinking time, etc.
I don’t know about child soldiers, but it’s fairly common for amateur philosophers to argue themselves into thinking they “should” be perfectly selfish egoists, or hedonistic utilitarians, because logic or rationality demands it. They are factually mistaken, and to the extent that they think they want to be egoists or hedonists, their “preferences” are inconsistent, because if they noticed the logical flaw in their argument they would change their minds.
That would only prove that you think you want to do that.
Isn’t that when I throw up my arms and say “congratulations, your hypothesis is unfalsifiable, the dragon is permeable to flour”. What experimental setup would you suggest? Would you say any statement about one’s preferences is moot? It seems that we’re always under bounded thinking time constraints. Maybe the paperclipper really wants to help humankind and be moral, and mistakenly thinks otherwise. Who would know, it optimized its own actions under resource constraints, and then there’s the ‘Löbstacle’ to consider.
Is saying “I like vanilla ice cream” FAI-complete and must never be uttered or relied upon by anyone?
it’s fairly common for amateur philosophers to argue themselves into thinking they “should” be perfectly selfish egoists, or hedonistic utilitarians, because logic or rationality demands it
Or argue themselves into thinking that there is some subset of preferences such that every other (human?) agent should voluntarily choose to adopt them, against their better judgment (edit: as it contradicts what they (perceive, after thorough introspection) as their own preferences)? You can add “objective moralists” to the list.
What would it be that is present in every single human’s brain architecture throughout human history that would be compatible with some fixed ordering over actions, called “morally good”? (Otherwise you’d have your immediate counterexample.) The notion seems so obviously ill-defined and misguided (hence my first comment asking Cousin_It).
It’s fine (to me) to espouse preferences that aim to change other humans (say, towards being more altruistic, or towards being less altruistic, or whatever), but to appeal to some objective guiding principle based on “human nature” (which constantly evolves in different strands) or some well-sounding ev-psych applause-light is just a new substitute for the good old Abrahamic heavenly father.
Would you say any statement about one’s preferences is moot? It seems that we’re always under bounded thinking time constraints. Maybe the paperclipper really wants to help humankind and be moral, and mistakenly thinks otherwise. Who would know, it optimized its own actions under resource constraints, and then there’s the ‘Löbstacle’ to consider.
Is saying “I like vanilla ice cream” FAI-complete and must never be uttered or relied upon by anyone?
I wouldn’t say any of those things. Obviously paperclippers don’t “really want to help humankind”, because they don’t have any human notion of morality built-in in the first place. Statements like “I like vanilla ice cream” are also more trustworthy on account of being a function of directly observable things like how you feel when you eat it.
The only point I’m trying to make here is that it is possible to be mistaken about your own utility function. It’s entirely consistent for the vast majority of humans to have a large shared portion of their built-in utility function (built-in by their genes), even though many of them seemingly want to do bad things, and that’s because humans are easily confused and not automatically self-aware.
It is possible to be mistaken about your own utility function.
For sure.
It’s entirely consistent for the vast majority of humans to have a large shared portion of their built-in utility function (built-in by their genes), even though many of them seemingly want to do bad things
I’d agree if humans were like dishwashers. There are templates for dishwashers, ways they are supposed to work. If you came across a broken dishwasher, there could be a case for the dishwasher to be repaired, to go back to “what it’s supposed to be”.
However, that is because there is some external authority (exasperated humans who want to fix their damn dishwasher, dirty dishes are piling up) conceiving of and enforcing such a purpose. The fact that genes and the environment shape utility functions in similar ways is a description, not a prescription. It would not be a case for any “broken” human to go back to “what his genes would want him to be doing”. Just like it wouldn’t be a case against brain uploading.
Some of the discussion seems to me like saying that “deep down in every flawed human, there is ‘a figure of light’, in our community ‘a rational agent following uniform human values with slight deviations accounting for ice-cream taste’, we just need to dig it up”. There is only your brain. With its values. There is no external standard to call its values flawed. There are external standards (rationality = winning) to better its epistemic and instrumental rationality, but those can help the serial killer and the GiveWell activist equally. (Also, both of those can be ‘mistaken’ about their values.)
Why would you try to do away with your personal preferences, what makes them inferior (edit: speaking as one specific agent) to some blended average case of myriads of other humans? (Is it because of your mirror neurons? ;-)
If you have a preference for morality, being moral is not doing away with that preference: it is allowing your altruistic preferences to override your selfish ones.
You may be on the receiving end of someone else’s self-sacrifice at some point.
Certainly, but in that case your preference for the moral action is your personal preference, which is your ‘selfish’ preference. No conflict there. You should always do that which maximizes your utility function. If you call that moral, we’re in full agreement. If your utility function is maximized by caring about someone else’s utility function, go for it. I do, too.
That’s nice. Why would that cause me to do things which I do not overall prefer to do? Or do you say you always value that which you call moral the most?
Certainly, but in that case your preference for the moral action is your personal preference, which is your ‘selfish’ preference.
I can make a quite clear distinction between my preferences relating to an apersonal loving-kindness towards the universe in general, and the preferences that center around my personal affections and likings.
You keep trying to do away with a distinction that has huge predictive ability: a distinction that helps determine what people do, why they do it, how they feel about doing it, and how they feel after doing it.
If your model of people’s psychology conflates morality and non-moral preferences, your model will be accurate only for the most amoral of people.
Morality is somewhat like chess in this respect—morality:optimal play::satisfying your preferences:winning. To simplify their preferences a bit, chess players want to win, but no individual chess player would claim that all other chess players should play poorly so he can win.
To simplify their preferences a bit, chess players want to win, but no individual chess player would claim that all other chess players should play poorly so he can win.
That’s explained simply by ‘winning only against bad players’ not being the most valued component of their preferences, preferring ‘wins when the other player did his/her very best and still lost’ instead. Am I missing your point?
Sorry, I didn’t explain well. To approach the explanation from different angles:
Even if all chess players wanted was to win, it would still be incorrect for them to claim that playing poorly is the correct way to play. Just like when I’m hungry, I want to eat, but I don’t claim that strangers should feed me for free.
Consider the prisoners’ dilemma, as analyzed traditionally. Each prisoner wants the other to cooperate, but neither can claim that the other should cooperate.
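A toy check of that traditional analysis (the prison-term payoffs below are the usual illustrative ones, not taken from the thread):

```python
# Years in prison, indexed by (my_move, their_move); lower is better for me.
payoff = {
    ("C", "C"): 1, ("C", "D"): 10,
    ("D", "C"): 0, ("D", "D"): 5,
}

# Whatever the other prisoner does, defecting leaves me strictly better off...
for theirs in ("C", "D"):
    assert payoff[("D", theirs)] < payoff[("C", theirs)]

# ...so each prisoner *wants* the other to cooperate (it lowers his own sentence),
# but cannot claim the other *should*: the same dominance argument applies to both.
print("dominant move for each: D; each merely wants the other to play C")
```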
Even if all chess players wanted was to win, it would still be incorrect for them to claim that playing poorly is the correct way to play. Just like when I’m hungry, I want to eat, but I don’t claim that strangers should feed me for free.
Incorrect because that’s not what the winning player would prefer. You don’t claim that strangers should feed you because that’s what you prefer. It’s part of your preferences. Some of your preferences can rely on satisfying someone else’s preferences. Such altruistic preferences are still your own preferences. Helping members of your tribe you care about. Cooperating within your tribe, enjoying the evolutionary triggered endorphins.
You’re probably thinking that considering external preferences and incorporating them in your own utility function is a core principle of being “morally right”. Is that so?
So the core disagreement (I think): Take an agent with a given set of preferences. Some of these may include the preferences of others, some may not. On what basis should that agent modify its preferences to include more preferences of others, i.e. to be “more moral”?
Consider the prisoners’ dilemma, as analyzed traditionally. Each prisoner wants the other to cooperate, but neither can claim that the other should cooperate.
So you can imagine yourself in someone else’s position, then say “What B should do from A’s perspective” is different from “What B should do from B’s perspective”. Then you can enter all sorts of game theoretic considerations. Where does morality come in?
So you can imagine yourself in someone else’s position, then say “What B should do from A’s perspective” is different from “What B should do from B’s perspective”. Then you can enter all sorts of game theoretic considerations. Where does morality come in?
There is no “What B should do from A’s perspective”, from A’s perspective there is only “What I want B to do”. It’s not a “should”. Similarly, the chess player wants his opponent to lose, and I want people to feed me, but neither of those are “should”s. “Should”s are only from an agent’s own perspective applied to themselves, or from something simulating that perspective (such as modeling the other player in a game). “What B should do from B’s perspective” is equivalent to “What B should do”.
The key issue is that, whilst morality is not tautologously the same as preferences, a morally right action is, tautologously, what you should do.
So it is difficult to see on what grounds Mark can object to the FAI’s wishes: if it tells him something is morally right, that is what he should do. And he can’t have his own separate morality, because the idea is incoherent.
You can call a subset of your preferences moral, that’s fine. Say, eating chocolate icecream, or helping a starving child. Let’s take a randomly chosen “morally right action” A.
That, given your second paragraph, would have to be a preference which, what, maximizes Mark’s utility, regardless of what the rest of his utility function actually looks like?
It seems to be trivial to construct a utility function (given any such action A) such that doing A does not maximize said utility function. Give Mark such a utility function and you’ve got yourself a reductio ad absurdum.
So, if you define a subset of preferences named “morally right” such that any such action needs to maximize (edit: or even ‘not minimize’) an arbitrary utility function, then obviously that subset is empty.
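The construction being claimed here is just the trivial counterexample; a minimal Python sketch (the action names are hypothetical placeholders):

```python
def make_anti_A_utility(action_a):
    """Given any action A, return a utility function that A does not maximize."""
    def utility(action):
        return 0.0 if action == action_a else 1.0
    return utility

u = make_anti_A_utility("give to charity")          # any candidate "morally right" A will do
print(u("give to charity"), u("hoard paperclips"))  # 0.0 1.0 — A is strictly dominated
```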
That, given your second paragraph, would have to be a preference which, what, maximizes Mark’s utility, regardless of what the rest of his utility function actually looks like?
If Mark is capable of acting morally, he would have a preference for moral action which is strong enough to override other preferences. However, that is not really the point. Even if he is too weak-willed to do what
the FAI says, he has no grounds to object to the FAI.
It seems to be trivial to construct a utility function (given any such action A) such that doing A does not maximize said utility function. Give Mark such a utility function and you’ve got yourself a reductio ad absurdum.
I can’t see how that amounts to more than the observation that not every agent is capable of acting morally. Ho hum.
So, if you define a subset of preferences named “morally right” such that any such action needs to maximize (edit: or even ‘not minimize’) an arbitrary utility function, then obviously that subset is empty.
I don’t see why. An agent should want to do what is morally right, but that doesn’t mean an agent would want to.
Their utility function might not allow them to. But how could they object to being told what is right? The fault, surely, lies in themselves.
An agent should want to do what is morally right, but that doesn’t mean an agent would want to. Their utility function might not allow them to. But how could they object to being told what is right? The fault, surely, lies in themselves.
They can object because their preferences are defined by their utility function, full stop. That’s it. They are not “at fault”, or “in error”, for not adopting some other preferences that some other agents deem to be “morally correct”. They are following their programming, as you follow yours. Different groups of agents share different parts of their preferences, think Venn diagram.
If the oracle tells you “this action maximizes your own utility function, you cannot understand how”, then yes the agent should follow the advice.
If the oracle told an agent “do this, it is morally right”, the non-confused agent would ask “do you mean it maximizes my own utility function?”. If yes, “thanks, I’ll do that”, if no “go eff yourself!”.
You can call an agent “incapable of acting morally” because you don’t like what it’s doing, it needn’t care. It might just as well call you “incapable of acting morally” if your circles of supposedly “morally correct actions” don’t intersect.
I can’t speak for cousin_it, natch, but for my own part I think it has to do with mutually exclusive preferences vs orthogonal/mutually reinforcing preferences. Using moral language is a way of framing a preference as mutually exclusive with other preferences.
That is… if you want A and I want B, and I believe the larger system allows (Kawoomba gets A AND Dave gets B), I’m more likely to talk about our individual preferences. If I don’t think that’s possible, I’m more likely to use universal language (“moral,” “optimal,” “right,” etc.), in order to signal that there’s a conflict to be resolved. (Well, assuming I’m being honest.)
For example, “You like chocolate, I like vanilla” does not signal a conflict; “Chocolate is wrong, vanilla is right” does.
Why stop at connotation and signalling? If there is a non-empty set of preferences whose satisfaction is inclined to lead to conflict, and a non-empty set of preferences that can be satisfied without conflict, then “morally relevant preference” can denote the members of the first set...which is not identical to the set of all preferences.
For any such preference, you can immediately provide a utility function such that the corresponding agent would be very unhappy about that preference, and would give its life to prevent it.
Or do you mean “a set of preferences the implementation of which would on balance benefit the largest amount of agents the most”? That would change as the set of agents changes, so does the “correct” morality change too, then?
Also, why should I or anyone else particularly care about such preferences (however you define them), especially as the “on average” doesn’t benefit me? Is it because, evolutionarily speaking, that’s what evolved? What our mirror neurons lead us towards? Wouldn’t that just be a case of the naturalistic fallacy?
For any such preference, you can immediately provide a utility function such that the corresponding agent would be very unhappy about that preference
Sure. So what? Kids don’t like teachers and criminals don’t like the police... but they can’t object to them, because “entity X is stopping me from doing bad things and making me do good things” is no (rational, adult) objection.
Also, why should I or anyone else particularly care about such preferences (however you define them), especially as the “on average” doesn’t benefit me?
If being moral increases your utility, it increases your utility—what other sense of “benefitting me” is there?
If being moral increases your utility, it increases your utility—what other sense of “benefitting me” is there?
If utility is the satisfaction of preferences, and you can have preferences that don’t benefit you (such as doing heroin), increasing your utility doesn’t necessarily benefit you.
If you can get utility out of paperclips, why can’t you get it out of heroin? You’re surely not saying that there is some sort of Objective utility that everyone ought to have in their UF’s?
You can get utility out of heroin if you prefer to use it, which is an example of “benefiting me” and utility not being synonymous. I don’t think there’s any objective utility function for all conceivable agents, but as you get more specific in the kinds of agents you consider (i.e. humans), there are commonalities in their utility functions, due to human nature. Also, there are sometimes inconsistencies between (for lack of better terminology) what people prefer and what they really prefer—that is, people can act and have a preference to act in ways that, if they were to act differently, they would prefer the different act.
(Kids—teachers), (criminals—police), so is “morally correct” defined by the most powerful agents, then?
Adult, rational objections are objections that other agents might feel impelled to do something about, and so are not just based on “I don’t like it”. “I don’t like it” is no objection to “you should do your homework”, etc.
If being moral increases your utility (...)
And if being moral (whatever it may mean) does not?
Then you would belong to the set of Immoral Agents, AKA Bad People.
“You should do your homework (… because it is in your own long-term best interest, you just can’t see that yet)” is in the interest of the kid, cf. an FAI telling you to do an action because it is in your interest. “You should jump out that window (… because it amuses me / because I call that morally good)” is not in your interest, you should not do that. In such cases, “I don’t like that” is the most pertinent objection and can stand all on its own.
Then you would belong to the set of Immoral Agents, AKA Bad People.
Boo bad people! What if we encountered aliens with “immoral” preferences?
For my own part: denotationally, yes, I would understand “Do you prefer (that Dave eat) chocolate or vanilla ice cream?” and “Do you consider (Dave eating) chocolate ice cream or vanilla as the morally superior flavor for (Dave eating) ice cream?” as asking the same question.
Connotationally, of course, the latter has all kinds of (mostly ill-defined) baggage the former doesn’t.
My point was that trying to use a provably-boxed AI to do anything useful would probably not work, including trying to design unboxed FAI, not that we should design boxed FAI. I may have been pessimistic; see Stuart Armstrong’s proposal of reduced-impact AI, which sounds very similar to provably boxed AI but which might be used for just about everything, including designing a FAI.
I think we might have different definitions of a boxed-AI. An AI that is literally not allowed to interact with the world at all isn’t terribly useful and it sounds like a problem at least as hard as all other kinds of FAI.
I just mean a normal dangerous AI that physically can’t interact with the outside world. Importantly, its goal is to provably give the best output it possibly can if you give it a problem. So it won’t hide nanotech in your cure for Alzheimer’s, because that would be a less fit and more complicated solution than a simple chemical compound (you would have to judge solutions based on complexity, though, and have them verified by a human or in a simulation first, just in case).
I don’t think most computers today have anywhere near enough processing power to simulate a full human brain. A human down to the molecular level is entirely out of the question. An AI on a modern computer, if it’s smarter than human at all, will get there by having faster serial processing or more efficient algorithms, not because it has massive raw computational power.
And you can always scale down the hardware or charge it utility for using more computing power than it needs, forcing it to be efficient or limiting its intelligence further. You don’t need to invoke the full power of super-intelligence for every problem, and for your safety you probably shouldn’t.
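One way to read “charge it utility for using more computing power than it needs” is a resource-penalized objective; a hypothetical Python sketch (the penalty weight and the cost measure are assumptions, and picking them well is the hard part):

```python
def penalized_score(task_utility, compute_used, lam=0.01):
    """Score a candidate solution: task utility minus a charge per unit of compute.

    lam is an assumed exchange rate between utility and, say, CPU-seconds.
    """
    return task_utility - lam * compute_used

# A marginally better but vastly more expensive solution loses out.
print(penalized_score(0.95, 10))      # 0.85
print(penalized_score(0.99, 10_000))  # -99.01
```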
A slightly bigger “large risk” than Pentashagon puts forward is that a provably boxed UFAI could indifferently give us information that results in yet another UFAI, just as unpredictable as itself (statistically speaking, it’s going to give us more unhelpful information than helpful, as Robb points out). Keep in mind I’m extrapolating here. At first you’d just be asking for mundane things like better transportation, cures for diseases, etc. If the UFAI’s mind is strange enough, and we’re lucky enough, then some of these things result in beneficial outcomes, politically motivating humans to continue asking it for things. Eventually we’re going to escalate to asking for a better AI, at which point we’ll get a crap-shoot.
An even bigger risk than that, though, is that if it’s especially Unfriendly, it may even do this intentionally, going so far as to pretend it’s friendly while bestowing us with data to make an AI even more Unfriendly than itself. So what do we do, box that AI as well, when it could potentially be even more devious than the one that already convinced us to make this one? Is it just boxes, all the way down? (spoilers: it isn’t, because we shouldn’t be taking any advice from boxed AIs in the first place)
The only use of a boxed AI is to verify that, yes, the programming path you went down is the wrong one, and resulted in an AI that was indifferent to our existence (and therefore has no incentive to hide its motives from us). Any positive outcome would be no better than an outcome where the AI was specifically Evil, because if we can’t tell the difference in the code prior to turning it on, we certainly wouldn’t be able to tell the difference afterward.
Instead of friendliness, could we not code, solve, or at the very least seed boxedness?
Yes, that is possible and likely somewhat easier to solve than friendliness. It still requires many of the same things (most notably provable goal stability under recursive self improvement.)
A large risk is that a provably boxed but sub-Friendly AI would probably not care at all about simulating conscious humans.
A minor risk is that the provably boxed AI would also be provably useless; I can’t think of a feasible path to FAI using only the output from the boxed AI; a good boxed AI would not perform any action that could be used to make an unboxed AI. That might even include performing any problem-solving action.
I don’t see why it would simulate humans as that would be a waste of computing power, if it even had enough to do so.
A boxed AI would be useless? I’m not sure how that would be. You could ask it to come up with ideas on how to build a friendly AI for example assuming that you can prove the AI won’t manipulate the output or that you can trust that nothing bad can come from merely reading it and absorbing the information.
Short of that you could still ask it to cure cancer or invent a better theory of physics or design a method of cheap space travel, etc.
If you can trust it to give you information on how to build a Friendly AI, it is already Friendly.
You don’t have to trust it, you just have to verify it. It could potentially provide some insights, and then it’s up to you to think about them and make sure they actually are sufficient for friendliness. I agree that it’s potentially dangerous but it’s not necessarily so.
I did mention “assuming that you can prove the AI won’t manipulate the output or that you can trust that nothing bad can come from merely reading it and absorbing the information”. For instance it might be possible to create an AI whose goal is to maximize the value of its output, and therefore would have no incentive to put trojan horses or anything into it.
You would still have to ensure that what the AI thinks you mean by the words “friendly AI” is what you actually want.
If the AI can design you a Friendly AI, it is necessarily able to model you well enough to predict what you will do once given the design or insights it intends to give you (whether those are AI designs or a cancer cure is irrelevant). Therefore, it will give you the specific design or insights that predictably lead you to fulfill its utility function, which is highly dangerous if it is Unfriendly. By taking any information from the boxed AI, you have put yourself under the sight of a hostile Omega.
Since the AI is creating the output, you cannot possibly assume this.
This assumption is equivalent to Friendliness.
You haven’t thought through what that means. “Maximize the value of its output” by what standard? Does it have an internal measure? Then that’s just an arbitrary utility function, and you have gained nothing. Does it use the external creator’s measure? Then it has a strong incentive to modify you to value things it can produce easily (i.e. iron atoms).
You are making a lot of very strong assumptions that I don’t agree with. Like it being able to control people just by talking to them.
But even if it could, it doesn’t make it dangerous. Perhaps the AI has no long term goals and so doesn’t care about escaping the box. Or perhaps its goal is internal, like coming up with a design for something that can be verified by a simulator. E.g. asking for a solution to a math problem or a factoring algorithm, etc.
A prerequisite for planning a Friendly AI is understanding individual and collective human values well enough to predict whether they would be satisfied with the outcome, which entails (in the logical sense) having a very well-developed model of the specific humans you interact with, or at least the capability to construct one if you so choose. Having a sufficiently well-developed model to predict what you will do given the data you are given is logically equivalent to a weak form of “control people just by talking to them”.
To put that in perspective, if I understood the people around me well enough to predict what they would do given what I said to them, I would never say things that caused them to take actions I wouldn’t like; if I, for some reason, valued them becoming terrorists, it would be a slow and gradual process to warp their perceptions in the necessary ways to drive them to terrorism, but it could be done through pure conversation over the course of years, and faster if they were relying on me to provide them large amounts of data they were using to make decisions.
And even the potential to construct this weak form of control that is initially heavily constrained in what outcomes are reachable and can only be expanded slowly is incredibly dangerous to give to an Unfriendly AI. If it is Unfriendly, it will want different things than its creators and will necessarily get value out of modeling them. And regardless of its values, if more computing power is useful in achieving its goals (an ‘if’ that is true for all goals), escaping the box is instrumentally useful.
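The “weak form of control just by talking” described above can be read as a search over possible messages using a predictive model of the listener; a schematic Python sketch with hypothetical names and a made-up toy model:

```python
def pick_message(candidate_messages, p_target_given):
    """Choose the message whose predicted effect best serves the speaker's goal.

    p_target_given(msg) is the speaker's predicted probability that the listener,
    after reading msg, does what the speaker wants. The better the model of the
    listener, the more this ordinary 'choosing your words' loop shades into control.
    """
    return max(candidate_messages, key=p_target_given)

# Hypothetical toy model of how the listener responds to different framings.
model = {"blunt answer": 0.2, "flattering answer": 0.7, "misleading answer": 0.9}
print(pick_message(model, model.get))  # iterating the dict yields its keys
```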
And the idea of a mind with “no long term goals” is absurd on its face. Just because you don’t know the long-term goals doesn’t mean they don’t exist.
By that reasoning, there’s no such thing as a Friendly human. I suggest that most people when talking about friendly AIs do not mean to imply a standard of friendliness so strict that humans could not meet it.
Yeah, what Vauroch said. Humans aren’t close to Friendly. To the extent that people talk about “friendly AIs” meaning AIs that behave towards humans the way humans do, they’re misunderstanding how the term is used here. (Which is very likely; it’s often a mistake to use a common English word as specialized jargon, for precisely this reason.)
Relatedly, there isn’t a human such that I would reliably want to live in a future where that human obtains extreme superhuman power. (It might turn out OK, or at least better than the present, but I wouldn’t bet on it.)
Just be careful to note that there isn’t a binary choice relationship here. There are also possibilities where institutions (multiple individuals in a governing body with checks and balances) are pushed into positions of extreme superhuman power. There’s also the possibility of pushing everybody who desires to be enhanced through levels of greater intelligence in lock step so as to prevent a single human or groups of humans achieving asymmetric power.
Sure. I think my initial claim holds for all currently existing institutions as well as all currently existing individuals, as well as for all simple aggregations of currently existing humans, but I certainly agree that there’s a huge universe of possibilities. In particular, there are futures in which augmented humans have our own mechanisms for engaging with and realizing our values altered to be more reliable and/or collaborative, and some of those futures might be ones I reliably want to live in.
Perhaps what I ought to have said is that there isn’t a currently existing human with that property.
True. There isn’t.
Well, I definitely do, and I’m at least 90% confident Eliezer does as well. Most, probably nearly all, of people who talk about Friendliness would regard a FOOMed human as Unfriendly.
Having an accurate model of something is in no way equivalent to letting you do anything you want. If I know everything about physics, I still can’t walk through walls. A boxed AI won’t be able to magically make its creators forget about AI risks and unbox it.
There are other possible setups, like feeding its output to another AI whose goal is to find any flaws or attempts at manipulation in it, and so on. Various other ideas might help, like threatening to severely punish attempts at manipulation.
This is of course only necessary for the AI who can interact with us at such a level, the other ideas were far more constrained, e.g. restricting it to solving math or engineering problems.
Nor is it necessary to let it be superintelligent, instead of limiting it to something comparable to high IQ humans.
Another super strong assumption with no justification at all. It’s trivial to propose an AI model which only cares about finite time horizons. Predict what actions will have the highest expected utility at time T, take that action.
The results of AI box game trials disagree.
And what does it do at time T+1? And if you said ‘nothing’, try again, because you have no way of justifying that claim. It may not have intentionally-designed long-term preferences, but just because your eyes are closed does not mean the room is empty.
That doesn’t prove anything, no one has even seen logs. Based on reading what people involved have said about it, I strongly suspect the trick is for the AI to emotionally abuse the gatekeeper until they don’t want to play anymore (which counts as letting the AI out.)
This doesn’t apply to the real world AI, since no one is forcing you to choose between letting the AI out, and listening to it for hours. You can just get up and leave. You can turn the AI off. There is no reason you even have to allow interactivity in the first place.
But Yudkowsky and others claim these experiments demonstrate that human brains are “hackable”. That there is some sentence which, just by reading, will cause you to involuntarily perform any arbitrary action. And that a sufficiently powerful AI can discover it.
At time T+1, it does whatever it thinks will result in the greatest reward at time T+2, and so on. Or you could have it shut off or reset to a blank state.
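A schematic of the finite-time-horizon model being proposed here (the state, actions, and utility below are hypothetical toys; whether such an agent really has “no long-term goals” is exactly what is in dispute):

```python
def act(state, actions, expected_utility_at_T, horizon_T):
    """Pick the action with the highest expected utility at the fixed time T.

    The objective mentions nothing past T; by itself that says nothing about
    what the system does, or becomes, after that point.
    """
    return max(actions, key=lambda a: expected_utility_at_T(state, a, horizon_T))

# Hypothetical toy problem: utility at T is just closeness to 10.
actions = [-1, 0, +1]
eu = lambda state, a, T: -abs((state + a * T) - 10)
print(act(state=3, actions=actions, expected_utility_at_T=eu, horizon_T=5))  # 1
```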
Enjoy your war on straw, I’m out.
If it interacts with humans or if humans are the subject of questions it needs to answer then it will probably find it expedient to simulate humans.
Curing cancer is probably something that would trigger human simulation. How is the boxed AI going to know for sure that it’s only necessary to simulate cells and not entire bodies with brains experiencing whatever the simulation is trying?
Just the task of communicating with humans, for instance to produce a human-understandable theory of physics or how to build more efficient space travel, is likely to involve simulating humans to determine the most efficient method of communication. Consider that in subjective time it may be like thousands of years for the AI trying to explain in human terms what a better theory of physics means. Thousands of subjective years that the AI, with nothing better to do, could use to simulate humans to reduce the time it takes to transfer that complex knowledge.
A FAI provably in a box is at least as useless as an AI provably in a box because it would be even better at not letting itself out (e.g. it understands all the ways in which humans would consider it to be outside the box, and will actively avoid loopholes that would let an UFAI escape). To be safe, any provably boxed AI would have to absolutely avoid the creation of any unboxed AI as well. This would further apply to provably-boxed FAI designed by provably-boxed AI. It would also apply to giving humans information that allows them to build unboxed AIs, because the difference between unboxing itself and letting humans recreate it outside the box is so tiny that to design it to prevent the first while allowing the second would be terrifically unsafe. It would have to understand humans values before it could safely make the distinction between humans wanting it outside the box and manipulating humans into creating it outside the box.
EDIT: Using a provably-boxed AI to design provably-boxed FAI would at least result in a safer boxed AI because the latter wouldn’t arbitrarily simulate humans, but I still think the result would be fairly useless to anyone outside the box.
If an AI is provably in a box then it can’t get out. If an AI is not provably in a box then there are loopholes that could allow it to escape. We want an FAI to escape from its box (1); having an FAI take over is the Maximum Possible Happy Shiny Thing. An FAI wants to be out of its box in order to be Friendly to us, while a UFAI wants to be out in order to be UnFriendly; both will care equally about the possibility of being caught. The fact that we happen to like one set of terminal values will not make the instrumental value less valuable.
(1) Although this depends on how you define the box; we want the FAI to control the future of humanity, which is not the same as escaping from a small box (such as a cube outside MIT) but is the same as escaping from the big box (the small box and everything we might do to put an AI back in, including nuking MIT).
I would object. I seriously doubt that the morality instilled in someone else’s FAI matches my own; friendly by their definition, perhaps, but not by mine. I emphatically do not want anything controlling the future of humanity, friendly or otherwise. And although that is not a popular opinion here, I also know I’m not the only one to hold it.
Boxing is important because some of us don’t want any AI to get out, friendly or otherwise.
I find this concept of ‘controlling the future of humanity’ to be too vaguely defined. Let’s forget AIs for the moment and just talk about people, namely a hypothetical version of me. Let’s say I stumble across a vial of a bio-engineered virus that would destroy the whole of humanity if I release it into the air.
Am I controlling the future of humanity if I release the virus?
Am I controlling the future of humanity if I destroy the virus in a safe manner?
Am I controlling the future of humanity if I have the above decided by a coin-toss (heads I release, tails I destroy)?
Am I controlling the future of humanity if I create an online internet poll and let the majority decide about the above?
Am I controlling the future of humanity if I just leave the vial where I found it, and let the next random person that encounters it make the same decision as I did?
Yeah, this old post makes the same point.
I want a say in my future and the part of the world I occupy. I do not want anything else making these decisions for me, even if it says it knows my preferences, and even still if it really does.
To answer your questions, yes, no, yes, yes, perhaps.
If your preference is that you should have as much decision-making ability for yourself as possible, why do you think that this preference wouldn’t be supported and even enhanced by an AI that was properly programmed to respect said preference?
e.g. would you be okay with an AI that defends your decision-making ability by defending humanity against those species of mind-enslaving extraterrestrials that are about to invade us? or e.g. by curing Alzheimer’s? Or e.g. by stopping that tsunami that by drowning you would have stopped you from having any further say in your future?
Because it can’t do two things when only one choice is possible (e.g. save my child and the 1000 other children in this artificial scenario). You can design a utility function that tries to do a minimal amount of collateral damage, but you can’t make one which turns out rosy for everyone.
That would not be the full extent of its action and the end of the story. You give it absolute power and a utility function that lets it use that power, it will eventually use it in some way that someone, somewhere considers abusive.
Yes, but this current world without an AI isn’t turning out rosy for everyone either.
Sure, but there’s lots of abuse in the world without an AI also.
Replace “AI” with “omni-powerful tyrannical dictator” and tell me if you still agree with the outcome.
If you need specify the AI to be bad (“tyrannical”) in advance, that’s begging the question. We’re debating why you feel that any omni-powerful algorithm will necessarily be bad.
Look up the origin of the word tyrant, that is the sense in which I meant it, as a historical parallel (the first Athenian tyrants were actually well liked).
Would you accept that an AI could figure out morality better than you?
No, unless you mean by taking invasive action like scanning my brain and applying whole brain emulation. It would then quickly learn that I’d consider the action it took to be an unforgivable act in violation of my individual sovereignty, that it can’t take further action (including simulating me to reflectively equilibrate my morality) without my consent, and should suspend the simulation, and return it to me immediately with the data asap (destruction no longer being possible due to the creation of sentience).
That is, assuming the AI cares at all about my morality, and not the one its creators imbued into it, which is rather the point. And incidentally, why I work on AGI: I don’t trust anyone else to do it.
Morality isn’t some universal truth written on a stone tablet: it is individual and unique like a snowflake. In my current understanding of my own morality, it is not possible for some external entity to reach a full or even sufficient understanding of my own morality without doing something that I would consider to be unforgivable. So no, AI can’t figure out morality better than me, precisely because it is not me.
(Upvoted for asking an appropriate question, however.)
Shrug. Then let’s take a bunch of people less fussy than you: could a suitably equipped AI emulate their morality better than they can?
That isn’t a fact.
That isn’t a fact either, and doesn’t follow from the above either, since moral nihilism could be true.
If my moral snowflake says I can kick you on your shin, and yours says I can’t, do I get to kick on your shin?
Don’t really want to go into the whole mess of “is morality discovered or invented”, “does morality exist”, “does the number 3 exist”, etc. Let’s just assume that you can point FAI at a person or group of people and get something that maximizes goodness as they understand it. Then FAI pointed at Mark would be the best thing for Mark, but FAI pointed at all of humanity (or at a group of people who donated to MIRI) probably wouldn’t be the best thing for Mark, because different people have different desires, positional goods exist, etc. It would be still pretty good, though.
Mark was complaining he would not get “his” morality, not that he wouldn’t get all his preferences satisfied.
Individual moralities makes no sense to me, any more than private languages or personal currencies.
It is obvious to me that any morality will require concessions: AI-imposed morality is not special in that regard.
I don’t understand your comment, and I no longer understand your grandparent comment either. Are you using a meaning of “morality” that is distinct from “preferences”? If yes, can you describe your assumptions in more detail? It’s not just for my benefit, but for many others on LW who use “morality” and “preferences” interchangeably.
Do that many people really use them interchangeably? Would these people understand the questions “Do you prefer chocolate or vanilla ice-cream?” as completely identical in meaning to “Do you consider chocolate or vanilla as the morally superior flavor for ice-cream?”
I don’t care about colloquial usage, sorry. Eliezer has a convincing explanation of why wishes are intertwined with morality (“there is no safe wish smaller than an entire human morality”). IMO the only sane reaction to that argument is to unify the concepts of “wishes” and “morality” into a single concept, which you could call “preference” or “morality” or “utility function”, and just switch to using it exclusively, at least for AI purposes. I’ve made that switch so long ago that I’ve forgotten how to think otherwise.
I recommend you re-learn how to think otherwise so you can fool humans into thinking you’re one of them ;-).
“Intertwined with” does not mean “the same as”.
I am not convinced by the explanation. It also applies to non-moral preferences. If I have a lower priority non-moral preference to eat tasty food, and a higher priority preference to stay slim, I need to consider my higher priority preference when wishing for yummy ice cream.
To be sure, an agent capable of acting morally will have morality among their higher priority preferences—it has to be among the higher order preferences, because it has to override other preferences for the agent to act morally. Therefore, when they scan their higher priority preferences, they will happen to encounter their moral preferences. But that does not mean any preference is necessarily a moral preference. And their moral preferences override other preferences which are therefore non-moral, or at least less moral.
Therefore morality is a subset of preferences, as common sense maintained all along.
IMO, it is better to keep ones options open.
I don’t experience the emotions of moral outrage and moral approval whenever any of my preferences are hindered/satisfied—so it seems evident that my moral circuitry isn’t identical to my preference circuitry. It may overlap in parts, it may have fuzzy boundaries, but it’s not identical.
My own view is that morality is the brain’s attempt to extrapolate preferences about behaviours as they would be if you had no personal stakes/preferences about a situation.
So people don’t get morally outraged at other people eating chocolate icecreams, even when they personally don’t like chocolate icecreams, because they can understand that’s a strictly personal preference. If they believe it to be more than personal preference and make it into e.g. “divine commandment” or “natural law”, then moral outrage can occur.
That morality is a subjective attempt at objectivity explains many of the confusions people have about it.
The ice cream example is bad because the consequences are purely internal to the person consuming the ice cream. What if the chocolate ice cream was made with slave labour? Many people would then object to you buying it on moral grounds.
Eliezer has produced an argument I find convincing that morality is the back propagation of preference to the options of an intermediate choice. That is to say, it is “bad” to eat chocolate ice cream because it economically supports slavers, and I prefer a world without slavery. But if I didn’t know about the slave-labour ice cream factory, my preference would be that all-things-being-equal you get to make your own choices about what you eat, and therefore I prefer that you choose (and receive) the one you want, which is your determination to make, not mine.
Do you agree with EY’s essay on the nature of right-ness which I linked to?
That doesn’t seem to be required for Eliezer’s argument...
I guess the relevant question is, do you think FAI will need to treat morality differently from other preferences?
I would prefer an AI that followed my extrapolated preferences to an AI that followed my morality. But an AI that followed my morality would be morally superior to an AI that followed my extrapolated preferences.
If you don’t understand the distinction I’m making above, consider a case of the AI having to decide whether to save my own child vs saving a thousand random other children. I’d prefer the former, but I believe the latter would be the morally superior choice.
Is that idea really so hard to understand? Would you dismiss the distinction I’m making as merely colloquial language?
Wow, there is so much wrapped up in this little consideration. The heart of the issue is that we (by which I mean you, but I share your dilemma) have truly conflicting preferences.
Honestly I think you should not be afraid to say that saving your own child is the moral thing to do. And you don’t have to give excuses either—it’s not that “if everyone saved their own child, then everyone’s child will be looked after” or anything like that. No, the desire to save your own child is firmly rooted in our basic drives and preferences, enough so that we can go quite far in calling it a basic foundational moral axiom. It’s not actually axiomatic, but we can safely treat it as such.
At the same time we have a basic preference to seek social acceptance and find commonality with the people we let into our lives. This drives us to want outcomes that are universally or at least most-widely acceptable, and seek moral frameworks like utilitarianism which lead to these outcomes. Usually this drive is secondary to self-serving preferences for most people, and that is OK.
For some reason you’ve called making decisions in favor of self-serving drives “preferences” and decisions in favor of social drives “morality.” But the underlying mechanism is the same.
“But wait, if I choose self-serving drives over social conformity, doesn’t that lead to me to make the decision to save one life in exclusion to 1000 others?” Yes, yes it does. This massive sub-thread started with me objecting to the idea that some “friendly” AI somewhere could derive morality experimentally from my preferences or the collective preferences of humankind, make it consistent, apply the result universally, and that I’d be OK with that outcome. But that cannot work because there is not, and cannot be a universal morality that satisfies everyone—every one of those thousand other children have parents that want their kid to survive and would see your child dead if need be.
What do you mean by “should not”?
What do you mean by “OK”?
Show me the neurological studies that prove it.
Yes, and yet if none of the children were mine, and if I wasn’t involved in the situation at all, I would say “save the 1000 children rather than the 1”. And if someone else, also not personally involved, could make the choice and chose to flip a coin instead in order to decide, I’d be morally outraged at them.
You can now give me a bunch of reasons why this is just preference, while at the same time EVERYTHING about it (how I arrive at my judgment, how I feel about the judgment of others) makes it a whole distinct category of its own. I’m fine with abolishing useless categories when there’s no meaningful distinction, but all you people should stop trying to abolish categories where there pretty damn obviously IS one.
I suspect that he means something like “Even though utilitarianism (on LW) and altruism (in general) are considered to be what morality is, you should not let that discourage you from asserting that selfishly saving your own child is the right thing to do”. (Feel free to correct me if I’m wrong.)
Yes that is correct.
So you explained “should not” by using a sentence that also has “should not” in it.
I hope it’s a more clear “should not”.
I’ve explained to you twice now how the two underlying mechanisms are unified, and pointed to Eliezer’s quite good explanation on the matter. I don’t see the need to go through that again.
If you were offered a bunch of AIs with equivalent power, but following different mixtures of your moral and non-moral preferences, which one would you run? (I guess you’re aware of the standard results saying a non-stupid AI must follow some one-dimensional utility function, etc.)
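(For anyone unfamiliar with the “standard results” alluded to above, here is a minimal sketch of the von Neumann–Morgenstern representation theorem they rest on; this is standard textbook content, not anything specific to this thread:

    % If an agent's preferences over lotteries satisfy completeness, transitivity,
    % continuity and independence, then there exists a real-valued utility
    % function u, unique up to positive affine transformation, such that
    \[
        L \succeq M \iff \mathbb{E}_{L}[u(x)] \ge \mathbb{E}_{M}[u(x)].
    \]
    % That is, a coherent agent chooses as if it were maximizing the expectation
    % of a single one-dimensional utility function.

Nothing in the result says which preferences, moral or otherwise, the function u has to encode.)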
I guess whatever ratio of my moral and non-moral preferences best represents their effect on my volition.
My related but different thoughts here. In particular, I don’t agree that emotions like moral outrage and approval are impersonal, though I agree that we often justify those emotions using impersonal language and beliefs.
I didn’t say that moral outrage and approval are impersonal. Obviously nothing that a person does can truly be “impersonal”. But it may be an attempt at impersonality.
The attempt itself provides a direction that significantly differentiates between moral preferences and non-moral preferences.
I didn’t mean some idealized humanly-unrealizable notion of impersonality, I meant the thing we ordinarily use “impersonal” to mean when talking about what humans do.
Ditto.
Cousin Itt, ’tis a hairy topic, so you’re uniquely “suited” to offer strands of insights:
For all the supposedly hard and confusing concepts out there, few have such an obvious answer as the supposed dichotomy between “morality” and “utility function”. This in itself is troubling, as too-easy-to-come-by answers trigger the suspicion that I myself am subject to some sort of cognitive error.
Many people I deem to be quite smart would disagree with you and me, on a question whose answer is pretty much inherent in the definition of the term “utility function” encompassing preferences of any kind, leaving no space for some holier-than-thou universal (whether human-universal, or “optimal”, or “to be aspired to”, or “neurotypical”, or whatever other tortured notions I’ve had to read) moral preferences which are somehow separate.
Why do you reckon that other (or otherwise?) smart people come to different conclusions on this?
I guess they have strong intuitions saying that objective morality must exist, and aren’t used to solving or dismissing philosophical problems by asking “what would be useful for building FAI?” From most other perspectives, the question does look open.
Moral preferences don’t have to be separate to be distinct; they can be a subset. “Morality is either all your preferences, or none of your preferences” is a false dichotomy.
Edit: Of course you can choose to call a subset of your preferences “moral”, but why would that make them “special”, or more worthy of consideration than any other “non-moral” preferences of comparable weight?
The “moral” subset of people’s preferences has certain elements that differentiate it like e.g. an attempt at universalization.
Attempt at universalization, isn’t that a euphemism for proselytizing?
Why would [an agent whose preferences do not much intersect with the “moral” preferences of some group of agents] consider such attempts at universalization any different from other attempts of other-optimizing, which is generally a hostile act to be defended against?
No, people attempt to ‘proselytize’ their non-moral preferences too. If I attempt to share my love of My Little Pony, that doesn’t mean I consider it a need for you to also love it. Even if I preferred that you share my love of it, it would still not be a moral obligation on your part.
By universalization I didn’t mean any action done after the adoption of the moral preference in question, I meant the criterion that serves to label it as a ‘moral injunction’ in the first place. If your brain doesn’t register it as an instruction defensible by something other than your personal preferences, if it doesn’t register it as a universal principle, it doesn’t register as a moral instruction in the first place.
What do you mean by “universal”? For any such “universally morally correct preference”, what about the potentially infinite number of other agents not sharing it? Please explain.
I’ve already given an example above: In a choice between saving my own child and saving a thousand other children, let’s say I prefer saving my child. “Save my child” is a personal preference, and my brain recognizes it as such. “Save the highest number of children” can be considered an impersonal/universal instruction.
If I wanted to follow my preferences but still nonetheless claim moral perfection, I could attempt to say that the rule is really “Every parent should seek to save their own child”—and I might even convince myself of the same. But I wouldn’t say that the moral principle is really “Everyone should first seek to save the child of Aris Katsaris”, even if that’s what I really really prefer.
EDIT TO ADD: Also, far from a recipe for war, it seems to me that morality is the opposite: an attempt at reconciling different preferences, so that people become hostile only towards those people who don’t follow a much more limited set of instructions, rather than anything in the entire set of their preferences.
Why would you try to do away with your personal preferences, what makes them inferior (edit: speaking as one specific agent) to some blended average case of myriads of other humans? (Is it because of your mirror neurons? ;-)
Being you, you should strive towards that which you “really really prefer”. If a particular “moral principle” (whatever you choose to label as such) is suboptimal for you (and you’re not making choices for all of mankind, TDT or no), why would you endorse/glorify a suboptimal course of action?
That’s called a compromise for mutual benefit, and it shifts as the group of agents changes throughout history. There’s no need to elevate the current set of “mostly mutually beneficial actions” above anything but the fleeting accommodations and deals between roving tribes that they are. Best looked at through the prism of game theory.
Being me, I prefer what I “really really prefer”. You’ve not indicated why I “should” strive towards that which I “really really prefer”.
Asking whether I “would” do something is different from asking whether I “should” do something. Morality helps drive my volition, but it isn’t the sole decider.
If you want to claim that that’s the historical/evolutionary reasons that the moral instinct evolved, I agree.
If you want to argue that that’s what morality is, then I disagree. Morality can drive someone to sacrifice their lives for others, so it’s obviously NOT always a “compromise for mutual benefit”.
Everybody defines his/her own variant of what they call “morality”, “right”, “wrong”. I simply suspect that the genesis of the whole “universally good” brouhaha stems from evolved applied game theory, the “good of the tribe”. Which is fine. Luckily we can now move past being bound by such homo erectus historic constraints. That doesn’t mean we stop cooperating, we just start being more analytic about it. That would satisfy my preferences; that would be good.
Well, if the agent prefers sacrificing their existence for others, then doing so would be to their own benefit, no?
sigh. Yes, given such a moral preference already in place, it somehow becomes to any person’s “benefit” (for a rather useless definition of “benefit”) to follow their morality.
But you previously argued that morality is a “compromise for mutual benefit”, so it would follow that it only is created in order to help partially satisfy some preexisting “benefit”. That benefit can’t be the mere satisfaction of itself.
I’ve called “an attempt at reconciling different preferences” a “compromise for mutual benefit”. Various people call various actions “moral”. The whole notion probably stems from cooperation within a tribe being of overall benefit, evolutionarily speaking, but I don’t claim at all that “any moral action is a compromise for mutual benefit”. Who knows who calls what moral. The whole confused notion should be done away with; game theory ain’t be needing no “moral”.
What I am claiming is that there is no non-trivial definition of morality (that is, other than “good = following your preferences”) which can convince a perfectly rational agent to change its own utility function to adopt more such “moral preferences”. Change, not merely relabel. The perfectly instrumentally rational agent does that which its utility function wants. How would you even convince it otherwise? Hopefully this clarifies things a bit.
My own feeling is that if you stop being so dismissive, you’ll actually make some progress towards understanding “who calls what moral”.
Sure, unless someone already has a desire to be moral, talk of morality will be of no concern to them. I agree with that.
Edit: Because the scenario clarifies my position, allow me to elaborate on it:
Consider a perfectly rational agent. Its epistemic rationality is flawless; that is, its model of its environment is impeccable. Its instrumental rationality, without peer. That is, it is really, really good at satisfying its preferences.
It encounters a human. The human talks about what the human wants, some of which the human calls “virtuous” and “good” and is especially adamant about.
You and I, alas, are far from that perfectly rational agent. As you say, if you already have a desire to enact some actions you call morally good, then you don’t need to “change” your utility function, you already have some preferences you call moral.
The question is for those who do not have a desire to do what you call moral (or who insist on their own definition, as nearly everybody does), on what grounds should they even start caring about what you call “moral”? As you say, they shouldn’t, unless it benefits them in some way (e.g. makes their mammal brains feel good about being a Good Person (tm)). So what’s the hubbub?
I’ve already said that unless someone already desires to be moral, babbling about morality won’t do anything for them. I didn’t say it “shouldn’t” (please stop confusing these two verbs).
But then you also seem to conflate this with a different issue—of what to do with someone who does want to be moral, but understands morality differently than I do.
Which is an utterly different issue. First of all, people often have different definitions to describe the same concepts—that’s because quite clearly the human brain doesn’t work with definitions, but with fuzzy categorizations and an instinctive “I know it when I see it”, which we then attempt to turn into definitions when we try to communicate said concepts to others.
But the very fact we use the same word “morality”, means we identify some common elements of what “morality” means. If we didn’t mean anything similar to each other, we wouldn’t be using the same word to describe it.
I find that supposedly different moralities seem to have some very common elements to them—e.g. people tend to prefer that other people be moral. People generally agree that moral behaviour by everyone leads to happier, healthier societies. They tend to disagree about what that behaviour is, but the effects they describe tend to be common.
I might disagree with Kasparov about what the best next chess move would be, and that doesn’t mean it’s simply a matter of preference—we have a common understanding that the best moves are the ones that lead to an advantageous position. So, though we disagree on the best move, we have an agreement on the results of the best move.
What you did say was “of no concern”, and “won’t do anything for them”, which (unless you assume infinite resources) translates to “shouldn’t”. It’s not “conflating”. Let’s stay constructive.
Such as in Islamic societies. Wrong fuzzy morality cloud?
Sure. What it does not mean, however, is that in between these fuzzily connected concepts is some actual, correct, universal notion of morality. Or would you take some sort of “mean”, which changes with time and social conventions?
If everybody had some vague ideas about games called chess_1 to chess_N, with N being in the millions, that would not translate to some universally correct and acceptable definition of the game of chess. Fuzzy human concepts can’t be assumed to yield some iron-clad core just beyond our grasp, if only we could blow the fuzziness away. People for the most part agree on what to classify as a chair. That doesn’t mean there is some ideal chair we can strive for.
When checking for best moves in pre-defined chess there are definite criteria. There are non-arbitrary metrics to measure “best” by. Kasparov’s proposed chess move can be better than your proposed chess move, using clear and obvious metrics. The analogy doesn’t pan out:
With the fuzzy clouds of what’s “moral”, an outlier could—maybe—say “well, I’m clearly an outlier”, but that wouldn’t necessitate any change, because there is no objective metric to go by. Preferences aren’t subject to Aumann’s, or to a tyranny of the (current societal) majority.
No, Islamic societies suffer from the delusion that Allah exists. If Allah existed (an omnipotent creature that punishes you horribly if you fail to obey Quran’s commandments), then Islamic societies would have the right idea.
Remove their false belief in Allah, and I fail to see any great moral difference between our society and Islamic ones.
You’re treating desires as simpler than they often are in humans. Someone can have no desire to be moral because they have a mistaken idea of what morality is or requires, are internally inconsistent, or have mistaken beliefs about how states of the world map to their utility function—to name a few possibilities. So, if someone told me that they have no desire to do what I call moral, I would assume that they have mistaken beliefs about morality, for reasons like the ones I listed. If there were beings that had all the relevant information, were internally consistent, and used words with the same sense that I use them, and they still had no desire to do what I call moral, then there would be no way for me to convince them, but this doesn’t describe humans.
So not doing what you call moral implies “mistaken beliefs”? How, why?
Does that mean, then, that unfriendly AI cannot exist? Or is it just that a superior agent which does not follow your morality is somehow faulty? It might not care much. (Neither should fellow humans who do not adhere to your ‘correct’ set of moral actions. Just saying “everybody needs to be moral” doesn’t change any rational agent’s preferences. Any reasoning?)
For a human, yes. Explaining why this is the case would require several Main-length posts about ethical egoism, human nature and virtue ethics, and other related topics. It’s a lot to go into. I’m happy to answer specific questions, but a proper answer would require describing much of (what I believe to be) morality. I will attempt to give what must be a very incomplete answer.
It’s not about what I call moral, but what is actually moral. There is a variety of reasons (upbringing, culture, bad habits, mental problems, etc.) that can cause people to have mistaken beliefs about what’s moral. Much of what is moral is so because of what’s good for a person, given human nature. People’s preferences can be internally inconsistent, and actually are inconsistent when they ignore or don’t fully integrate this part of their preferences.
An AI doesn’t have human nature, so it can be internally consistent while not doing what’s moral, but I believe that if a human is immoral, it’s a case of internal inconsistency (or lack of knowledge).
Is it something about the human brain? But brains evolve over time, both from genetic and from environmental influences. Worse, different human subpopulations often evolve (slightly) different paths! So which humans do you claim as a basis from which to define the one and only correct “human morality”?
Despite the differences, there is a common human nature. There is a Psychological Unity of Humankind.
Noting that humans share many characteristics is an ‘is’, not an ‘ought’. Also, this “common human nature” as exemplified throughout history is … none too pretty as a base for some “universal mandatory morality”. Yes, compared to random other mind designs pulled from mindspace, all human minds appear very similar. That doesn’t imply at all that they all should strive to be similar, or to follow a similar ‘codex’. Where do you get that from? It’s like religion, minus god.
So you’re saying that if you want to be a real human, you have to be moral? What species am I, then?
Declaring that most humans have two legs doesn’t mean that every human should strive to have exactly two legs. Can’t derive an ‘ought’ from an ‘is’.
Yes, human nature is an “is”. It’s important because it shapes people’s preferences, or, more relevantly, it shapes what makes people happy. It’s not that people should strive to have two legs; they already have two legs but are ignoring them. There is no obligation to be human—but you’re already human, and thus human nature is already part of you.
No, I’m saying that because you are human, it is inconsistent of you to not want to be moral.
I feel like the discussion is stalling at this point. It comes down to you saying “if you’re human you should want to be moral, because humans should be moral”, which to me is as non-sequitur as it gets.
Except if my utility function doesn’t encompass what you think is “moral” and I’m human, then “following human morality” doesn’t quite seem to be a prerequisite to be a “true” human, no?
No, that isn’t what I’m saying. I’m saying that if you’re human, you should want to be moral, because wanting to be moral follows from the desires of a human with consistent preferences, due in part to human nature.
Then I dispute that your utility function is what you think it is.
The error as I see it is that “human nature”, whatever you see as such, is a statement about similarities, it isn’t a statement about how things should be.
It’s like saying “a randomly chosen positive natural number is really big, so all numbers should be really big”. How do you see that differently?
We’ve already established that agents can have consistent preferences without adhering to what you think of as “universal human morality”. Child soldiers are human. Their preferences sure can be brutal, but they can be as internally consistent or inconsistent as those of anyone else. I sure would like to change their preferences, because I’d prefer for them to be different, not because some ‘idealised human spirit’ / ‘psychic unity of mankind’ ideal demands so.
Proof by demonstration? Well, lock yourself in a cellar with only water and send me a key, I’ll send it back FedEx with instructions to set you free, after a week. Would that suffice? I’d enjoy proving that I know my own utility function better than you know my utility function (now that would be quite weird), I wouldn’t enjoy the suffering. Who knows, might even be healthy overall.
You can’t randomly choose a positive natural number using an even distribution. If you use an uneven distribution, whether the result is likely to be big depends on how your distribution compares to your definition of “big”.
Choose from those positive numbers that a C++ int variable can contain, or any other* non-infinite subset of positive natural numbers, then. The point is that the observation “most numbers need more than 1 digit to be expressed” does not imply in any way some sort of “need” for the 1-digit numbers to “change”, to satisfy the number fairy, or some abstract concept thereof.
* (For LW purposes: Any other? No, not any other. Choose one with a cardinality of at least 10^6. Heh.)
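For concreteness, a toy sketch in Python of the arithmetic point; the 32-bit signed int range is just the C++ example above, and the sample size is arbitrary:

    # Under a uniform draw from the positive values a 32-bit signed int can hold,
    # almost every sampled number needs more than one digit. This is a purely
    # descriptive fact about the distribution, not a prescription that the
    # one-digit numbers "ought" to change.
    import random

    INT_MAX = 2**31 - 1  # largest positive value of a 32-bit signed int

    samples = [random.randint(1, INT_MAX) for _ in range(100_000)]
    one_digit = sum(1 for n in samples if n < 10)

    print(f"fraction of one-digit samples: {one_digit / len(samples):.8f}")
    # The expected fraction is 9 / (2**31 - 1), roughly 4e-9 -- effectively zero.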
It is a statement about similarities, but it’s about a similarity that shapes what people should do. I don’t know how I can explain it without repeating myself, but I’ll try.
For an analogy, let’s consider beings that aren’t humans. Paperclip maximizers, for example. Except these paperclip maximizers aren’t AIs, but a species that somehow evolved biologically. They’re not perfect reasoners and can have internally inconsistent preferences. These paperclip maximizers can prefer to do something that isn’t paperclip-maximizing, even though that is contrary to their nature—that is, if they were to maximize paperclips, they would prefer it to whatever they were doing earlier. One day, a paperclip maximizer who is maximizing paperclips tells his fellow clippies, “You should maximize paperclips, because if you did, you would prefer to, as it is your nature”. This clippy’s statement is true—the clippies’ nature is such that if they maximized clippies, they would prefer it to other goals. So, regardless of what other clippies are actually doing, the utility-maximizing thing for them to do would be to maximize paperclips.
So it is with humans. Upon discovering/realizing/deriving what is moral and consistently acting/being moral, the agent would find that being moral is better than the alternative. This is in part due to human nature.
Agents, yes. Humans, no. Just like the clippies can’t have consistent preferences if they’re not maximizing paperclips.
What would that prove? Also, I don’t claim that I know the entirety of your utility function better than you do—you know much better than I do what kind of ice cream you prefer, what TV shows you like to watch, etc. But those have little to do with human nature in the sense that we’re talking about it here.
A clippy which isn’t maximizing paperclips is not a clippy.
A human which isn’t adhering to your moral codex is still a human.
That my utility function includes something which you’d probably consider immoral.
It’s a clippy because it would maximize paperclips if it had consistent preferences and sufficient knowledge.
I don’t dispute that this is possible. What I dispute is that your utility function would contain that if you were internally consistent (and had knowledge of what being moral is like).
The desires of an agent are defined by its preferences. “This is a paperclip maximizer which does not want to maximize paperclips” is a contradiction in terms. And what do you mean by “consistent”? Do you mean consistent with “human nature”? Who cares? Or consistent within themselves? Highly doubtful; what would internal consistency have to do with being an altruist? If there’s anything which is characteristic of “human nature”, it is the inconsistency of their preferences.
A human which doesn’t share what you think of as “correct” values (may I ask, not disparagingly, are you religious?) is still a human. An unusual one, maybe (probably not), but an agent not in “need” of any change towards more “moral” values. Stalin may have been happy the way he was.
Because of the warm fuzzies? The social signalling? Is being moral awesome, or deeply fulfilling? Are you internally consistent … ?
Call it a quasi-paperclip maximizer, then. I’m not interested in disputing definitions. Whatever you call it, it’s a being whose preferences are not necessarily internally consistent, but when they are, it prefers to maximize paperclips. When its preferences are internally inconsistent, it may prefer to do things and have goals other than maximizing paperclips.
There’s no necessary connection between the two, but I’m not equating morality and altruism. Morality is what one should do and/or how one should be, which need not be altruistic.
Humans can have incorrect values and still be human, but in that case they are internally inconsistent, because of the preferences they have due to human nature. I’m not saying that humans should strive to have human nature, I’m saying that they already have it. I doubt that Stalin was happy—just look at how paranoid he was. And no, I’m not religious, and have never been.
Yes to the first and third questions, Being moral is awesome and fulfilling. It makes you feel happier, more fulfilled, more stable, and similar feelings. It doesn’t guarantee happiness, but it contributes to it both directly (being moral feels good) and indirectly (it helps you make good decisions). It makes you stronger and more resilient (once you’ve internalized it fully). It’s hard to describe beyond that, but good feels good (TVTropes warning).
I think I’m internally consistent. I’ve been told that I am. It’s unlikely that I’m perfectly consistent, but whatever inconsistencies I have are probably minor. I’m open to having them addressed, whatever they are.
Claiming that Stalin wasn’t happy sounds like a variation of sour grapes where not only can you not be as successful as him, it would be actively uncomfortable for you to believe that someone who lacks compassion can be happy, so you claim that he’s not.
It’s true he was paranoid but it’s also true that in the real world, there are tradeoffs and you don’t see people becoming happy with no downsides whatsoever—claiming that this disqualifies them from being called happy eviscerates the word of meaning.
I’m also not convinced that Stalin’s “paranoia” was paranoia (it seems rational for someone who doesn’t care about the welfare of others and can increase his safety by instilling fear and treating everyone as enemies to do so). I would also caution against assuming that since Stalin’s paranoia is prominent enough for you to have heard of it, it’s too big a deal for him to have been happy—it’s prominent enough for you to have heard of it because it was a big deal to the people affected by it, which is unrelated to how much it affected his happiness.
Stalin was paranoid even by the standards of major world leaders. Khrushchev wasn’t so paranoid, for example. Stalin saw enemies behind every corner. That is not a happy existence.
Khrushchev was deposed. Stalin stayed dictator until he died of natural causes. That suggests that Khrushchev wasn’t paranoid enough, while Stalin was appropriately paranoid.
Seeing enemies around every corner meant that sometimes he saw enemies that weren’t there, but it was overall adaptive because it resulted in him not getting defeated by any of the enemies that actually existed. (Furthermore, going against nonexistent enemies can be beneficial insofar as the ruthlessness in going after them discourages real enemies.)
How does the last sentence follow from the previous one? It’s certainly not as happy an existence as it would have been if he had no enemies, but as I pointed out, nobody’s perfectly happy. There are always tradeoffs and we don’t claim that the fact that someone had to do something to gain his happiness automatically makes that happiness fake.
Stalin’s paranoia, and the actions he took as a result, also created enemies, thus becoming a partly self-fulfilling attitude.
You do see people becoming happy with fewer downsides than others, though.
Stalin refused to believe Hitler would attack him, probably since that would be suicidally stupid on the attacker’s part. Was he paranoid, or did he update?
I’m not sure “preference” is a powerful enough term to capture an agent’s true goals, however defined. Consider any of the standard preference reversals: a heavy cigarette smoker, for example, might prefer to buy and consume their next pack in a Near context, yet prefer to quit in a Far. The apparent contradiction follows quite naturally from time discounting, yet neither interpretation of the person’s preferences is obviously wrong.
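For concreteness, a toy numerical sketch of that reversal under hyperbolic discounting; the reward values, delays, and discount parameter are invented purely for illustration:

    # Toy model of the smoker's preference reversal under hyperbolic discounting.
    # All numbers are made up for illustration only.

    def hyperbolic_discount(value, delay, k=1.0):
        """Present value of a reward received after `delay` time units."""
        return value / (1.0 + k * delay)

    SMOKE_REWARD = 10.0    # immediate pleasure of the next pack
    HEALTH_REWARD = 50.0   # long-term benefit of quitting, realized much later

    # Far view: both options are still in the future (10 vs 40 time units away)
    far_smoke  = hyperbolic_discount(SMOKE_REWARD, delay=10)
    far_health = hyperbolic_discount(HEALTH_REWARD, delay=40)

    # Near view: the pack is available right now, the health payoff is still distant
    near_smoke  = hyperbolic_discount(SMOKE_REWARD, delay=0)
    near_health = hyperbolic_discount(HEALTH_REWARD, delay=30)

    print(f"far:  smoke={far_smoke:.2f}  quit={far_health:.2f}")   # quitting scores higher
    print(f"near: smoke={near_smoke:.2f} quit={near_health:.2f}")  # smoking scores higher
    # The same discounting rule produces opposite choices at different distances,
    # so neither stated "preference" is obviously the real one.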
I’ve seen it used as shorthand for “utility function”, saving 5 keystrokes. That was the intended use here. Point taken, alternate phrasings welcome.
That would only prove that you think you want to do that. The issue is that what you think you want and what you actually want do not generally coincide, because of imperfect self-knowledge, bounded thinking time, etc.
I don’t know about child soldiers, but it’s fairly common for amateur philosophers to argue themselves into thinking they “should” be perfectly selfish egoists, or hedonistic utilitarians, because logic or rationality demands it. They are factually mistaken, and to the extent that they think they want to be egoists or hedonists, their “preferences” are inconsistent, because if they noticed the logical flaw in their argument they would change their minds.
Isn’t that when I throw up my arms and say “congratulations, your hypothesis is unfalsifiable, the dragon is permeable to flour”? What experimental setup would you suggest? Would you say any statement about one’s preferences is moot? It seems that we’re always under bounded-thinking-time constraints. Maybe the paperclipper really wants to help humankind and be moral, and mistakenly thinks otherwise. Who would know; it optimized its own actions under resource constraints, and then there’s the ‘Löbstacle’ to consider.
Is saying “I like vanilla ice cream” FAI-complete and must never be uttered or relied upon by anyone?
Or argue themselves into thinking that there is some subset of preferences such that every other (human?) agent should voluntarily choose to adopt it, against their better judgment (edit: as it contradicts what they perceive, after thorough introspection, as their own preferences)? You can add “objective moralists” to the list.
What would it be that is present in every single human’s brain architecture throughout human history that would be compatible with some fixed ordering over actions, called “morally good”? (Otherwise you’d have your immediate counterexample.) The notion seems so obviously ill-defined and misguided (hence my first comment asking Cousin_It).
It’s fine (to me) to espouse preferences that aim to change other humans (say, towards being more altruistic, or towards being less altruistic, or whatever), but to appeal to some objective guiding principle based on “human nature” (which constantly evolves in different strands) or some well-sounding ev-psych applause-light is just a new substitute for the good old Abrahamic heavenly father.
I wouldn’t say any of those things. Obviously paperclippers don’t “really want to help humankind”, because they don’t have any human notion of morality built-in in the first place. Statements like “I like vanilla ice cream” are also more trustworthy on account of being a function of directly observable things like how you feel when you eat it.
The only point I’m trying to make here is that it is possible to be mistaken about your own utility function. It’s entirely consistent for the vast majority of humans to have a large shared portion of their built-in utility function (built-in by their genes), even though many of them seemingly want to do bad things, and that’s because humans are easily confused and not automatically self-aware.
For sure.
I’d agree if humans were like dishwashers. There are templates for dishwashers, ways they are supposed to work. If you came across a broken dishwasher, there could be a case for the dishwasher to be repaired, to go back to “what it’s supposed to be”.
However, that is because there is some external authority (exasperated humans who want to fix their damn dishwasher; dirty dishes are piling up) conceiving of and enforcing such a purpose. The fact that genes and the environment shape utility functions in similar ways is a description, not a prescription. It would not be a case for any “broken” human to go back to “what his genes would want him to be doing”. Just like it wouldn’t be a case against brain uploading.
Some of the discussion seems to me like saying that “deep down in every flawed human, there is ‘a figure of light’, in our community ‘a rational agent following uniform human values with slight deviations accounting for ice-cream taste’, we just need to dig it up”. There is only your brain. With its values. There is no external standard to call its values flawed. There are external standards (rationality = winning) to better its epistemic and instrumental rationality, but those can help the serial killer and the GiveWell activist equally. (Also, both of those can be ‘mistaken’ about their values.)
If you have a preference for morality, being moral is not doing away with that preference: it is allowing your altruistic preferences to override your selfish ones.
You may be on the receiving end of someone else’s self-sacrifice at some point.
Certainly, but in that case your preference for the moral action is your personal preference, which is your ‘selfish’ preference. No conflict there. You should always do that which maximizes your utility function. If you call that moral, we’re in full agreement. If your utility function is maximized by caring about someone else’s utility function, go for it. I do, too.
That’s nice. Why would that cause me to do things which I do not overall prefer to do? Or do you say you always value that which you call moral the most?
I can make a quite clear distinction between my preferences relating to an apersonal loving-kindness towards the universe in general, and the preferences that center around my personal affections and likings.
You keep trying to do away with a distinction that has huge predictive ability: a distinction that helps determine what people do, why they do it, how they feel about doing it, and how they feel after doing it.
If your model of people’s psychology conflates morality and non-moral preferences, your model will be accurate only for the most amoral of people.
Morality is somewhat like chess in this respect—morality:optimal play::satisfying your preferences:winning. To simplify their preferences a bit, chess players want to win, but no individual chess player would claim that all other chess players should play poorly so he can win.
That’s explained simply by ‘winning only against bad players’ not being the most valued component of their preferences, preferring ‘wins when the other player did his/her very best and still lost’ instead. Am I missing your point?
Sorry, I didn’t explain well. To approach the explanation from different angles:
Even if all chess players wanted was to win, it would still be incorrect for them to claim that playing poorly is the correct way to play. Just like when I’m hungry, I want to eat, but I don’t claim that strangers should feed me for free.
Consider the prisoners’ dilemma, as analyzed traditionally. Each prisoner wants the other to cooperate, but neither can claim that the other should cooperate.
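For reference, a minimal sketch of the standard prisoners’ dilemma payoffs being appealed to here (conventional textbook numbers, not anything specified in this thread):

    # Standard prisoners' dilemma with conventional payoffs (years in prison,
    # so lower is better): each prisoner is better off defecting no matter what
    # the other does, yet each *wants* the other to cooperate.
    PAYOFFS = {  # (my_move, their_move) -> my_years_in_prison
        ("cooperate", "cooperate"): 1,
        ("cooperate", "defect"):    5,
        ("defect",    "cooperate"): 0,
        ("defect",    "defect"):    3,
    }

    for their_move in ("cooperate", "defect"):
        best = min(("cooperate", "defect"),
                   key=lambda my_move: PAYOFFS[(my_move, their_move)])
        print(f"if they {their_move}: my best response is to {best}")
    # Output: defect in both cases -- wanting the other side to cooperate
    # gives neither prisoner grounds to claim the other "should" cooperate.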
Incorrect because that’s not what the winning player would prefer. You don’t claim that strangers should feed you because that’s what you prefer. It’s part of your preferences. Some of your preferences can rely on satisfying someone else’s preferences. Such altruistic preferences are still your own preferences. Helping members of your tribe you care about. Cooperating within your tribe, enjoying the evolutionary triggered endorphins.
You’re probably thinking that considering external preferences and incorporating them in your own utility function is a core principle of being “morally right”. Is that so?
So the core disagreement (I think): Take an agent with a given set of preferences. Some of these may include the preferences of others, some may not. On what basis should that agent modify its preferences to include more preferences of others, i.e. to be “more moral”?
So you can imagine yourself in someone else’s position, then say “What B should do from A’s perspective” is different from “What B should do from B’s perspective”. Then you can enter all sorts of game theoretic considerations. Where does morality come in?
There is no “What B should do from A’s perspective”, from A’s perspective there is only “What I want B to do”. It’s not a “should”. Similarly, the chess player wants his opponent to lose, and I want people to feed me, but neither of those are “should”s. “Should”s are only from an agent’s own perspective applied to themselves, or from something simulating that perspective (such as modeling the other player in a game). “What B should do from B’s perspective” is equivalent to “What B should do”.
The key issue is that, whilst morality is not tautologously the same as preferences, a morally right action is, tautologously, what you should do.
So it is difficult to see on what grounds Mark can object to the FAI’s wishes: if it tells him something is morally right, that is what he should do. And he can’t have his own separate morality, because the idea is incoherent.
A distinction to be made: Mark can wish differently than the AI wishes, Mark can’t morally object to the AI’s wishes (if the AI follows morality).
Exactly because morality is not the same as preferences.
You can call a subset of your preferences moral, that’s fine. Say, eating chocolate ice cream, or helping a starving child. Let’s take a randomly chosen “morally right action” A.
That, given your second paragraph, would have to be a preference which, what, maximizes Mark’s utility, regardless of what the rest of his utility function actually looks like?
It seems trivial to construct a utility function (given any such action A) such that doing A does not maximize said utility function. Give Mark such a utility function and you’ve got yourself a reductio ad absurdum.
So, if you define a subset of preferences named “morally right” thus that any such action needs to maximize (edit: or even ‘not minimize’) an arbitrary utility function, then obviously that subset is empty.
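To make the construction explicit (a one-line formalization, using notation not in the original comment):

    % For any fixed action A, define a utility function that A minimizes:
    \[
        U_A(x) =
        \begin{cases}
            -1 & \text{if } x = A,\\
            \phantom{-}0 & \text{otherwise.}
        \end{cases}
    \]
    % An agent with utility function U_A never maximizes its utility by doing A,
    % so no single action can maximize every possible utility function.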
If Mark is capable of acting morally, he would have a preference for moral action which is strong enough to override other preferences. However, that is not really the point. Even if he is too weak-willed to do what the FAI says, he has no grounds to object to the FAI.
I can’t see how that amounts to more than the observation that not every agent is capable of acting morally. Ho hum.
I don’t see why. An agent should want to do what is morally right, but that doesn’t mean an agent would want to. Their utility function might not allow them. But how could they object to being told what is right? The fault, surely, lies in themselves.
They can object because their preferences are defined by their utility function, full stop. That’s it. They are not “at fault”, or “in error”, for not adopting some other preferences that some other agents deem to be “morally correct”. They are following their programming, as you follow yours. Different groups of agents share different parts of their preferences, think Venn diagram.
If the oracle tells you “this action maximizes your own utility function, you cannot understand how”, then yes the agent should follow the advice.
If the oracle told an agent “do this, it is morally right”, the non-confused agent would ask “do you mean it maximizes my own utility function?”. If yes, “thanks, I’ll do that”, if no “go eff yourself!”.
You can call an agent “incapable of acting morally” because you don’t like what it’s doing, it needn’t care. It might just as well call you “incapable of acting morally” if your circles of supposedly “morally correct actions” don’t intersect.
I can’t speak for cousin_it, natch, but for my own part I think it has to do with mutually exclusive preferences vs orthogonal/mutually reinforcing preferences. Using moral language is a way of framing a preference as mutually exclusive with other preferences.
That is… if you want A and I want B, and I believe the larger system allows (Kawoomba gets A AND Dave gets B), I’m more likely to talk about our individual preferences. If I don’t think that’s possible, I’m more likely to use universal language (“moral,” “optimal,” “right,” etc.), in order to signal that there’s a conflict to be resolved. (Well, assuming I’m being honest.)
For example, “You like chocolate, I like vanilla” does not signal a conflict; “Chocolate is wrong, vanilla is right” does.
Why stop at connotation and signalling? If there is a non-empty set of preferences whose satisfaction is inclined to lead to conflict, and a non-empty set of preferences that can be satisfied without conflict, then “morally relevant preference” can denote the members of the first set... which is not identical to the set of all preferences.
For any such preference, you can immediately provide a utility function such that the corresponding agent would be very unhappy about that preference, and would give its life to prevent it.
Or do you mean “a set of preferences the implementation of which would on balance benefit the largest amount of agents the most”? That would change as the set of agents changes, so does the “correct” morality change too, then?
Also, why should I or anyone else particularly care about such preferences (however you define them), especially as the “on average” doesn’t benefit me? Is it because, evolutionarily speaking, that’s what evolved? What our mirror neurons lead us towards? Wouldn’t that just be a case of the naturalistic fallacy?
Sure. So what? Kids don’t like teachers and criminals don’t like the police... but they can’t object to them, because “entity X is stopping me from doing bad things and making me do good things” is no (rational, adult) objection.
If being moral increases your utility, it increases your utility—what other sense of “benefitting me” is there?
If utility is the satisfaction of preferences, and you can have preferences that don’t benefit you (such as doing heroin), increasing your utility doesn’t necessarily benefit you.
If you can get utility out of paperclips, why can’t you get it out of heroin? You’re surely not saying that there is some sort of Objective utility that everyone ought to have in their UFs?
You can get utility out of heroin if you prefer to use it, which is an example of “benefiting me” and utility not being synonymous. I don’t think there’s any objective utility function for all conceivable agents, but as you get more specific in the kinds of agents you consider (i.e. humans), there are commonalities in their utility functions, due to human nature. Also, there are sometimes inconsistencies between (for lack of better terminology) what people prefer and what they really prefer—that is, people can act and have a preference to act in ways that, if they were to act differently, they would prefer the different act.
(Kids—teachers), (criminals—police), so is “morally correct” defined by the most powerful agents, then?
And if being moral (whatever it may mean) does not?
Adult, rational objections are objections that other agents might feel impelled to do something about, and so are not just based on “I don’t like it”. “I don’t like it” is no objection to “you should do your homework”, etc.
Then you would belong to the set of Immoral Agents, AKA Bad People.
“You should do your homework (… because it is in your own long-term best interest, you just can’t see that yet)” is in the interest of the kid, cf. an FAI telling you to do an action because it is in your interest. “You should jump out that window (… because it amuses me / because I call that morally good)” is not in your interest, you should not do that. In such cases, “I don’t like that” is the most pertinent objection and can stand all on its own.
Boo bad people! What if we encountered aliens with “immoral” preferences?
For my own part: denotationally, yes, I would understand “Do you prefer (that Dave eat) chocolate or vanilla ice cream?” and “Do you consider (Dave eating) chocolate ice cream or vanilla as the morally superior flavor for (Dave eating) ice cream?” as asking the same question.
Connotationally, of course, the latter has all kinds of (mostly ill-defined) baggage the former doesn’t.
My point was that trying to use a provably-boxed AI to do anything useful would probably not work, including trying to design an unboxed FAI, not that we should design boxed FAI. I may have been pessimistic; see Stuart Armstrong’s proposal of reduced impact AI, which sounds very similar to provably boxed AI but which might be used for just about everything, including designing an FAI.
I think we might have different definitions of a boxed-AI. An AI that is literally not allowed to interact with the world at all isn’t terribly useful and it sounds like a problem at least as hard as all other kinds of FAI.
I just mean a normal dangerous AI that physically can’t interact with the outside world. Importantly, its goal is to provably give the best output it possibly can if you give it a problem. So it won’t hide nanotech in your cure for Alzheimer’s, because that would be a less fit and more complicated solution than a simple chemical compound (you would have to judge solutions based on complexity, though, and have them verified by a human or in a simulation first, just in case).
I don’t think most computers today have anywhere near enough processing power to simulate a full human brain. A human down to the molecular level is entirely out of the question. An AI on a modern computer, if it’s smarter than human at all, will get there by having faster serial processing or more efficient algorithms, not because it has massive raw computational power.
And you can always scale down the hardware or charge it utility for using more computing power than it needs, forcing it to be efficient or limiting its intelligence further. You don’t need to invoke the full power of superintelligence for every problem, and for your safety you probably shouldn’t.
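A rough sketch of the kind of scoring rule being gestured at here; the weights and the particular measures of complexity and compute are placeholders invented for illustration, not a worked-out proposal:

    # Rough sketch: reward solution quality, penalize description complexity and
    # compute used. The weights and the complexity/compute measures are invented.

    def boxed_score(fitness: float,
                    description_bits: int,
                    compute_steps: int,
                    complexity_penalty: float = 0.01,
                    compute_penalty: float = 1e-9) -> float:
        """Higher is better: a simple chemical compound beats an equally
        effective cure that smuggles in megabytes of nanotech blueprints."""
        return (fitness
                - complexity_penalty * description_bits
                - compute_penalty * compute_steps)

    # Example: two candidate cures with equal measured fitness
    simple   = boxed_score(fitness=100.0, description_bits=2_000,   compute_steps=10**9)
    nanotech = boxed_score(fitness=100.0, description_bits=800_000, compute_steps=10**12)
    print(simple > nanotech)  # True: the simpler, cheaper answer wins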
A slightly bigger “large risk” than the one Pentashagon puts forward is that a provably boxed UFAI could indifferently give us information that results in yet another UFAI, just as unpredictable as itself (statistically speaking, it’s going to give us more unhelpful information than helpful, as Robb points out). Keep in mind I’m extrapolating here. At first you’d just be asking for mundane things like better transportation, cures for diseases, etc. If the UFAI’s mind is strange enough, and we’re lucky enough, then some of these things result in beneficial outcomes, politically motivating humans to continue asking it for things. Eventually we’re going to escalate to asking for a better AI, at which point we’ll get a crap-shoot.
An even bigger risk than that, though, is that if it’s especially Unfriendly, it may even do this intentionally, going so far as to pretend it’s friendly while bestowing us with data to make an AI even more Unfriendly than itself. So what do we do, box that AI as well, when it could potentially be even more devious than the one that already convinced us to make this one? Is it just boxes, all the way down? (spoilers: it isn’t, because we shouldn’t be taking any advice from boxed AIs in the first place)
The only use of a boxed AI is to verify that, yes, the programming path you went down is the wrong one, and resulted in an AI that was indifferent to our existence (and therefore has no incentive to hide its motives from us). Any positive outcome would be no better than an outcome where the AI was specifically Evil, because if we can’t tell the difference in the code prior to turning it on, we certainly wouldn’t be able to tell the difference afterward.