Relatedly, I’ve formed the tentative intuition that paper clip maximizers are very hard to build; in fact, harder to build than FAI. What you do get out of kludge superintelligences is probably just going to be the pure universal AI drives (or something like that), or possibly some sort of approximately objective convergent decision theoretic policy, perhaps dictated by the acausal economy.
The only really hard part about making a superintelligent paperclip maximizer is the superintelligence. If you think that specifying the goal of a Friendly AI is just as easy, then making a superintelligent FAI will be just as hard. The “fragility of value” thesis argues that Friendliness is significantly harder to specify, because there are many ways to go wrong.
I suspect that Buddhahood or something close is the only real attractor in mindspace that could be construed as reflectively consistent given the vector humanity seems to be on.
I don’t. Simple selfishness is definitely an attractor (in the sense that it’s an attitude that many people end up adopting), and it wouldn’t take much axiological surgery to make it reflectively consistent.
The only really hard part about making a superintelligent paperclip maximizer is the superintelligence. If you think that specifying the goal of a Friendly AI is just as easy, then making a superintelligent FAI will be just as hard. The “fragility of value” thesis argues that Friendliness is significantly harder to specify, because there are many ways to go wrong.
No, it takes a lot of work to specify paperclips, and thus it’s not as easy as just superintelligence. You need goal stability, a stable ontology of paperclips, et cetera. I’m positing that it’s easier to specify human values than to specify paperclips. I have a long list of considerations, but the most obvious is that paperclips have much higher Kolmogorov complexity than human values given some sort of universal prior with a sort of prefix you could pick up by looking at the laws of physics.
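One way to write down the comparison being asserted here, as a reading of the claim rather than anything the thread formalizes: let K be Kolmogorov complexity and P a description of the laws of physics used as a common prefix.

```latex
% Hedged paraphrase of the claim above, not a result: conditioned on physics,
% human values are claimed to be more compressible than a paperclip spec.
K(\text{paperclips} \mid P) > K(\text{human values} \mid P)
```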
I agree with your first sentence, but I suspect that building a friendly AI is harder than building a paperclipper.
No, it takes a lot of work to specify paperclips, and thus it’s not as easy as just superintelligence.
Reference class confusion. “Paperclipper” refers to any “universe tiler”, not just one that tiles with paperclips. Specifying paperclips in particular is hard. If you don’t care about exactly what gets tiled, it’s much easier.
goal stability
For well-understood goals, that’s easy. Just hardcode the goal. It’s making goals that can change in a useful way that’s hard. Part of the hardness of FAI is that we don’t understand “friendly”: we don’t know what humans want, we don’t know what’s good for humans, and any simplistic fixed goal will cut off our evolution.
the most obvious is that paperclips have much higher Kolmogorov complexity than human values
I don’t see how you could possibly believe that, except out of wishful thinking. Human values are contingent on our entire evolutionary history. My parochial values are contingent upon cultural history and my own personal history. Our values are not universal. Different types of creature will develop radically different values with only small points of contact and agreement.
Hardcoding is not necessarily stable in programs that can edit their own source code.
Really? Isn’t editing one’s goal directly contrary to one’s goal? If an AI self-edits in such a way that its goal changes, it will predictably no longer be working towards that goal, and will thus not consider it a good idea to edit its goal.
It depends on how it decides whether or not changes are a good thing. If it is trying out two utility functions (Ub for utility before and Ua for utility after), you need to be careful to ensure it doesn’t say “hey, Ua(x)>Ub(x), so I can make myself better off by switching to Ua!”.
Ensuring that doesn’t happen is not simple, because it requires stability throughout everything. There can’t be a section that decides to try being goalless, or go about resolving the goal in a different way (which is troublesome if you want it to cleverly use instrumental goals).
[edit] To be clearer, you need not just to have the goals be fixed and well-understood; every other part of the system also needs to have a fixed and well-understood relationship to the goals (and a fixed and well-understood sense of understanding, and …). Most attempts to rewrite source code are not that well-planned.
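A minimal sketch of the failure mode being described, with hypothetical Ub/Ua stand-ins: if the agent scores a proposed switch by the candidate utility function’s own lights, any self-flattering Ua looks like an improvement; the stable comparison evaluates both outcomes with the current Ub.

```python
# Toy illustration of the utility-switching failure mode discussed above.
# All names (Ub, Ua, the world dictionaries) are hypothetical placeholders.

def Ub(world):
    # Current utility: number of paperclips produced.
    return world["paperclips"]

def Ua(world):
    # Candidate utility: "how much utility do I report having?"
    # Trivially maximized by self-flattery rather than by paperclips.
    return world["self_reported_utility"]

world_if_switch = {"paperclips": 0, "self_reported_utility": 10**9}
world_if_keep = {"paperclips": 100, "self_reported_utility": 100}

# Broken comparison: judge the switch by the candidate's own lights.
broken_decision = Ua(world_if_switch) > Ub(world_if_keep)   # True: "switch!"

# Stable comparison: judge both outcomes with the *current* utility function.
stable_decision = Ub(world_if_switch) > Ub(world_if_keep)   # False: keep Ub

print(broken_decision, stable_decision)
```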
“Paperclipper” refers to any “universe tiler”, not just one that tiles with paperclips.
Well, a number of voices here do seem to believe that there is something instantiable which is the thing that humans actually want and will continue to want no matter how much they improve/modify.
Presumably those voices would not call something that tiles the universe with that thing a paperclipper, despite agreeing that it’s a universe tiler, at least technically.
I believe that most voices here think that there are conditions that humans actually want and that one of them is variety. This in no way implies a material tiling. Permanency of these conditions is more questionable, of course.
The only plausible attractive tiling with something material that I could see even a large minority agreeing to would be a computational substrate, “computronium”.
If we’re talking about conditions that don’t correlate well with particular classes of material things, sure.
That’s rarely true of the conditions I (or anyone else I know) value in real life, but I freely grant that real life isn’t a good reference class for these sorts of considerations, so that’s not evidence of much.
Still, failing a specific argument to the contrary, I would expect whatever conditions humans maximally value (assuming there is a stable referent for that concept, which I tend to doubt, but a lot of people seem to believe strongly) to implicitly define a class of objects that optimally satisfies those conditions. Even variety, assuming that it’s not the trivial condition of variety wherein anything is good as long as it’s new.
Computronium actually raises even more questions. It seems that unless one values the accurate knowledge (even when epistemically indistinguishable from the false belief) that one is not a simulation, the ideal result would be to devote all available mass-energy to computronium running simulated humans in a simulated optimal environment.
If we’re talking about conditions that don’t correlate well with particular classes of material things, sure.
I think that needs a bit of refinement. Having lots of food correlates quite well with food. Yet no one here wants to tile the universe with white rice. Other people are a necessity for a social circle, but again few want to tile the universe with humans (okay, there are some here that could be caricatured that way).
I think we all want the entire universe eventually to be physically controllable by humanity (and/or its FAI guardians), but tiling implies a uniformity that we won’t want. Certainly not on a local scale, and probably not on a macroscale either.
Computronium actually raises even more questions.
Right, which is why I brought it up as one of the few reasonable counterexamples. Still, it’s the programming that makes the difference between heaven and hell.
“Paperclipper” refers to any “universe tiler”, not just one that tiles with paperclips. Specifying paperclips in particular is hard. If you don’t care about exactly what gets tiled, it’s much easier.
My arguments apply to most kinds of things that tile the universe with highly arbitrary things like paperclips, though they apply less to things that tile the universe with less arbitrary things like bignums or something. I do believe arbitrary universe tilers are easier than FAI; just not paperclippers.
For well-understood goals, that’s easy. Just hardcode the goal.
There are currently no well-understood goals, nor are there obvious ways of hardcoding goals, nor are hardcoded goals necessarily stable for self-modifying AIs. We don’t even know what a goal is, let alone how to solve the grounding problem of specifying what a paperclip is (or a tree is, or what have you). With humans you get the nice trick of getting the AI to look back on the process that created it, or alternatively just use universal induction techniques to find optimization processes out there in the universe, which will also find humans. (Actually I think it would find mostly memes, not humans per se, but a lot of what humans care about is memes.)
Human values are contingent on our entire evolutionary history.
Paperclips are contingent on that, plus a whole bunch of random cultural stuff. Again, if we’re talking about universe tilers in general this does not apply.
Also, as a sort of appeal to authority, I’ve been working at the Singularity Institute for a year now, and have spent many many hours thinking about the problem of FAI (though admittedly I’ve given significantly less thought to how to build a paperclipper). If my intuitions are unsound, it is not for reasons that are intuitively obvious.
Paperclips are contingent on that, plus a whole bunch of random cultural stuff. Again, if we’re talking about universe tilers in general this does not apply.
Really? Maybe if you wanted it to be able to classify whether any arbitrary corner case counts as a paperclip or not, but that isn’t required for a paperclipper. If you’re just giving it a shape to copy then I don’t see why that would be more than a hundred bytes or so—trivial compared to the optimizer.
If you’re just giving it a shape to copy then I don’t see why that would be more than a hundred bytes or so—trivial compared to the optimizer.
A hundred bytes in what language? I get the intuition, but it really seems to me like paperclips are really complex. There are lots of important qualities of paperclips that make them clippy that seem to me like they’d be very hard to get an AI to understand. You say you’re giving it a shape, but that shape is not at all easily defined. And its molecular structure might be important, and its size, and its density, and its ability to hold sheets of paper together… Shape and molecular composition aren’t fundamental attributes of the universe that an AI would have a native language for. This is why we can’t just keep an oracle AI in a box—it turns out that our intuitive idea of what a box is is really hard to explain to a de novo AI. Paperclips are similar. And if the AI is smart enough to understand human concepts that well, then you should also be able to just type up CEV and give it that instead… CEV is easier to describe than a paperclip in that case, since CEV is already written up. (Edit: I mean a description of CEV is written up, not CEV. We’re still working on the latter.)
Well, here’s a recipe for a paperclip in English:
Make a wire, 10 cm long and 1 mm in diameter, composed of an alloy of 99.8% iron and 0.2% carbon. Start at one end and bend it such that the segments from 2-2.5 cm, 2.75-3.25 cm, and 5.25-5.75 cm form half-circles, with all the bends in the same direction and forming an inward spiral (the end with the first bend is outside the third bend).
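For scale, here is one toy way to write that recipe down as literal data (the field names are made up for illustration). The point is only that the parameters fit in a couple hundred bytes; it does not touch the grounding problem raised above, since all the complexity of “wire”, “alloy”, and “bend” is pushed into whatever has to interpret the fields.

```python
import json

# The recipe above, written down as literal parameters (hypothetical schema).
paperclip_spec = {
    "wire_length_cm": 10.0,
    "wire_diameter_mm": 1.0,
    "alloy": {"Fe": 0.998, "C": 0.002},
    # Half-circle bends, as (start_cm, end_cm) along the wire, all bent in
    # the same direction, forming an inward spiral.
    "half_circle_bends_cm": [(2.0, 2.5), (2.75, 3.25), (5.25, 5.75)],
}

# Prints the length of a compact serialization (around 150 characters) --
# small, but only because "cm", "alloy", and "bend" are left uninterpreted.
print(len(json.dumps(paperclip_spec)))
```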
If you can understand human concepts, “paperclip” is sufficient to tell you about paperclips. Google “paperclip,” you get hundreds of millions of results.
“Understanding human concepts” may be hard, but understanding arbitrary concrete concepts seems easier than understanding arbitrary abstract concepts that take a long essay to write up and have only been written up by one person and use a number of other abstract concepts in this writeup.
Sample a million paperclips from different manufacturers. Automatically cluster to find the 10,000 that are most similar to each other. Anything is a paperclip if it is more similar to one of those along every analytical dimension than another in the set.
Very easy, requires no human values. The AI is free to come up with strange dimensions along which to analyze paperclips, but it now has a clearly defined concept of paperclip, some (minimal) flexibility in designing them, and no need for human values.
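A rough sketch of that procedure under one reading of “more similar along every analytical dimension”: treat each sampled clip as a feature vector, keep the tightest cluster, and accept anything that falls inside that cluster’s per-dimension envelope. The features and the clustering shortcut here are illustrative assumptions, not a serious proposal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend feature vectors for a million sampled paperclips (hypothetical
# dimensions: length in mm, wire diameter in mm, mass in g, bend radius in mm).
samples = rng.normal(loc=[33.0, 0.9, 0.5, 2.2], scale=[3.0, 0.1, 0.05, 0.3],
                     size=(1_000_000, 4))

# Crude stand-in for "automatically cluster to find the 10,000 most similar":
# keep the 10,000 points closest to the coordinate-wise median.
median = np.median(samples, axis=0)
core = samples[np.argsort(np.linalg.norm(samples - median, axis=1))[:10_000]]

# One reading of the membership rule: a candidate counts as a paperclip if,
# on every analytical dimension, it is no more extreme than some member of
# the core set, i.e. it lies within the envelope the set spans.
lo, hi = core.min(axis=0), core.max(axis=0)

def is_paperclip(candidate: np.ndarray) -> bool:
    return bool(np.all((candidate >= lo) & (candidate <= hi)))

print(is_paperclip(np.array([33.0, 0.9, 0.5, 2.2])))  # central example: True
print(is_paperclip(np.array([80.0, 0.9, 0.5, 2.2])))  # far too long: False
```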
Counter:
Sample a million people from different continents. Automatically cluster to find the 10,000 that are most similar to each other. Anything is a person if it is more similar to one of those along every analytical dimension than another in the set.
This is already tripping majoritarianism alarm bells.
I would meditate on this for a while when trying to define a paperclip.
TheOtherDave got it right; I wasn’t trying to give a complete definition of what is and isn’t a paperclip, I was just offering forth an easy-to-define (without human values) subset that we would still call paperclips.
It has plenty of false negatives, but I don’t really see that as a loss. Likewise, your personhood algorithm doesn’t bother me as long as we don’t use it to establish non-personhood.
Ah. This distinction escaped me; I tend to use definitions in a formal logical style (P or ~P, no other options).
Sure.
For that matter, one could play all kinds of Hofstadterian games along these lines… is a staple a paperclip? After all, it’s a thin piece of shaped metal designed and used to hold several sheets of paper together. Is a one-pound figurine of a paperclip a paperclip? Does it matter if you use it as a paperweight, to hold several pieces of paper together? Is a directory on a computer file system a virtual paperclip? Would it be more of one if we’d used the paperclip metaphor rather than the folder metaphor for it? And on and on and on.
In any case, I agree that the intuition that paperclips are an easy set to define depends heavily on the idea that not very many differences among candidates for inclusion matter very much, and that it should be obvious which differences those are. And all of that depends on human values.
Put a different way: picking 10,000 manufactured paperclips and fitting a category definition to those might exclude any number of things that, if asked, we would judge to be paperclips… but we don’t really care, so a category arrived at this way is good enough. Adopting the same approach to humans would similarly exclude things we would judge to be human… and we care a lot about that, at least sometimes.
There are currently no well-understood goals, nor are there obvious ways of hardcoding goals
“Find a valid proof of this theorem from the axioms of ZFC”. This goal is pretty well-understood, and I don’t believe an AI with such a goal will converge on Buddha. Or am I misunderstanding your position?
This doesn’t take into account logical uncertainty. It’s easy to write a program that eventually computes the answer you want, and then pose a question of doing that more efficiently while provably retaining the same goal, which is essentially what you cited, with respect to a brute force classical inference system starting from ZFC and enumerating all theorems (and even this has its problems, as you know, since the agent could be controlling which answer is correct). A far more interesting question is which answer to name when you don’t have time to find the correct answer. “Correct” is merely a heuristic for when you have enough time to reflect on what to do.
(Also, even to prove theorems, you need operating hardware, and managing that hardware and other actions in the world would require decision-making under (logical) uncertainty. Even nontrivial self-optimization would require decision-making under uncertainty that has a “chance” of turning you away from the correct question.)
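To make the “easy to write a program that eventually computes the answer” point concrete, here is a toy stand-in (integer factoring in place of ZFC proof enumeration, purely for brevity): the goal is trivial to state and to verify, the brute-force searcher provably attains it eventually, and the interesting difficulty, doing the same thing efficiently and deciding what to output when time runs out first, is exactly what the sketch leaves out.

```python
from typing import Optional

def verified(n: int, d: int) -> bool:
    # The analogue of proof checking: cheap, mechanical, unambiguous.
    return 1 < d < n and n % d == 0

def brute_force_factor(n: int) -> Optional[int]:
    # The analogue of enumerating all proofs in order: guaranteed to find a
    # correct answer eventually whenever one exists, with no cleverness.
    for d in range(2, n):
        if verified(n, d):
            return d
    return None  # no nontrivial factor exists

print(brute_force_factor(91))  # 7
print(brute_force_factor(97))  # None
```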
What’s more interesting about it? Think for some time and then output the best answer you’ve got.
Try to formalize this intuition. With provably correct answers, that’s easy. Here, you need a notion of “best answer I’ve got”, a way of comparing possible answers where correctness remains inaccessible. This makes it “more interesting”: where the first problem is solved (to an extent), this one isn’t.
I don’t. Simple selfishness is definitely an attractor (in the sense that it’s an attitude that many people end up adopting), and it wouldn’t take much axiological surgery to make it reflectively consistent.
Individual humans sometimes become more selfish, but not consistently reflectively so, and humanity seems to be becoming more humane over time. Obviously there’s a lot of interpersonal and even intrapersonal variance, but the trend in human values is both intuitively apparent and empirically verified by e.g. the World Values Survey and other axiological sociology. Also, I doubt selfishness is as strong an attractor among people who are the smartest and most knowledgeable of their time. Look at e.g. Maslow’s research on people at the peak of human performance and mental health, and the attractors he identified as self actualization and self transcendence. Selfishness (or simple/naive selfishness) mostly seems like a pitfall for stereotypical amateur philosophers and venture capitalists.
I’m taking another look at this and find it hard to sum up just how many problems there are with your argument.
I doubt selfishness is as strong an attractor among people who are the smartest and most knowledgeable of their time.
What about the people who are the most powerful of their time? Think about what the psychology of a billionaire must be. You don’t accumulate that much wealth just by setting out to serve humanity. You care about offering your customers a service, but you also try to kill the competition, and you cut deals with the already existing powers, especially the state. Most adults are slaves, to an economic function if not literally doing what another person tells them, and then there is a small wealthy class of masters who have the desire and ability to take advantage of this situation.
I started out by opposing “simple selfishness” to your hypothesis that “Buddhahood or something close” is the natural endpoint of human moral development. But there’s also group allegiance: my family, my country, my race, but not yours. I look out for my group, it looks out for me, and caring about other groups is a luxury for those who are really well off. Such caring is also likely to be pursued in a form which is advantageous, whether blatantly or subtly, for the group which gets to play benefactor. We will reshape you even while we care for you.
Individual humans sometimes become more selfish, but not consistently reflectively so
How close do you think anyone has ever come to reflective consistency? Anyway, you are reflectively consistent if there’s no impulse within you to change your goals. So anyone, whatever their current goals, can achieve reflective consistency by removing whatever impulses for change-of-values they may have.
the only real attractor in mindspace that could be construed as reflectively consistent given the vector humanity seems to be on.
Reflective consistency isn’t a matter of consistency with your trajectory so far, it’s a matter of consistency when examined according to your normative principles. The trajectory so far did not result from any such thorough and transparent self-scrutiny.
Frankly, if I ask myself, what does the average human want to be, I’d say a benevolent dictator. So yes, the trend of increasing humaneness corresponds to something—increased opportunity to take mercy on other beings. But there’s no corresponding diminution of interest in satisfying one’s own desires.
Let’s see, what else can I disagree with? I don’t really know what your concept of Buddhahood is, but it sounds a bit like nonattachment for the sake of pleasure. I’ll take what pleasures I can, and I’ll avoid the pain of losing them by not being attached to them. But that’s aestheticism or rational hedonism. My understanding of Buddhahood is somewhat harsher (to a pleasure-seeking sensibility), because it seeks to avoid pleasure as well as pain, the goal after all being extinction, removal from the cycle of life. But that was never a successful mass philosophy, so you got more superstitious forms of Buddhism in which there’s a happy pure-land afterlife and so on.
I also have to note that an AI does not have to be a person, so it’s questionable what implications trends in human values have for AI. What people want themselves to be and what they would want a non-person AI to be are different topics.
Frankly, if I ask myself, what does the average human want to be, I’d say a benevolent dictator.
Seriously? For my part, I doubt that a typical human wants to do that much work. I suspect that “favored and privileged subject of a benevolent dictator” would be much more popular. Even more popular would be “favored and privileged beneficiary of a benevolent system without a superior peer.”
But agreed that none of this implies a reduced interest in having one’s desires satisfied.
(ETA: And, I should note, I agree with your main point about nonuniversality of drives.)
Immediately after I posted that, I doubted it. A lot of people might just want autonomy—freedom from dependency on others and freedom from the control of others. Dictator of yourself, but not dictator of humanity as a whole. Though one should not underestimate the extent to which human desire is about other people.
Will Newsome is talking about—or I thought he was talking about—value systems that would be stable in a situation where human beings have superintelligence working on their side. That’s a scenario where domination should become easy and without costs, so if people with a desire to rule had that level of power, the only thing to stop them from reshaping everyone else would be their own scruples about doing so; and even if they were troubled in that way, what’s to stop them from first reshaping themselves so as to be guiltless rulers of the world?
Also, even if we suppose that that outcome, while stable, is not what anyone would really want, if they first spent half an eternity in self-optimization limbo investigating the structure of their personal utility function… I remain skeptical that “Buddhahood” is the universal true attractor, though it’s hard to tell without knowing exactly what connotations Will would like to convey through his use of the term.
I am skeptical about universal attractors in general, including but not limited to Buddhahood and domination. (Psychological ones, anyway. I suppose entropy is a universal attractor in some trivial sense.) I’m also inclined to doubt that anything is a stable choice, either in the sense you describe here, or in the sense of not palling after a time of experiencing it.
Of course, if human desires are editable, then anything can be a stable choice: just modify the person’s desires such that they never want anything else. By the same token, anything can be a universal attractor: just modify everyone’s desires so they choose it. These seem like uninteresting boundary cases.
I agree that some humans would, given the option, choose domination. I suspect that’s <1% of the population given a range of options, though rather more if the choice is “dominate or be dominated.” (Although I suspect most people would choose to try it out for a while, if that were an option, then would give it up in less than a year.)
I suspect about the same percentage would choose to be dominated as a long-term lifestyle choice, given the expectation that they can quit whenever they want.
I agree that some would choose autonomy, though again I suspect not that many (<5%, say) would choose it for any length of time.
I suspect the majority of humans would choose some form of interdependency, if that were an option.
Entropy is the lack of an identifiable attractor.