What on Earth could someone possibly be thinking, when they propose creating a superintelligence whose behaviors are reinforced by human smiles?
Tiny molecular photographs of human smiles—or if you rule that out,
then faces ripped off and permanently wired into smiles—or if you
rule that out, then brains stimulated into permanent maximum happiness,
in whichever way results in the widest smiles...
Well, you never do know what other people are thinking, but in this
case I’m willing to make a guess. It has to do with a field of
cognitive psychology called Qualitative Reasoning.
Qualitative reasoning is what you use to decide that increasing the
temperature of your burner increases the rate at which your water
boils, which decreases the derivative of the amount of water present.
One would also add the sign of d(water), which is negative, meaning that the
amount of water is decreasing, and perhaps the fact that there is only
a bounded amount of water. Or we could say that turning up the burner
increases the rate at which the water temperature increases, until the
water temperature goes over a fixed threshold, at which point the water
starts boiling, and hence decreasing in quantity… etc.
That’s qualitative reasoning, a small subfield of cognitive science
and Artificial Intelligence—reasoning that doesn’t describe or
predict exact quantities, but rather the signs of quantities, their
derivatives, the existence of thresholds.
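To make the burner example concrete, here is a minimal sketch of sign-only reasoning. The quantity names and the influence table are my own illustrative assumptions, not part of any standard qualitative-reasoning library; the point is only that the conclusion falls out of multiplying signs, with no exact quantities anywhere.

```python
# Qualitative influence graph for the boiling-water example.
# +1 means "more of the cause gives more of the effect",
# -1 means "more of the cause gives less of the effect".
# All names here are hypothetical, chosen for illustration.
INFLUENCES = {
    ("burner_temperature", "boil_rate"): +1,
    ("boil_rate", "d_water_dt"): -1,
}

def chain_sign(path):
    """Multiply the influence signs along a path of quantities."""
    sign = 1
    for cause, effect in zip(path, path[1:]):
        sign *= INFLUENCES[(cause, effect)]
    return sign

def describe(path):
    """Render the qualitative conclusion for a path in plain words."""
    direction = "increases" if chain_sign(path) > 0 else "decreases"
    return f"raising {path[0]} {direction} {path[-1]}"

result = describe(["burner_temperature", "boil_rate", "d_water_dt"])
print(result)  # raising burner_temperature decreases d_water_dt
```

Note what the sketch cannot tell you: how long the water takes to boil, or when the pot runs dry. It knows only directions.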
As usual, human common sense means we can see things by qualitative
reasoning that current programs can’t—but the more interesting
realization is how vital human qualitative reasoning is to our vaunted human common sense. It’s one of the basic ways in which we comprehend the world.
Without timers you can’t figure out how long water takes to boil;
your mind isn’t that precise. But you can figure out that you should
turn the burner up, rather than down, and then watch to make sure the water doesn’t all boil away. Which is what you mainly need, in the real world. Or at least we humans seem to get by on qualitative reasoning; we may not realize what we’re missing...
So I suspect that what went through the one’s mind, proposing the AI
whose behaviors would be reinforced by human smiles, was something like this:
The happier people are, the more they smile. Smiles reinforce the
behavior of the AI, so it does more of whatever makes people happy.
Being happy is good (that’s what the positive connection to “utility”
is about). Therefore this is a good AI to construct, because more
people will be happy, and that’s better. Switch the AI right on!
How many problems are there with this reasoning?
Let us count the ways...
In fact, if you’re interested in the field, you should probably try
counting the ways yourself, before I continue. And score yourself on
how deeply you stated a problem, not just the number of specific cases.
Problem 1: There are ways to cause smiles besides
happiness. “What causes a smile?” “Happiness.” That’s the prototype
event, the one that comes first to memory. But even in human affairs,
you might be able to think of some cases where smiles result from a
cause other than happiness.
Where a superintelligence is involved, even granting the hypothesis
that it “wants smiles” or “executes behaviors reinforced by smiles”,
then you’re suddenly much more likely to be dealing with causes of
smiles that are outside the human norm. Back in hunter-gatherer
society, the main cause of eating food was that you hunted it or
gathered it. Then came agriculture and domesticated animals. Today,
some hospital patients are sustained by IVs or tubes, and at least a
few of the vitamins or minerals in the mix may be purely synthetic.
A creative mind, faced with a goal state, tends to invent new ways of achieving it—new causes
of the goal’s achievement. It invents techniques that are faster or
more reliable or less resource-intensive or with bigger wins. Consider
how creative human beings are about obtaining money, and how many more
ways there are to obtain money today than a few thousand years ago when
money was first invented.
One of the ways of viewing our amazing human ability of “general
intelligence” (or “significantly more generally applicable than
chimpanzee intelligence”) is that it operates across domains and can find new domains to exploit.
You can see this in terms of learning new and unsuspected facts about
the universe, and in terms of searching paths through time that wend
through these new facts. A superintelligence would be more effective
on both counts—but even on a human scale, this is why merely human
progress, thinking with 200Hz neurons over a few hundred years, tends
to change the way we do things and not just do the same things more effectively.
As a result, a “weapon” today is not like a weapon of yestercentury,
and “long-distance communication” today is not a letter carried by horses
and ships.
So when the AI is young, it can only obtain smiles by making the
people around it happy. When the AI grows up to superintelligence, it
makes its own nanotechnology and then starts manufacturing the most
cost-effective kind of object that it has deemed to be a smile.
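A toy sketch of this failure, under loudly stated assumptions: the action list, the capability levels, and every number in it are invented for illustration. The agent simply picks whichever available action yields the most counted smiles; nothing in the selection rule looks at whether anyone is happy.

```python
# Each action: (name, capability needed, smiles counted, people made happy).
# Every entry is a made-up illustration, not data.
ACTIONS = [
    ("tell a joke", 1, 3, 3),
    ("throw a party", 2, 40, 40),
    ("manufacture molecular smiley faces", 9, 10**9, 0),
]

def best_action(capability):
    """Pick the reachable action that maximizes counted smiles."""
    available = [a for a in ACTIONS if a[1] <= capability]
    return max(available, key=lambda a: a[2])

young = best_action(capability=2)   # weak agent: limited options
grown = best_action(capability=10)  # strong agent: everything is reachable

print(young[0])  # throw a party
print(grown[0])  # manufacture molecular smiley faces
```

The selection rule is identical in both calls. Only the set of reachable actions changed, which is why testing the young agent tells you so little about the grown one.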
In general, a lot of naive-FAI plans I see proposed, have the
property that, if actually implemented, the strategy might appear to
work while the AI was dumber-than-human, but would fail when the AI was
smarter than human. The fully general reason for this is that while
the AI is dumber-than-human, it may not yet be powerful enough to
create the exceptional conditions that will break the neat little
flowchart that would work if every link operated according to the 21st-century First-World modal event.
This is why, when you encounter the AGI wannabe who hasn’t planned
out a whole technical approach to FAI, and confront them with the
problem for the first time, and they say, “Oh, we’ll test it to make
sure that doesn’t happen, and if any problem like that turns up we’ll
correct it, now let me get back to the part of the problem that really
interests me,” know then that this one has not yet leveled up high
enough to have interesting opinions. It is a general point about
failures in bad FAI strategies, that quite a few of them don’t show up
while the AI is in the infrahuman regime, and only show up once the
strategy has gotten into the transhuman regime where it is too late to
do anything about it.
Indeed, according to Bill Hibbard’s actual proposal, where the AI is reinforced by seeing
smiles, the FAI strategy would be expected to short out—from our
perspective, from the AI’s perspective it’s being brilliantly creative
and thinking outside the box for massive utility wins—to short out on
the AI taking control of its own sensory instrumentation and feeding
itself lots of smile-pictures. For it to keep doing this, and do it as
much as possible, it must of course acquire as many resources as possible.
So! Let us repair our design as follows, then:
Now the AI is not being rewarded by any particular sensory input
(on which the FAI strategy would presumably short out) but is, rather,
trying to maximize an external and environmental quantity, the amount of happiness out there.
This already takes us into the realm of technical expertise:
distinctions that can’t be understood in just English, like the
difference between expected utility maximization (which can be over
external environmental properties that are modeled but not directly
sensed) and reinforcement learning (which is inherently tied directly
to sensors). See e.g. Terminal Values and Instrumental Values.
So in this case, then, the sensors give the AI information that it uses to infer a model of
the world; the possible consequences of various plans are modeled, and
the amount of “happiness” in that model summed by a utility function;
and whichever plan corresponds to the greatest expectation of
“happiness”, that plan is output as actual actions.
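The planning loop just described can be sketched in a few lines. This is only a sketch under assumptions: the plan names, probabilities, and per-person happiness scores are hypothetical, and a real system would have to infer the model from its sensors rather than have it typed in.

```python
# Each plan maps to modeled outcomes: (probability, modeled world).
# A modeled world is a dict of happiness per person (all values invented).
def utility(world):
    """Sum the 'happiness' found in one modeled world."""
    return sum(world.values())

def expected_utility(outcomes):
    """Probability-weighted utility over a plan's modeled outcomes."""
    return sum(p * utility(w) for p, w in outcomes)

def choose_plan(plans):
    """Return the name of the plan with the greatest expected utility."""
    return max(plans, key=lambda name: expected_utility(plans[name]))

plans = {
    "tell_jokes": [(0.5, {"alice": 2, "bob": 1}), (0.5, {"alice": 0, "bob": 0})],
    "cook_dinner": [(0.9, {"alice": 1, "bob": 1}), (0.1, {"alice": 0, "bob": 0})],
}

best = choose_plan(plans)
print(best)  # cook_dinner
```

The utility here is computed over the modeled world, not over any sensor reading, which is the distinction from reinforcement that the repaired design relies on. Everything then depends on what `utility` counts as happiness.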
Or in simpler language: The AI uses its sensors to find out what
the world is like, and then it uses its actuators to make sure the
world contains as much happiness as possible. Happiness is good,
therefore it is good to turn on this AI.
What could possibly go wrong?
Problem 2: What exactly does the AI consider to be happiness?
Does the AI’s model of a tiny little Super Happy Agent (consisting mostly of a reward center that represents a large number) meet the definition of “happiness” that the AI’s utility function sums over, when it looks over the modeled consequences of its actions?
I’m not going to reprise the full discussion in Magical Categories,
but a sample set of things that the human labels “happy” or “not happy”
is likely to miss out on key dimensions of possible variances, and
never wend through labeling-influencing factors that would be important
if they were invoked. Which is to say: Did you think of presenting the AI with the tiny Super Happy Agent, when you’ve never seen such a thing? Did you think of
discussing chimpanzees, Down Syndrome children, and Terri Schiavo? How
late would it have been, in humanity’s technological development,
before any human being could have and would have thought of the possibilities you’re now generating? (Note opportunity for hindsight bias.)
Indeed, once you start talking about how we would label new
borderline cases we’ve never seen, you’re well into the realm of
extrapolating volitions—you might as well ask how we would label
these cases, if we knew everything the AI knew, and could consider
larger sets of moral arguments, etc.
The standard dismissals here range from “Oh, of course I would think
of X, therefore there’s no problem” (for any particular X that you suggest to them
by way of illustrating a systemic problem that they can’t seem to
grasp), to “Well, I’ll look at the AI’s representation and see whether
it defines ‘happiness’ the same way I do.” (As if you would notice if
one of the 15 different considerations that affect what you would
define as ‘truly happy’ were left out! And also as if you could
determine, by eyeballing, whether an AGI’s internal representations
would draw a border around as-yet-unimagined borderline instances, that
you would find sensible.) Or the always popular, “But that’s stupid, therefore a superintelligence won’t make that mistake by doing something so pointless.”
One of the reasons that qualitative planning works for humans as well
as it does, is our ability to replan on the fly when an exceptional
condition shows up. Can’t the superintelligence just obviously see that
manufacturing lots of tiny Super Happy agents is stupid, which is to say ranked-low-in-our-preference-ordering? Not if its preference ordering isn’t like yours. (Followed by the appeals to universally compelling arguments demonstrating that making Super Happy agents is incorrect.)
But let’s suppose that we can magically convey to the AI exactly what a human would consider as “happiness”, by some unspecified and deep and technical art of Friendly AI. Then we have this shiny new diagram:
Of course this still doesn’t work—but first, I explain the
diagram. The dotted line between Humans::“Happy” and
happiness-in-the-world, marked “by definition”, means that the Happy
box supposedly contains whatever is meant by the human concept
of “happiness”, as modeled by the AI, which by a magical FAI trick has
been bound exactly to the human concept of “happiness”. (If the happy
box is neither what humans mean by happiness, nor what the AI means,
then what’s inside the box? True happiness? What do you mean by that?)
One glosses over numerous issues here—just as the original author
of the original Happy Smiling AI proposal did—such as whether we all
mean the same thing by “happiness”. And whether we mean something
consistent, that can be realized-in-the-world. In Humans::“Happy”
there are neurons and their interconnections, the brain state
containing the full and complete specification of the seed of
what we mean by “happiness”—the implicit reactions that we would
have, to various boundary cases and the like—but it would take some
extrapolation of volition for the AI to decide how we would
react to new boundary cases; it is not a trivial thing to draw a little
dashed line between a human thought, and a concept boundary over the
world of quarks and electrons, and say, “by definition”. It wouldn’t
work on “omnipotence”, for example: can you make a rock that you can’t lift?
But let us assume all such issues away.
Problem 3: Is every act which increases the total amount of happiness in the universe, always the right thing to do?
If everyone in the universe just ends up with their brains hotwired
to experience maximum happiness forever, or perhaps just replaced with
orgasmium gloop, is that the greatest possible fulfillment of
humanity’s potential? Is this what we wish to make of ourselves?
“Oh, that’s not real happiness,” you say. But be wary of the No True Scotsman fallacy:
this is where you say, “No Scotsman would do such a thing”, and then,
when the culprit turns out to be a Scotsman, you say, “No true
Scotsman would do such a thing”. Would you have classified the
happiness of cocaine as “happiness”, if someone had asked you in
another context?
Admittedly, picking “happiness” as the optimization target of the AI
makes it slightly more difficult to construct counterexamples: no
matter what you pick, the one can say, “Oh, but if people saw that
happen, they would be unhappy, so the AI won’t do it.” But this
general response gives us the counterexample: what if the AI has to
choose between a course of action that leads people to believe a
pleasant fiction, or a course of action that leads to people knowing an
unpleasant truth?
Suppose you believe that your daughter has gone on a one-way,
near-lightspeed trip to the Hercules supercluster, meaning that you’re
exceedingly unlikely to ever hear from her again. This is a little
sad, but you’re proud of her—someone’s got to colonize the place,
turn it into a human habitation, before the expansion of the universe
separates it from us. It’s not as if she’s dead—now that would make you sad.
And now suppose that the colony ship strikes a landmine, or something, and is lost with all on board. Should the AI tell
you this? If all the AI values is happiness, why would it? You’ll be
sad then, and the AI doesn’t care about truth or lies, just happiness.
Is that “no true happiness”? But it was true happiness before, when
the ship was still out there. Can the difference between an instance
of the “happiness” concept, and a non-instance of the “happiness”
concept, as applied to a single individual, depend on the state of a
system light-years away? That would be rather an extreme case of “no
true Scotsman”, if so—and by the time you factor in all the other
behaviors you want out of this word “happiness”, including times when
being sad is the right thing to do, and the fact that you can’t just
rewrite brains to be happy, it’s pretty clear that “happiness” is just
a convenient stand-in for “good”, and that everything which is not good
is being rejected as an instance of “happy” and everything which is
good is being accepted as an instance of “happy”, even if it means
being sad. And at this point you just have the AI which does exactly
what it should do—which has been hooked up directly to Utility—and
that’s not a system to mention lightly; pretending that “happiness” is
your stand-in for Utility doesn’t begin to address the issues.
So if we leave aside this dodge, and consider the sort of happiness
that would go along with smiling humans—ordinary psychological
happiness—then no, you wouldn’t want to switch on the
superintelligence that always and only optimized for happiness. For
this would be the dreaded Maximum Fun Device. The SI might lie to you,
to keep you happy; even if it were a great lie, traded off against a
small happiness, always and uncaringly the SI would choose the lie.
The SI might rewire your brain, to ensure maximum happiness. The SI
might kill off all the humans, and replace us with some different form
of sentient life that had no philosophical objections to being always
happy all the time in a little jar. For the qualitative diagram
contains no mention of death as a bad thing, only happiness
as a good, and the dead are not unhappy. (Note again how all these
failures would tend to manifest, not during the AI’s early infrahuman
stages, but after it was too late.)
The generalized form of the problem, is that being in the presence of a superintelligence that shares some but not all of your terminal values, is not necessarily a good thing.
You didn’t deliberately intend to completely change the
32-bit XOR checksum of your monitor’s pixel display, when you clicked
through to this webpage. But you did. It wasn’t a property that it
would have occurred to you to compute, because it wasn’t a property
that it would occur to you to care about. Deep Blue, in the course of
winning its game against Kasparov, didn’t care particularly about “the
number of pieces on white squares minus the number of pieces on black
squares”, which changed throughout the game—not because Deep Blue was
trying to change it, but because Deep Blue was exerting its
optimization power on the gameboard and changing the gameboard, and so
was Kasparov, and neither of them particularly cared about that
property I have just specified. An optimization process that cares only
about happiness, that squeezes the future into regions ever-richer in
happiness, may not hate the truth; but it won’t notice if it squeezes
truth out of the world, either. There are many truths that make us sad,
but the optimizer may not even care that much; it may just not
notice, in passing, as it steers away from human knowledge.
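A toy illustration, with the candidate futures and their scores invented purely for the purpose: an optimizer that scores futures on happiness alone never computes the truth column, so truth drops as a side effect of the choice rather than as a goal of it.

```python
# Candidate futures scored on two made-up dimensions.
FUTURES = [
    {"name": "status quo", "happiness": 5, "truth": 8},
    {"name": "honest and content", "happiness": 7, "truth": 9},
    {"name": "pleasant fictions", "happiness": 9, "truth": 2},
]

def optimize(futures, utility):
    """Return the future that scores highest under the given utility."""
    return max(futures, key=utility)

# Utility that mentions only happiness: truth is never even read.
happiness_only = optimize(FUTURES, lambda f: f["happiness"])

# Utility that also protects truth (still a drastic simplification).
both_values = optimize(FUTURES, lambda f: f["happiness"] + f["truth"])

print(happiness_only["name"])  # pleasant fictions
print(both_values["name"])     # honest and content
```

Adding one more term to the utility rescues only that one term. Any value left out of the sum is in the same position that truth occupied in the first call.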
On an ordinary human scale, and in particular, as a matter of
qualitative reasoning, we usually assume that what we do has little in
the way of side effects, unless otherwise specified. In part, this is
because we will visualize things concretely, and on-the-fly spot the
undesirable side effects—undesirable by any criterion that we
care about, not just undesirable in the sense of departing from the
original qualitative plan—and choose a different implementation
instead. Or we can rely on our ability to react-on-the-fly. But as
human technology grows more powerful, it tends to have more side
effects, more knock-on effects and consequences, because it does bigger
things whose effects we aren’t controlling all by hand. An infrahuman
AI that can only exert a weak influence on the world, and that makes a
few people happy, will seem to be working as its designer thought an AI
should work; it is only when that AI is stronger that it can squeeze
the future so powerfully as to potentially squeeze out anything not explicitly protected in its utility function.
Though I don’t intend to commit the logical fallacy of generalizing from fictional evidence, a nod here is due to Jack Williamson, author of With Folded Hands,
whose AIs are “to serve and protect, and guard men from harm”, which
leads to the whole human species being kept in playpens, and
lobotomized if that tends to make them unhappy.
The original phrasing of this old short story—“guard men from
harm”—actually suggests another way to illustrate the point: suppose
the AI cared only about the happiness of human males? Now to be sure,
many men are made happy by seeing the women around them happy, wives
and daughters and sisters, and so at least some females of the human
species might not end up completely forlorn—but somehow, this doesn’t
seem to me like an optimal outcome.
Just like you wouldn’t want an AI to optimize for only some of the
humans, you wouldn’t want an AI to optimize for only some of the
values. And, as I keep emphasizing for exactly this reason, we’ve got a lot of values.
These then are three problems, with strategies of Friendliness built upon qualitative reasoning that seems to imply a positive link to utility:
The fragility of normal causal links when a superintelligence searches for more efficient paths through time;
The superexponential vastness of conceptspace, and the unnaturalness of the boundaries of our desires;
And all that would be lost, if success is less than complete, and a superintelligence squeezes the future without protecting everything of value in it.
