Cognitive Neuroscience, Arrow’s Impossibility Theorem, and Coherent Extrapolated Volition
Suppose we want to use the convergence of humanity’s preferences as the utility function of a seed AI that is about to determine the future of its light cone.
We’ve figured out how to get an AI to extract preferences from human behavior and brain activity, and the AI has figured out how to extrapolate those values. But my values and your values and Sarah Palin’s values aren’t fully converging in the simulation running the extrapolation algorithm. Our simulated beliefs are converging, because on the path to reflective equilibrium our partially simulated selves have become true Bayesians and Aumann’s Agreement Theorem holds. But our preferences aren’t converging quite so well.
What to do? We’d like the final utility function in the FOOMed AI to adhere to some common-sense criteria:
Non-dictatorship: No single person’s preferences should dictate what the AI does. Its utility function must take multiple people’s (extrapolated) preferences into account.
Determinism: Given the same choices, and the same utility function, the AI should always make the same decisions.
Pareto efficiency: If every (extrapolated) person prefers action A to action B, the AI should prefer A to B.
Independence of irrelevant alternatives: If we — a group of extrapolated preference-sets — prefer A to B, and a new option C is introduced, then we should still prefer A to B regardless of what we think about C.
Now, Arrow’s impossibility theorem says that (once there are three or more options on the table) no aggregation rule can satisfy all of these criteria using only ordinal preferences (“A is better than B, and that’s all I can say”). So we can only get the FOOMed AI’s utility function to adhere to these criteria if the extrapolated preferences of each partially simulated agent are related to each other cardinally (“A is 2.3 times better than B!”).
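To make the ordinal failure mode concrete, here is a toy sketch (my own illustration, with invented ballots and numbers, not anything from the literature): the Borda count, which sees only rankings, can flip the group’s A-versus-B verdict when an “irrelevant” option C appears, while summing cardinal utilities cannot.

```python
def borda(ballots):
    """Borda count: each option scores (number of options ranked below it) per ballot."""
    scores = {option: 0 for option in ballots[0]}
    for ballot in ballots:
        for rank, option in enumerate(ballot):
            scores[option] += len(ballot) - 1 - rank
    return scores

# Five voters. In both profiles, 3 voters prefer A to B and 2 prefer B to A.
two_way   = [("A", "B")] * 3 + [("B", "A")] * 2
three_way = [("A", "B", "C")] * 3 + [("B", "C", "A")] * 2

print(borda(two_way))    # {'A': 3, 'B': 2}         -> A beats B
print(borda(three_way))  # {'A': 6, 'B': 7, 'C': 2} -> now B beats A: IIA is violated

# With cardinal utilities (made-up numbers), aggregate by summing. Adding an
# option C can never change whether the sum for A exceeds the sum for B.
profiles = [{"A": 1.0, "B": 0.2}] * 3 + [{"A": 0.1, "B": 0.9}] * 2
print(sum(p["A"] for p in profiles), sum(p["B"] for p in profiles))  # 3.2 vs 2.4
```

(The cardinal rule here is just summation, which also presupposes that the “natural units” discussed below make comparisons across people meaningful.)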
Now, if you’re an old-school ordinalist about preferences, you might be worried. Ever since Vilfredo Pareto pointed out that cardinal models of a person’s preferences go far beyond our behavioral data and that, as far as we can tell, utility has “no natural units,” some economists have tended to assume that, in our models of human preferences, preferences must be represented ordinally and not cardinally.
But if you’re keeping up with the latest cognitive neuroscience, you might not be quite so worried. It turns out that preferences are encoded cardinally after all, and they do have a natural unit: action potentials per second. With cardinally encoded preferences, we can develop a utility function that represents our preferences and adheres to the common-sense criteria listed above.
Whaddya know? The last decade of cognitive neuroscience has produced a somewhat interesting result concerning the plausibility of CEV.
This post seems confused, or just confusing.
I don’t think there are many people who think that the main problem with aggregating the preferences of different people is ordinal utilities and Arrow’s impossibility theorem. Modern economists tend to think about preferences in the von Neumann-Morgenstern tradition, where one’s preferences are represented as a utility function from outcomes to real numbers, but any two utility functions that are positive affine transformations of each other (u' = a*u + b with a > 0) are equivalent (so really each person’s preferences are represented by an infinite family of utility functions that are all positive affine transformations of each other).
How to aggregate the preferences of individuals with vNM utility functions is still considered an open problem, because there is no obvious “natural” way to combine two such infinite families of utility functions. A given agent might internally represent its preferences using one particular utility function out of the infinite family of equivalent utility functions, but it seems morally indefensible to use that as the basis for aggregating utility.
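To make this point concrete, here is a minimal sketch (invented names and numbers) of why vNM utilities carry no privileged scale for aggregation: rescaling one person’s utility function leaves that person’s preferences unchanged, but changes what a naive “sum the utilities” rule picks.

```python
# Alice and Bob's vNM utilities over two outcomes (numbers are illustrative).
alice = {"X": 1.0, "Y": 0.9}   # Alice mildly prefers X to Y
bob   = {"X": 0.0, "Y": 1.0}   # Bob strongly prefers Y to X

def rescale(u, a, b):
    """Positive affine transformation: represents exactly the same vNM preferences."""
    return {outcome: a * value + b for outcome, value in u.items()}

def naive_sum_winner(*utility_functions):
    totals = {o: sum(u[o] for u in utility_functions) for o in utility_functions[0]}
    return max(totals, key=totals.get)

print(naive_sum_winner(alice, bob))                   # 'Y': Bob's strong preference prevails
print(naive_sum_winner(rescale(alice, 100, 0), bob))  # 'X': same preferences, opposite outcome
```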
Why would there be a unique way to aggregate personal utility functions? It’s just like a belief in “the” objective morality, only one step removed: instead one now looks for “the” way to aggregate personal moralities.
It’s probably naive and psychologically false to imagine that there is an impersonal formula for CEV waiting to be found in the human decision architecture, even when self-idealized. The true nature of human morality is probably something like: self-interest plus sympathy plus plasticity (I don’t mean neural plasticity, just an aspect which makes personal morality changeable and culturally programmable). Plasticity can override self-interest—and so you have someone choosing torture over dust specks—but self-interest can also override plasticity—as occurs in any idealist who burns out.
It should be easy to see why self-interest and plasticity in individuals can produce egalitarian moralities at the level of culture—it’s a lowest-common-denominator answer to the political question “how do we aggregate our preferences? how do we resolve our different aims, how do we prioritize among different people?” But once a person starts thinking about values as an individual, they easily discover reasons to abandon any inculcated collective values from which they obtain no personal benefit, and it’s natural to suppose that this is what would happen if you used any actual human individual as the seed of a CEV calculation. I don’t deny the existence of the sympathy factor, but it requires further analysis. One should distinguish between an interest in the welfare of other people, because they are a source of selfish pleasure for oneself; an interest in the welfare of other people, because of a less selfish or unselfish sympathy for their condition; and even an interest in the welfare of other people, which expresses a preference in a situation where your self-interest is simply not a factor (maybe you’re dying and your remaining actions can only affect how things turn out for others; maybe you’re expressing a preference for “far” outcomes that will never return to affect you).
I find it very plausible that amplifying the rationally renormalized decision architecture of an individual human being to superhuman proportions would produce a value system in which self-preservation has supreme and explicit priority, other people are kept around and looked after for reasons which mostly come down to the enjoyment of the central ego, and any altruistic component is a secondary or tertiary aspect, arising largely from the residual expression of preferences in situations where there’s no direct impact on the central ego’s well-being. If this is the case, then the output of a “democratic” CEV will be culturally contingent—you won’t find it in the parts of the adult human decision architecture which are genetically determined. The “plastic” part of the personal utility function will have to have been set by the egalitarian values of democratic culture, or simply by the egalitarian personal morality of the masterminds of the CEV process.
No, I don’t know how to solve that problem. That goes beyond the scope of this post, which is concerned with Arrow’s theorem. I’ll change the wording that represents most economists as thinking in terms of ordinal preferences, though.
EDIT: For those who want some recent material on the subject, see this paper.
Can you point out the particular paragraph in your neuroscience post you’re referring to? I looked and couldn’t find anything that obviously fit the bill.
I would be very worried that you’re describing something closer to “wanting”, but that CEV should take as input something closer to “approving”.
I couldn’t find the precise paragraph, either. Searching the literature led me to pp. 136-137 of Glimcher’s Foundations of Neuroeconomic Analysis. See his definitions for “subjective value” and “utility.” I imagine this is the source Luke drew the fact from.
I wonder whether this is actually true. The brain is such a complex system that the claims in your neuroscience post, while plausible, also sound like they might be dangerous oversimplifications.
Could you write a paper about this for some peer-reviewed journal? This is a claim that I’d really, really, really want to subject to the full examination of the whole academic world to see whether it checks out in the end. (Also, it seems like a relatively easy way to get an insane number of citations, if the claim gets taken seriously by economists.)
But should we? What about utility monsters?
Even if our current utility functions have some sort of “objective” measurement, can we still be confident that it can be objectively extrapolated? What if the first thing I want to do with the capacity for self-modification is self-modify myself to have the cognitive utility function u_new(observations) = 100 * u_old(observations)? That doesn’t directly change my own decisions at all (assuming it doesn’t burn out my brain), but it suddenly gives me many more “votes” in any collective decision-making.
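A toy version of this worry (my own numbers, just for illustration): the rescaled utility function makes exactly the same choices I would have made anyway, but it swamps everyone else in a utility-summing aggregator.

```python
me     = {"A": 0.6, "B": 0.4}
others = [{"A": 0.1, "B": 0.9}, {"A": 0.2, "B": 0.8}]

u_new = {o: 100 * v for o, v in me.items()}  # self-modification: u_new = 100 * u_old

def best(u):
    return max(u, key=u.get)

def collective_best(utility_functions):
    totals = {o: sum(u[o] for u in utility_functions) for o in utility_functions[0]}
    return max(totals, key=totals.get)

print(best(me), best(u_new))              # 'A' 'A' -> my own decisions are unchanged
print(collective_best([me] + others))     # 'B'     -> the group originally prefers B
print(collective_best([u_new] + others))  # 'A'     -> after rescaling, my vote dominates
```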
Here is what I think about CEV on different levels:
Rational reflection: Currently the best idea on how to survive an AI Singularity and maximize utility for all of humanity.
Personal preferences: I don’t care much to live in a universe where some god-like process has already figured everything out for us, or could in principle but doesn’t do it because we want to do it the slow way.
Intuition: Bullshit, this will never work out. I can’t believe anyone other than science fiction writers are even considering such scenarios.
Reflective equilibrium: Here.
Some people who are fond of the work done by the SIAI might be appalled by my intuition, but I am only being honest here. As you see, on a LW-type rationality level I pretty much agree with the CEV proposal, but that hasn’t been able to change my gut feelings on the subject yet.
If its nature is to follow our preference to avoid spoilers, then in a very real sense it couldn’t in principle do otherwise.
The problem is the knowledge that there exists an oracle that could answer any question, or at least the knowledge to create one, if humanity wanted it to (if it were our preference). That pretty much destroys any curiosity.
Right now I enjoy learning things that are already known, because it makes me more knowledgeable than other people. In a future where there is a process like CEV, that is completely unnecessary, because the only reason people stay stupid is that they want to. Right now there is also the curiosity that comes from knowing that learning will ultimately lead me into unexplored territory. Under CEV any unexplored territory is unexplored by choice. I also enjoy reading science fiction and thinking about the future; under CEV that’s just stupid.
SPOILER ALERT—The following is a quote from the novel Ventus:
This is a rather strong claim—which I believe might quite easily be disproven if we ask fans of novels whether they’re curious about the end of said novels (especially mysteries), even though they have the capacity to flip to the end.
Does that mean that you’d find just as much enjoyment in removing knowledge from the heads of other people?
Is reading Fantasy, and thinking about alternate impossible universe right now “just stupid”? Or studying mythology for that matter?
Is someone playing chess just stupid, because they wouldn’t stand a chance against Kasparov or Deep Blue?
Excellent comment. I would add to this:
Is someone playing chess against Deep Blue stupid, because they could just unplug it?
My preferences are a very delicate issue that can be easily disappointed. I can’t change reality so I simply have to live with the fact that nothing can travel faster than light even though I would love to (maybe not, just an example). The same might be true for CEV, it might be the only option. But I would love to see a different future, one with more adventure and struggle, something more akin to Orion’s Arm.
And no, choosing to just forget about the CEV process wouldn’t be what I wanted either. Just imagine an alien came along offering you its help, you’d hardly ask it to leave and make you forget that it does exist.
No, but I personally don’t like fantasy. Science fiction still has some appearance of plausibility, which would be removed under CEV.
No, I quite enjoy playing games that I know I can’t win. But under CEV that would mainly be a result of choice: if humanity wanted it, then we would all become equal.
This seems similar to the argument made by people about death, “every song ends, but is that any reason not to enjoy the music?” That a song ends doesn’t stop me from listening to music, but knowing that my life would end in a week or 10 years would pretty much destroy any fun that I could have in the remaining time. As I said, my preferences seem to be complex, and CEV could easily destroy the delicate balance.
It would still be awesome to explore nature and play games, of course. But I am pretty sure that people wouldn’t build particle accelerators if they could just “flip to the end” and read about the theory of everything. Or even if they did, a lot of the previous excitement, which came from the possibility of discovering something novel, would be gone. After all, nobody writes papers about figuring out the plot of a fiction story and receives a Nobel Prize for it.
Created a quick video to highlight how I feel about CEV, “Interesting future vs. CEV”.
I don’t think that term means what you think it means. When I try to picture life in a world shaped by CEV, I wind up imagining my new incarnation designing (perhaps as part of a team) visually beautiful ways to harness nearly the full energy of a star.
You seem to assume either that potential knowledge is finite—which technically seems impossible—or else that you could never grow into a mind as smart as the FAI. If the second assumption leads to unhappiness the FAI will discover this fact and try to make sure the premise does not hold.
Related: Amputation of Destiny
Would you mind tabooing the word “preference” and re-writing this post? It’s not clear to me that the research cited in your “crash course” post actually supports what you seem to be claiming here.