Our values are underdefined, changeable, and manipulable
Crossposted at Intelligent Agent Forum.
When asked whether “communist” journalists could report freely from the USA, only 36% of 1950 Americans agreed. A follow-up question about American journalists reporting freely from the USSR got 66% agreement. When the order of the questions was reversed, 90% were in favour of American journalists, and an astounding 73% were in favour of the communist ones.
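To make the size of that order effect concrete, here is a minimal sketch that computes the shift in percentage points for each question. Only the four percentages come from the survey figures above; the dictionary keys and the `order_effect` helper are hypothetical names chosen for illustration.

```python
# Agreement rates (percent) quoted in the text. The key names are
# illustrative assumptions; only the four numbers come from the survey.
AGREEMENT = {
    "communist_reporters_in_usa": {"asked_first": 36, "asked_second": 73},
    "american_reporters_in_ussr": {"asked_first": 90, "asked_second": 66},
}


def order_effect(rates):
    """Absolute shift, in percentage points, caused by question position."""
    return abs(rates["asked_second"] - rates["asked_first"])


for question, rates in AGREEMENT.items():
    print(f"{question}: {order_effect(rates)} point shift")
```

On these figures, the same question moves by 37 points in one case and 24 in the other, purely as a function of which question came first.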
There are many examples of survey responses depending on question order, or subtle issues of phrasing.
So there are people whose answers depended on question order. What then are the “true” values of these individuals?
Underdetermined values
I think the best way of characterising their values is to call them “underdetermined”. There were/are presumably some people for whom universal freedom of the press or strict national security were firm and established values. But for most, there were presumably some soft versions of freedom of the press and nationalism, and the first question triggered one narrative more strongly than the other. What then, are their “real” values? That’s the wrong question, akin to asking if Argentina really won the 1986 World Cup.
Politicians can change the opinions of a large sector of the voting public with a single pronouncement—were the people’s real opinions the ones before, or the ones after? Again, this seems to be the wrong question. But don’t people fret about this inconsistency? I’d wager that they aren’t really aware of this, because people are the most changeable on issues they’ve given the least thought to.
And rationalists and EAs are not immune to this: we presumably don’t shift much on what we identify as our core values, but on less important values, we’re probably as changeable as anyone. But such contingent values can become very strong if attacked, thus becoming a core part of our identity, even if it’s very plausible we could have held the opposite position in a slightly different world.
Frameworks and moral updating
People often rely on a small number of moral frameworks and principles to guide them. When a new moral issue arises, we generally try to fit it into a moral framework, and when there are multiple ones that could fit, we can go in multiple directions, driven by mood, bias, tribalism, and many other contingent factors.
The moral frameworks themselves can and do shift, due to issues like tribalism, cognitive dissonance, life experience, and our own self-analysis. Or the frameworks can accumulate so many exceptions or refinements, that they transform in practice if not in name—it’s very interesting that my leftist opinions agree with Anders Sandberg’s libertarian opinions on most important issues. We seem to have changed positions without changing labels.
Metaethics
In a sense, you could see all of metaethics as the refinement and analysis of these frameworks. There are urges towards simplicity, to get a more stable and elegant system, and towards complexity, to capture the full spectrum of human values. Much of philosophical disagreement can be seen as “Given A, proposition B (generally acceptable conclusion) implies C (controversial position I endorse)”, to which the response is “C is wrong, thus A (or B) is wrong as stated and needs to be refined or denied”—the logic is generally accepted, but which position is kept varies.
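The disagreement pattern described above can be written out schematically. This is only a sketch of the logical form, reading “Given A, B implies C” as a shared conditional; the symbols are the A, B and C of the paragraph.

$$
\begin{aligned}
\text{Shared by both sides:}\quad & (A \land B) \Rightarrow C \\
\text{Proponent:}\quad & A,\; B \;\vdash\; C \\
\text{Critic:}\quad & \neg C \;\vdash\; \neg A \lor \neg B
\end{aligned}
$$

Both inferences are valid given the shared conditional, so the logic alone cannot settle which premise to give up.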
Since ethical disagreements are rarely resolved, it’s likely that the positions of professional philosophers, though more consistent, are also often driven by contingent and random factors. The process is not completely random (ethical ideas that are the least coherent, like the moral foundation of purity, tend to get discarded) but is certainly contingent. As before, I argue you should focus on the procedure P by which philosophers update their opinions, rather than the (hypothetical) R to which P may be supposed to converge.
Most people, however, will not have consistent meta-ethics, as they haven’t considered these questions. So their meta-opinions there will be even more subject to random influences than their base-level opinions.
Future preferences
There is an urgent question dividing the future world: should local FLOOBS be allowed to restrict use of BLARGS, or should ORFOILS instead pressure COLATS to agree to FLAPPLE the SNARFS?
Ok, we don’t currently know what future political issues will be, but it’s clear there will be new issues (how do we know this? Because nobody cares today whether Richard the Lionheart and Philip Augustus of France failed in their feudal duties to each other, nor did the people of that period worry much about medical tort reform). And people will take positions on them, and they will be incorporated into moral frameworks, causing those frameworks to change, and eventually philosophers may incorporate enough change into new metaethical frameworks.
I think it’s fair to say that our current positions on these future issues are even more under-determined than most of our values.
Contingent means manipulable
If our future values are determined by contingent facts, then a sufficiently powerful and intelligent agent can manipulate our values, by manipulating those facts. However, without some sort of learning-processes-with-contingent-facts, our values are underdetermined, and hence an agent that wanted to maximise human values/reward wouldn’t know what to do.
It was this realisation, that the agent could manipulate the values it was supposed to maximise, that caused me to look at ways of avoiding this.
Choices need to be made
We want a safe way to resolve the under-determination in human values, a task that gets more and more difficult as we move away from the usual world of today and into the hypothetical world that a superpowered AI could build.
But, precisely because of the under-determination, there are going to be multiple ways of resolving this safely. Which means that choices will need to be made as to how to do so. The process of making human values fully rigorous is not value-free.
(A minor example, that illustrated for me a tiny part of the challenge: does the way we behave when we’re drunk reveal our true values? And the answer: do you want it to? If there is a divergence in drunk and sober values, then accommodating drunk values is a decision—one that will likely be made sober.)
According to legend, ancient Sumerians, if they made a major decision when sober, would reconsider it while drunk, and vice versa, and would only implement ones that their sober and drunk selves agreed on.
I’m confused why this would be an example of a wrong question. It seems like a perfectly straightforward one, with a straightforward answer (https://en.wikipedia.org/wiki/1986_FIFA_World_Cup#Final).
See the link in the post.
This means you are trying to Procrustes the human squishiness into legibility, with consistent values. You should, instead, be trying to make pragmatic AIs that would frame the world for the humans, in the ways that the humans would approve*, taking into account their objectively stupid incoherence. Because that would be Friendly and parsed as such by the humans.
*=this doesn’t mean that such human preferences as those that violate meta-universalizability from behind the veil of ignorance should not be factored out of the calculation of what is ethically relevant; but it means that the states of the world that violate those preferences should still be hidden from the humans who have those preferences. This obviously results in humans being allowed to more accurately see the states of the world, the more their preferences are tolerant of other people’s preferences; there is absolutely nothing that could possibly ever go wrong from this, considering that the AIs, being Friendly, would simply prevent them from sociopathically exploiting that information asymmetry since that would violate the ethical principle.
>pragmatic AIs that would frame the world for the humans, in the ways that the humans would approve
The choice of how to do that is equivalent to choosing among the human values. That’s not to say that there are not better or worse ways of doing things, but as soon as human behaviour becomes legible to an AI, we have to be very specific about any squishiness we want to preserve, and encode it in AI values.
I agree with most of this post, and in fact my recent posts (at my blog, not here) imply something similar. But I think there is a mistaken idea in this particular statement: “We want a safe way to resolve the under-determination in human values, a task that gets more and more difficult as we move away from the usual world of today and into the hypothetical world that a superpowered AI could build.”
It looks like you are saying that we need a way to make sure that the future, however distant, will always be somewhat acceptable to current humans. But this is impossible in principle, given the fact that things are tending towards the heat death of the universe. What we actually should want is that the universe should move at any particular time towards things that the beings in existence value at that time. Obviously creatures in the future will have different values, and given a long enough time period, a future will therefore come into existence that we as we are would have no particular interest in. But we also should have no particular interest in preventing it from coming into being; that interest comes from a positively unreasonable extrapolation of your current interests.