Can there be an indescribable hellworld?
Can there be an indescribable hellworld? What about an un-summarisable one?
By hellworld, I mean a world of very low value according to our value scales—maybe one where large numbers of simulations are being tortured (aka mind crimes).
A hellworld could look superficially positive, if we don’t dig too deep. It could look irresistibly positive.
Could it be bad in a way that we would find indescribable? It seems that it must be possible. The set of things that can be described to us is finite; the set of things that can be described to us without fundamentally changing our values is much smaller still. If a powerful AI was motivated to build a hellworld such that the hellish parts of it were too complex to be described to us, it would seem that it could. There is no reason to suspect that the set of indescribable worlds contains only good worlds.
Can it always be summarised?
Let’s change the setting a bit. We have a world W, and a powerful AI A that is giving us information about W. The AI A is aligned/friendly/corrigible or whatever we need it to be. It’s also trustworthy, in that it always speaks to us in a way that increases our understanding.
Then if W is an indescribable hellworld, can A summarise that fact for us?
It seems that it can. In the very trivial sense, it can, by just telling us “it’s an indescribable hellworld”. But it seems it can do more than that, in a way that’s philosophically interesting.
A hellworld is ultimately a world that is against our values. However, our values are underdefined and changeable. So to have any chance of saying what these values are, we need to either extract key invariant values, synthesise our contradictory values into some complete whole, or use some extrapolation procedure (eg CEV). In any case, there is a procedure for establishing our values (or else the very concept of “hellworld” makes no sense).
Now, it is possible that our values themselves may be indescribable to us now (especially in the case of extrapolations). But A can at least tell us that W is against our values, and provide some description as to the value it is against, and what part of the procedure ended up giving us that value. This does give us some partial understanding of why the hellworld is bad—a useful summary, if you want.
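To make the shape of this argument concrete, here is a minimal toy sketch in Python (illustrative only, not anything from the post; names like `toy_value_procedure`, `Value` and `Violation` are invented): some procedure turns human data into values, and the trustworthy AI reports which of those values W violates and which step of the procedure produced them.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Value:
    description: str                      # e.g. "no minds are tortured"
    provenance: str                       # which step of the procedure produced it
    satisfied_by: Callable[[Dict], bool]  # crude check against a toy world model

@dataclass
class Violation:
    value_description: str
    procedure_step: str

def toy_value_procedure(human_data: Dict) -> List[Value]:
    """Stand-in for 'extract invariants / synthesise / extrapolate (CEV-style)'."""
    return [
        Value("no minds are tortured", "invariant extraction",
              lambda w: w.get("tortured_minds", 0) == 0),
        Value("inhabitants reflectively endorse their situation", "extrapolation",
              lambda w: w.get("reflective_endorsement", 0.0) > 0.5),
    ]

def summarise(world: Dict, procedure, human_data: Dict) -> List[Violation]:
    """What the trustworthy AI is asked to do: even if the world's badness
    can't be fully described, report which derived values it is against and
    which part of the value-finding procedure gave us those values."""
    return [Violation(v.description, v.provenance)
            for v in procedure(human_data)
            if not v.satisfied_by(world)]

if __name__ == "__main__":
    hellworld = {"tortured_minds": 10**9, "reflective_endorsement": 0.9}
    for viol in summarise(hellworld, toy_value_procedure, human_data={}):
        print(f"Against value '{viol.value_description}' (from: {viol.procedure_step})")
```

The point of the sketch is only that the summary is partial: it names a violated value and its provenance, without having to describe the world in full.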
On a more meta level, imagine the contrary—that W was a hellworld, but the superintelligent agent could not indicate what human values it actually violated, even approximately. Since our values are not some ex nihilo thing floating in space, but derived from us, it is hard to see how something could be against our values in a way that could never be summarised to us. That seems almost definitionally impossible: if the violation of our values can never be summarised, even at the meta level, how can it be a violation of our values?
Trustworthy debate is FAI complete
The consequence seems to be that we can avoid hellworlds (and, presumably, aim for heaven) by having a corrigible and trustworthy AI that engages in debate or acts as a devil’s advocate. Now, I’m very sceptical of getting corrigible or trustworthy AIs in general, but it seems that if we can, we’ve probably solved the FAI problem.
Note that even in the absence of a single given way of formalising our values, the AI could list the plausible formalisations for which W was or wasn’t a hellworld.
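As a toy illustration of that note (again with invented names, not from the post; `Formalisation` and `verdicts` are hypothetical): the AI can simply return a per-formalisation verdict on W.

```python
from typing import Callable, Dict

# A candidate formalisation maps a toy world model to a verdict:
# True means "W is a hellworld under this formalisation".
Formalisation = Callable[[Dict], bool]

def verdicts(world: Dict, formalisations: Dict[str, Formalisation]) -> Dict[str, bool]:
    """Report, per plausible formalisation of our values, whether W comes out
    as a hellworld -- no single agreed formalisation is needed."""
    return {name: is_hell(world) for name, is_hell in formalisations.items()}

if __name__ == "__main__":
    candidates = {
        "suffering-focused": lambda w: w.get("suffering", 0.0) > 0.9,
        "preference-based":  lambda w: w.get("preference_satisfaction", 1.0) < 0.1,
    }
    print(verdicts({"suffering": 0.95, "preference_satisfaction": 0.8}, candidates))
```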
It feels worth distinguishing between two cases of “hellworld”:
1. A world which is not aligned with the values of that world’s inhabitants themselves. One could argue that in order to merit the designation “hellworld”, the world has to be out of alignment with the values of its inhabitants in such a way as to cause suffering. Assuming that we can come up with a reasonable definition of suffering, then detecting these kinds of worlds seems relatively straightforward: we can check whether they contain immense amounts of suffering.
2. A world whose inhabitants do not suffer, but which we might consider hellish according to our values. For example, something like a Brave New World scenario, where people generally consider themselves happy but where that happiness comes at the cost of suppressing individuality and promoting superficial pleasures.
It’s for detecting an instance of the second case that we need to understand our values better. But it’s not clear to me that such a world should qualify as a “hellworld”, which to me sounds like a world with negative value. While I don’t find the notion of being the inhabitant of a Brave New World particularly appealing, a world where most people are happy but only in a superficial way sounds more like “overall low positive value” than “negative value” to me. Assuming that you’ve internalized its values and norms, existing in a BNW doesn’t seem like a fate worse than death; it just sounds like a future that could have gone better.
Of course, there is an argument that even if a BNW would be okay to its inhabitants once we got there, getting there might cause a lot of suffering: for instance, if there were lots of people who were forced against their will to adapt to the system. Since many of us might find the BNW to be a fate worse than death, conditional on us surviving to live in it, it’s a hellworld (at least to us). But again this doesn’t seem to require a thorough understanding of our values to detect: it just requires detecting the fact that if we survive to live in the BNW, we will experience a lot of suffering due to being in a world which is contrary to our values.
Checking whether there is a large amount of suffering in a deliberately obfuscated world seems hard, or impossible if a superintelligence has done the obfuscating.
True, not disputing that. Only saying that it seems like an easier problem than solving human values first, and then checking whether those values are satisfied.
What’s the evidence for this set being “much smaller”?
Can you imagine sitting through a ten-year lecture without your values changing at all? Without them changing at least somewhat in reaction to the content?
This seems like it would mainly affect instrumental values rather than terminal ones.
In many areas, we have no terminal values until the problem is presented to us; then we develop terminal values (often dependent on how the problem was phrased) and stick to them. E.g. the example of Soviet and American journalists visiting each other’s countries.
I think a little more formal definition of “describable” and “summarizable” would help me understand. I start with a belief that any world is its own best model, so I don’t think worlds are describable in full. I may be wrong—world-descriptions may compress incredibly well, and it’s possible to describe a world in another world. But fully describing a world inside a subset of that world itself cannot be done.
“summarizable” is more interesting. If it just means “a trustworthy valuation”, then fine—it’s possible, and devolves to “is there any such thing as a trustworthy summarizer”. If it means some other subset of a description, then it may or may not be possible.
Thinking about other domains: a proof is a summary of (an aspect of) a formal system. It provides a different level of information than is contained in the base axioms. Can we model a “summary” of the suffering level/ratio/hellishness of a world in the same terms? It’s not about trusting the agent; it’s about the agent finding the subset of information about the world that shows us that the result is true.
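One possible way to make the proof analogy concrete (purely a gloss on this comment; `verify_summary` and the fact names are made up): treat a summary as a certificate, a small subset of facts about the world that we can check for ourselves, so the trust shifts from the agent to the evidence it cites.

```python
from typing import Dict, List, Tuple

Fact = Tuple[str, float]  # (name of a fact about the world, its value)

def verify_summary(claim_threshold: float,
                   certificate: List[Fact],
                   full_world: Dict[str, float]) -> bool:
    """Accept the claim 'suffering in this world exceeds claim_threshold'
    iff (a) every cited fact really holds in the world, and (b) the cited
    facts alone are enough to establish the claim -- analogous to checking
    a proof rather than trusting the prover."""
    cited_ok = all(full_world.get(name) == value for name, value in certificate)
    implied_suffering = sum(value for _, value in certificate)
    return cited_ok and implied_suffering > claim_threshold

if __name__ == "__main__":
    world = {"sim_cluster_A_pain": 0.7, "sim_cluster_B_pain": 0.6, "nice_parks": 0.9}
    cert = [("sim_cluster_A_pain", 0.7), ("sim_cluster_B_pain", 0.6)]
    print(verify_summary(claim_threshold=1.0, certificate=cert, full_world=world))
```

Run as written, the verifier accepts because the two cited facts check out against the world and already add up past the threshold; nothing else about the world needs to be described.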
Maybe the right question here is: is it possible to create ever stronger qualia of pain, or is the level of pain limited?
If the maximum level of pain is limited to, say, 10 out of 10, then an evil AI has to create complex worlds, as in the story “I Have No Mouth, and I Must Scream”, trying to affect many of our values in the most unpleasant combination, that is, playing anti-music by pressing on different values.
If there is no limit to the possible intensity of pain, the evil AI will instead invest in upgrading the human brain so that it can feel more and more pain. In that case there will be no complexity, just growing intensity. One can see this type of hell in the ending of the latest Lars von Trier movie, “The House That Jack Built”. This type of hell is more disturbing to me.
In the Middle Ages the art of torture existed, and this distinction also existed: some tortures were sophisticated, while others were simple but extremely intense, like the testicle torture.
But you seem to have described these hells quite well—enough for us to clearly rule them out.
I don’t understand why you are ruling them out completely: at least at the personal level, long and intense suffering does exist, and it happened en masse in the past (cancer patients, concentration camps, witch hunts).
I suggested two different arguments against s-risks:
1) Anthropic: s-risks are not the dominant type of experience in the universe, or we would already be in one.
2) Larger AIs could “save” minds from smaller but evil AIs by creating many copies of such minds and thus creating indexical uncertainty (detailed explanation here), as well as punish copies of such evil AIs for this, thus discouraging any AI from implementing s-risks.
The question of this post is whether there exist indescribable hellworlds—worlds that are bad, but where it cannot be explained to humans how/why they are bad.
Yes, I probably understood “indescribable” as a synonym of “very intense”, not as literally “can’t be described”.
But now I have one more idea about a really “indescribable hellworld”: imagine that there is a quale of suffering which is infinitely worse than anything any living being has ever felt on Earth, and it appears in some hellworld, but only in animals or in humans who can’t speak (young children, patients just before death), or its intensity paralyses the ability to speak and it also can’t be remembered (I have read historical cases of pain so intense that the person was unable to provide very important information).
So this hellworld will look almost like our normal world: animals live and die, people live normal and (on time-average) happy lives and also die. But a counterfactual observer able to feel the qualia of any living being would find it infinitely more hellish than our world.
We could also be living in such a hellworld now without knowing it.
The main reason it can’t be described is that most people don’t believe in qualia, and the observable characteristics of this world would not be hellish. Beings in such a world could also be called reverse p-zombies, as they have a much stronger capability for “experiencing” than ordinary humans.
Indeed. But you’ve just described it to us ^_^
What I’m mainly asking is “if we end up in world W, and no honest AI can describe to us how this might be a hellworld, is it automatically not a hellworld?”
It looks like examples are not working here, as any example is an explanation, so it doesn’t count :)
But in some sense it could be similar to Gödel’s theorem: there are true propositions which can’t be proved by the AI (and an explanation could be counted as a type of proof).
Ok, another example: there are bad pieces of art; I know they are bad, but I can’t explain why in formal language.
That’s what I’m fearing, so I’m trying to see if the concept makes sense.
Do you think you’d agree with a claim of this form applied to corrigibility of plans/policies/actions?
That is: if some plan/policy/action is incorrigible, then A can provide some description of how it is incorrigible.
Given some definition of corrigibility, yes.