Can there be an indescribable hellworld?
Can there be an indescribable hellworld? What about an un-summarisable one?
By hellworld, I mean a world of very low value according to our value scales; maybe one where a large number of simulations are being tortured (i.e. mind crimes are being committed).
A hellworld could look superficially positive, if we don’t dig too deep. It could look irresistibly positive.
Could it be bad in a way that we would find indescribable? It seems that it must be possible. The set of things that can be described to us is finite; the set of things that can be described to us without fundamentally changing our values is much smaller still. If a powerful AI were motivated to build a hellworld whose hellish parts were too complex to be described to us, it seems that it could. There is no reason to suspect that the set of indescribable worlds contains only good worlds.
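To make the counting point slightly more formal (a sketch only; the bound $N$ on description length and the alphabet $\Sigma$ are illustrative assumptions, not claims about human comprehension): anything that can be described to us is a string of bounded length over a finite alphabet, so

$$\#\{\text{worlds describable to us}\} \;\le\; \sum_{k=0}^{N} |\Sigma|^k \;=\; \frac{|\Sigma|^{N+1}-1}{|\Sigma|-1} \;<\; \infty,$$

while the space of worlds a powerful optimiser could construct is vastly larger, so almost all of them, bad ones included, fall outside the describable set.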
Can it always be summarised?
Let’s change the setting a bit. We have a world W, and a powerful AI A that is giving us information about W. A is aligned/friendly/corrigible or whatever we need it to be. It’s also trustworthy, in that it always speaks to us in a way that increases our understanding.
Then if W is an indescribable hellworld, can A summarise that fact for us?
It seems that it can. In a very trivial sense it can, just by telling us “it’s an indescribable hellworld”. But it seems it can do more than that, in a way that’s philosophically interesting.
A hellworld is ultimately a world that is against our values. However, our values are underdefined and changeable. So to have any chance of saying what these values are, we need to either extract key invariant values, synthesise our contradictory values into some complete whole, or use some extrapolation procedure (e.g. CEV). In any case, there is some procedure for establishing our values (or else the very concept of “hellworld” makes no sense).
Now, it is possible that our values themselves may be indescribable to us now (especially in the case of extrapolations). But A can at least tell us that W is against our values, and provide some description of the value it violates and of what part of the procedure ended up giving us that value. This does give us some partial understanding of why the hellworld is bad: a useful summary, if you will.
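To make that concrete, here is a hypothetical sketch of what such a partial summary might contain; the class, field names and example values are purely illustrative, not a proposed format:

```python
from dataclasses import dataclass

# Hypothetical sketch of the "partial summary" A might give us about W.
# Nothing here is a real format; the fields just mirror the argument above.
@dataclass
class HellworldSummary:
    world: str                 # which world is being assessed
    violated_value: str        # the (possibly extrapolated) value W is against
    procedure_step: str        # which part of the value-establishing procedure produced that value
    fully_describable: bool    # False for an indescribable hellworld
    gist: str                  # A's best human-level approximation of why W is bad

example = HellworldSummary(
    world="W",
    violated_value="no simulated minds are tortured",
    procedure_step="extrapolation of current empathy norms (CEV-style)",
    fully_describable=False,
    gist="large-scale mind crime hidden behind a superficially positive surface",
)
```

Even when the gist is a crude approximation, it is the non-empty violated value and procedure step that make the summary more than just “trust me, it’s bad”.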
On a more meta level, imagine the contrary: that W was a hellworld, but the superintelligent agent A could not indicate what human values it actually violated, even approximately. Since our values are not some ex nihilo thing floating in space, but are derived from us, it is hard to see how something could be against our values in a way that could never be summarised to us. That seems almost definitionally impossible: if the violation of our values can never be summarised, even at the meta level, how can it be a violation of our values?
Trustworthy debate is FAI-complete
The consequence of this seems to be that we can avoid hellworlds (and, presumably, aim for heaven) by having a corrigible and trustworthy AI that engages in debate or acts as a devil’s advocate. Now, I’m very sceptical of getting corrigible or trustworthy AIs in general, but it seems that if we can, we’ve probably solved the FAI problem.
Note that even in the absence of a single given way of formalising our values, the AI could list the plausible formalisations for which W was or wasn’t a hellworld.
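If we did get such an AI, the proposal can be phrased as a very small protocol. A minimal sketch, assuming hypothetical advocate, devil’s-advocate and judge callables; neither the names nor the list of formalisations are meant as real specifications:

```python
def debate(world, advocate, devils_advocate, judge,
           formalisations=("invariant values",
                           "synthesised values",
                           "CEV-style extrapolation"),
           rounds=3):
    """Return the human judge's verdict on `world` ("accept" or "veto").

    `advocate` argues that the world is acceptable; `devils_advocate`
    argues that it is a hellworld. Both are assumed trustworthy: every
    statement must increase the judge's understanding rather than
    manipulate them. Arguments are made separately under each plausible
    formalisation of our values, since there may be no single agreed one.
    """
    transcript = []
    for _ in range(rounds):
        for formalisation in formalisations:
            transcript.append(advocate(world, formalisation, transcript))
            transcript.append(devils_advocate(world, formalisation, transcript))
    return judge(transcript)
```

The loop itself is trivial; all the difficulty is hidden in the trustworthiness assumption on the two arguing agents, which is why this is FAI-complete rather than a shortcut around the problem.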