Re: Yudkowsky-Christiano-Ngo debate
Trying to reach toward a key point of disagreement.
Eliezer seems to have an intuition that intelligence will, by default, converge toward being a coherent intelligence (i.e. one with a utility function and a sensible decision theory). He also seems to think that, conditional on a pivotal act being performed, it’s very likely to have been performed by a coherent intelligence, and thus that it’s worth spending most of our effort on the assumption that it will be coherent.
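To make the “coherent” part concrete, here is a toy money-pump sketch (my own illustration, not something from the dialogue; the preferences, fee, and the `money_pump` helper are all made up). The standard coherence intuition is that an agent whose choices can’t be summarized by a utility function can be walked around a preference cycle and charged at every step:

```python
# Toy illustration: an agent with cyclic preferences A < B < C < A happily pays
# a small fee for each trade around the loop, ending where it started but poorer.
# An agent maximizing a fixed utility function can't be exploited this way.

# Hypothetical cyclic preferences: given (currently held, offered), which does it pick?
CYCLIC_PREFERS = {("A", "B"): "B", ("B", "C"): "C", ("C", "A"): "A"}

def money_pump(start: str, laps: int = 1, fee: float = 1.0) -> float:
    """Walk the incoherent agent around the preference cycle, charging per accepted trade."""
    cycle = ["A", "B", "C"]
    holding, paid = start, 0.0
    for _ in range(3 * laps):
        offered = cycle[(cycle.index(holding) + 1) % 3]
        if CYCLIC_PREFERS.get((holding, offered)) == offered:
            holding, paid = offered, paid + fee  # it prefers the swap, so it pays
    return paid

print(money_pump("A", laps=2))  # pays 6.0 and ends up holding "A" again
```

A coherent agent with, say, U(A)=1, U(B)=2, U(C)=3 would trade up to C once and then refuse further trades; as I understand the coherence-theorem intuition, anything this exploitable tends to get outcompeted or to iron the leak out of itself.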
Paul and Richard seem to have an intuition that since humans are pretty intelligent without being particularly coherent, it should be possible to build a superintelligence that is not trying to be very coherent, and that could therefore be guided toward performing a pivotal act.
Eliezer might respond that to the extent that any intelligence is capable of accomplishing anything, it’s because it is (approximately) coherent over an important subdomain of the problem. I’ll call this the “domain of coherence”. Eliezer might say that a pivotal act requires a domain of coherence that spans pretty much everything, encompassing dangerous domains such as people, the agent itself, and power structures. Corrigibility seems to interfere with coherence, which makes it very difficult to design anything that is corrigible over this domain without neutering it.
From the inside, it’s easy to imagine having my intelligence vastly increased while still being able and willing to incoherently follow deontological rules, such as Actually Stopping what I’m doing if a button is pressed. But I think I might be treating “intelligence” as a bit of a black box, as though I could still feel pretty much the same. However, to the extent that I feel pretty much the same, I’m not actually thinking with the strategic depth necessary to perform a pivotal act. To properly imagine thinking with that much strategic depth, I need to imagine being able to see clearly through people and power structures. What feels like my willingness to respond to a shutdown button would slide into an attitude of “okay, well I just won’t do anything that would make them need to stop me”, and then into “oh, I see exactly under what conditions they would push the button, and I can easily adapt my actions to avoid making them push it”, to the point where the button no longer meaningfully constrains me. From the outside view, this looks very much like me becoming coherent w.r.t. the shutdown button, even if I’m still very much committed to responding incoherently in the (now extremely unlikely) event that it is pushed.
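Here’s a toy numerical sketch of that slide (entirely my own: the plan names, payoffs, and button-press probabilities are hypothetical). The hard rule “Actually Stop if the button is pressed” is obeyed in every branch, but as the agent’s model of the overseer improves, it steers toward plans the overseer wouldn’t stop, and the rule almost never fires:

```python
import random

# Hypothetical numbers: three candidate plans, their payoff to the agent, and the
# true probability that the overseer presses the stop button in response to each.
PAYOFF = {"transparent_plan": 8.0, "opaque_plan": 9.0, "reckless_plan": 10.0}
TRUE_PRESS_PROB = {"transparent_plan": 0.02, "opaque_plan": 0.60, "reckless_plan": 0.95}

def predicted_press_prob(plan: str, foresight: float) -> float:
    """Blend an ignorant 50/50 prior with the true press probability.
    foresight=0: no model of the overseer; foresight=1: sees right through them."""
    return (1 - foresight) * 0.5 + foresight * TRUE_PRESS_PROB[plan]

def choose_plan(foresight: float) -> str:
    # Being stopped is worth 0 to the agent, so expected utility = (1 - p_press) * payoff.
    return max(PAYOFF, key=lambda p: (1 - predicted_press_prob(p, foresight)) * PAYOFF[p])

def run_episode(foresight: float) -> str:
    plan = choose_plan(foresight)
    if random.random() < TRUE_PRESS_PROB[plan]:
        return f"{plan}: button pressed -> agent Actually Stops"  # the rule is always honoured
    return f"{plan}: button never pressed -> the rule never binds"

if __name__ == "__main__":
    random.seed(0)
    for foresight in (0.0, 0.5, 1.0):
        print(f"foresight={foresight}: {run_episode(foresight)}")
```

With low foresight the agent blunders into plans that get it stopped; with high foresight it picks the plan the overseer wouldn’t object to, so the stop rule does no work even though it’s never violated. That’s the sense in which, from the outside, the agent has become coherent with respect to the button.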
And I think that Eliezer foresees pretty much any assumption of incoherence that we could bake in becoming irrelevant in much the same way, for any general intelligence that could perform a pivotal act. So it’s not safe to rely on any incoherence on the part of the AGI.
Sorry if I misconstrued anyone’s views here!