I see MIRI’s research on agent foundations (including embedded agency) as something like “We want to understand ${an aspect of how agents should work}, so let’s take the simplest case first and see if we understand everything about it. The simplest case is the case when the agent is nearly omniscient and knows all logical consequences. Hmm, we can’t figure out even this simplest case yet: it breaks down if the conditions are sufficiently weird”. Since it turns out that it’s difficult to understand embedded agency even in such simple cases, it seems plausible that an AI trained to understand embedded agency by a naive learning procedure (similar to evolution) will break down under sufficiently weird conditions.
Why don’t these arguments apply to humans? Evolution didn’t understand embedded agency, but managed to create humans who seem to do okay at being embedded agents.
(I buy this as an argument that an AI system needs to not ignore the fact that it is embedded, but I don’t buy it as an argument that we need to be deconfused about embedded agency.)
Hmm, that’s a very good argument. Since I think humans have only an imperfect understanding of embedded agency, thanks to you I no longer think that “If we build an AI without understanding embedded agency, and that AI builds a new AI, that new AI also won’t understand embedded agency”: if that were true, then (since we ourselves don’t fully understand embedded agency) it would imply we can’t get the “lived happily ever after” at all. We can ignore the case where we can’t get the “lived happily ever after” at all, because in that case nothing matters anyway.
I suppose we could run an evolutionary search or something, selecting for AIs which can understand the typical cases of being modified by themselves or by the environment, i.e. the cases we include in the training dataset. I wonder how we can make such an AI understand very atypical cases of modification; a near-omnipotent AI will be a very atypical case.
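To make that worry concrete, here is a minimal toy sketch of the kind of evolutionary search I have in mind, under the assumption that we can score an agent on hand-written modification scenarios. Everything in it (Agent, score_on_scenario, the scenario format) is an invented placeholder rather than any real training setup; the point is just that the selection pressure comes entirely from the typical cases we thought to include, so nothing constrains behaviour on atypical cases like near-omnipotence.

```python
import random

# Toy sketch of "evolutionary search over modification scenarios".
# All names here are invented placeholders for illustration only.

class Agent:
    """Stands in for a candidate AI; its 'policy' is just a vector of floats."""
    def __init__(self, params):
        self.params = params

def score_on_scenario(agent, scenario):
    """Placeholder fitness for one *typical* case of the agent being modified
    by itself or by the environment: here, just closeness of the agent's
    parameters to whatever the scenario demands (a stand-in, nothing more)."""
    target = scenario["target_params"]
    return -sum((p - t) ** 2 for p, t in zip(agent.params, target))

def mutate(agent, noise=0.1):
    """Return a slightly perturbed copy of an agent."""
    return Agent([p + random.gauss(0.0, noise) for p in agent.params])

def evolutionary_search(scenarios, dim=4, population_size=100, generations=50):
    """Select for agents that do well on the typical scenarios in the training
    set. Note the worry from the discussion: nothing here constrains behaviour
    on atypical cases (e.g. near-omnipotence) absent from `scenarios`."""
    population = [Agent([random.gauss(0.0, 1.0) for _ in range(dim)])
                  for _ in range(population_size)]
    for _ in range(generations):
        # Fitness = total score over the typical cases we thought to include.
        population.sort(
            key=lambda a: sum(score_on_scenario(a, s) for s in scenarios),
            reverse=True)
        survivors = population[:population_size // 2]
        population = survivors + [mutate(a) for a in survivors]
    return population[0]

# Two hand-written "typical" scenarios; the selected agent may still behave
# arbitrarily badly on anything outside this distribution.
scenarios = [{"target_params": [0.0, 1.0, 0.0, 1.0]},
             {"target_params": [0.5, 0.5, 0.5, 0.5]}]
best = evolutionary_search(scenarios)
```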
Can we come up with a learning procedure to have the AI learn embedded agency on its own? It seems plausible to me that we will need to understand embedded agency better to do this, but I don’t really know.
Btw, in another comment, you say:
But usually when LessWrongers argue against “good enough” alignment, they’re arguing against alignment methods, saying that “nothing except proofs” will work, because only proofs give near-100% confidence.
I basically subscribe to the argument that nothing except proofs will work in the case of superintelligent agentic AI.