I basically mean the third scenario:

The AI (or multiple AIs) is not near-omnipotent yet, but it already controls most of humanity’s resources and makes most of the decisions, so it, rather than humans, does the AI research.
I agree that you still need a strong guarantee of alignment in this scenario (as I mentioned in my original comment).
On the contrary, I think it’s very important. I worry about it mainly for three reasons. Suppose we don’t figure out embedded agency. Then [...]
Why don’t these arguments apply to humans? Evolution didn’t understand embedded agency, but managed to create humans who seem to do okay at being embedded agents.
(I buy this as an argument that an AI system needs to not ignore the fact that it is embedded, but I don’t buy it as an argument that we need to be deconfused about embedded agency.)
I think that if one man-year of type-1 research produces 1 unit of superintelligent AI alignment, then one man-year of type-2 research produces about 0.15 units of superintelligent AI alignment.
Cool, that’s more concrete, thanks. (I disagree, but there isn’t really an obvious point to argue on; the cruxes are in the other points.)
About neglectedness: I think type-2 research is less neglected than type-1 and type-3, and will remain less neglected over the next 10 years or so, because [...]
Agreed. Tbc, I wasn’t arguing it was neglected, just that you seemed to be ignoring tractability and neglectedness, which seemed like a mistake.
I see MIRI’s research on agent foundations (including embedded agency) as something like: “We want to understand ${an aspect of how agents should work}, so let’s take the simplest case first and see if we understand everything about it. The simplest case is an agent that is nearly omniscient and knows all logical consequences. Hmm, we can’t even figure out this simplest case yet; it breaks down once the conditions get sufficiently weird.” Since it turns out to be difficult to understand embedded agency even in such simple cases, it seems plausible that an AI trained to understand embedded agency by a naive learning procedure (similar to evolution) will break down under sufficiently weird conditions.
Why don’t these arguments apply to humans? Evolution didn’t understand embedded agency, but managed to create humans who seem to do okay at being embedded agents.
(I buy this as an argument that an AI system needs to not ignore the fact that it is embedded, but I don’t buy it as an argument that we need to be deconfused about embedded agency.)
Hmm, that’s a very good argument. Since humans have only an imperfect understanding of embedded agency, thanks to you I no longer think that “if we build an AI without understanding embedded agency, then any new AI it builds also won’t understand embedded agency.” If that were true, we couldn’t get the “lived happily ever after” outcome at all, and we can ignore the case where that outcome is unreachable, because nothing matters in it anyway.
I suppose we could run an evolutionary search or something, selecting for AIs that understand the typical cases of being modified by themselves or by the environment, i.e., the cases we include in the training dataset. I wonder how we can make such an AI understand very atypical cases of modification; a near-omnipotent AI will be a very atypical case.
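To make that concrete, here’s a toy sketch of the kind of selection loop I have in mind. Everything in it is a made-up placeholder (the “agents” are just parameter vectors, the “modification scenarios” are just numbers, and score and mutate are invented for illustration), not a proposal; the point is only that all of the selection pressure comes from the scenarios we happened to put in the dataset.

```python
import random

# Toy stand-ins (all hypothetical): an "agent" is just a list of parameters,
# a "modification scenario" is just a number, and score() is a made-up
# measure of how well the agent handles being modified in that scenario.

def score(agent, scenario):
    # Higher is better; a toy proxy for "handles this modification well".
    return -abs(sum(agent) - scenario)

def mutate(agent):
    # Small random perturbation of each parameter.
    return [p + random.gauss(0, 0.1) for p in agent]

def evolve(typical_modification_scenarios, generations=200,
           population_size=50, survivors=10):
    # Random initial population of parameter vectors.
    population = [[random.gauss(0, 1) for _ in range(4)]
                  for _ in range(population_size)]

    def fitness(agent):
        # Fitness is measured ONLY on the scenarios we thought to include;
        # atypical modifications (say, by a near-omnipotent successor) are
        # never tested, so nothing selects for handling them well.
        return sum(score(agent, s) for s in typical_modification_scenarios)

    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[:survivors]
        children = [mutate(random.choice(parents))
                    for _ in range(population_size - survivors)]
        population = parents + children  # elitism: keep the parents around
    return max(population, key=fitness)

typical_cases = [0.5, 1.0, 1.5, 2.0]   # the modifications we anticipated
best = evolve(typical_cases)
print(score(best, 1.0))    # decent: this case was selected for
print(score(best, 100.0))  # terrible: no selection pressure ever touched this
```

An agent selected this way can do fine on every scenario in typical_cases while the loop says nothing about how it behaves on a case like the last print.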
Can we come up with a learning procedure to have the AI learn embedded agency on its own? It seems plausible to me that we will need to understand embedded agency better to do this, but I don’t really know.
Btw, in another comment, you say:
But usually when LessWrongers argue against “good enough” alignment, they’re arguing against alignment methods, saying that “nothing except proofs” will work, because only proofs give near-100% confidence.
I basically subscribe to the argument that nothing except proofs will work in the case of superintelligent agentic AI.