(Rambling + confused, I’m trying to understand this post)
It seems like all of this requires the assumption that our agent has a small probability of failure on any given input. If there are some questions on which the agent is very likely to fail, then this scheme actually hurts us, amplifying the failure probability on those questions. Ah, right, that's the problem security amplification is meant to address. So really, the point of reliability amplification is to decrease the chance that the agent becomes incorrigible, which is a property of the agent's "motivational system" that doesn't depend on particular inputs. And if any part of a deliberation tree is computed incorrigibly, then the output of the whole deliberation is incorrigible, which is why capability amplification amplifies the failure probability.
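To make that last point concrete, here's a crude back-of-the-envelope sketch with made-up numbers (and a cruder model than the post's: I'm treating failures as independent and modeling reliability amplification as simple majority voting over copies):

```python
# Back-of-the-envelope sketch (hypothetical numbers, simplified model):
# how a per-node failure probability compounds across a deliberation tree,
# and how majority-voting independent copies can push it back down.
from math import comb

def tree_failure_prob(eps: float, n_nodes: int) -> float:
    """P(at least one of n_nodes independent nodes fails), each failing w.p. eps."""
    return 1 - (1 - eps) ** n_nodes

def majority_vote_failure_prob(eps: float, k: int) -> float:
    """P(a majority of k independent copies fail), each failing w.p. eps (k odd)."""
    return sum(comb(k, i) * eps**i * (1 - eps) ** (k - i)
               for i in range(k // 2 + 1, k + 1))

eps = 0.01  # hypothetical per-node failure probability
print(tree_failure_prob(eps, 1000))        # ~0.99996: amplification compounds failures
print(majority_vote_failure_prob(eps, 5))  # ~1e-5: voting suppresses them
```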
When I phrase it this way, it seems like this is another line of defense that’s protecting against the same thing as techniques for optimizing worst-case performance. Do you agree that if those techniques work “perfectly” then there’s no need for reliability amplification?
This is an interesting failure model though: how does incorrigibility arise such that it is all-or-nothing and doesn't depend on the input? Why aren't there inputs that almost always cause our agent to become incorrigible? I suppose the answer is that we'll start with an agent that only ever sees inputs small enough that it stays corrigible, and our capability amplification procedure will ensure that corrigibility is preserved.
But then in that case why is there a failure probability at all? That assumption is strong enough to say that the agent is never incorrigible.
TL;DR: Where does the potential incorrigibility arise from in the first place? I would expect it to arise in response to a particular input, but that doesn’t seem to be your model.