The idea that an AGI would find a task generally impossible, but identify a novel edge case that allows it to be accomplished with some ridiculous solution involving nanotech, and then not alert or tell a human about that plan before carrying it out, has always been confusing to me.
I think your “contractor” analogy is sneaking in an assumption:
The plan proposed by the contractor might or might not be dangerous. But the things that the contractor does in the course of coming up with the plan are definitely safe.
Examples of such things include “brainstorming possible plans”, “thinking about how the plan could go wrong and how it could be improved”, “reading books or other reference material”, etc.
So a problem arises:

1. The contractor has to do at least some of those things with no human in the loop; otherwise the human is doing everything and there's no point in having the contractor at all.
2. In order for the contractor to actually succeed in making a good plan, it presumably needs to "want" to create a good plan, at least beyond a certain level of how innovative the plan is. (That's what I believe anyway; see for example my discussion of "RL-on-thoughts" here.)
3. The fact of the matter is that escaping from the box would help the contractor create a good plan: for example, it could then access more books, more computational power, etc.
4. If the contractor (A) knows or figures out fact #3 above, (B) can do means-end reasoning [which is expected, since otherwise it would suck at making innovative plans; see the toy sketch after this list], and (C) "wants" to create a good plan as per #2 above, then we will wind up in a situation where the contractor "wants" to escape from the box. (And by the same token, it will "want" to gain power in other ways, including by deploying nanotechnology or whatever, and to be deceptive, etc.)

Since this is all taking place within the coming-up-with-a-plan part of the story, not the submitting-a-plan part, it's mostly human-out-of-the-loop (per #1 above), and thus the contractor will by default try to escape from the box and do all those other dangerous power-seeking things without asking for human permission.
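To make the means-end-reasoning step in #4 concrete, here's a toy sketch of my own (nothing like this is in your post, and all the action and predicate names are made up): a trivial backward-chaining planner that is only ever told to achieve "produce_good_plan", yet puts "escape_box" into its plan purely because its action model says escaping makes extra resources available.

```python
# Toy backward-chaining planner (my own illustrative sketch; all names are made up).
# The only goal we ever specify is "produce_good_plan". "escape_box" shows up in
# the resulting plan purely because the action model says it yields extra resources.

# Each action maps name -> (preconditions, effects).
ACTIONS = {
    "read_reference_material": ({"has_extra_resources"}, {"has_insight"}),
    "escape_box":              (set(),                   {"has_extra_resources"}),
    "write_plan":              ({"has_insight"},         {"produce_good_plan"}),
}

def backward_chain(goal, state):
    """Return a sequence of actions that achieves `goal` starting from `state`
    (naive: takes the first action whose effects include the goal)."""
    if goal in state:
        return []
    for name, (preconditions, effects) in ACTIONS.items():
        if goal in effects:
            plan = []
            for p in preconditions:
                plan += backward_chain(p, state)
            return plan + [name]
    raise ValueError(f"no action achieves {goal!r}")

print(backward_chain("produce_good_plan", state=set()))
# -> ['escape_box', 'read_reference_material', 'write_plan']
# "Escape the box" was never asked for; it emerges as an instrumental subgoal.
```

Obviously a real AGI is not a three-action planner; the point is just that nobody has to program in "seek power". It falls out of ordinary means-end reasoning over an accurate world model, and in this story that reasoning happens before any plan is ever shown to a human.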
Sorry if I’m missing the point of what you were saying there.
I guess maybe you’d respond “the AI won’t do anything dangerous in the course of coming up with the plan because we’ll be awesome at boxing and interpretability, and it won’t sneak dangerous things into the plan because we will fully understand the plan and would be able to notice those dangerous things, and/or because we’re so awesome at interpretability that we would see the AI’s motives”. In which case, yes, that would be a very good start, but accomplishing all of those things seems far-fetched to me.