I agree an AI would prefer to produce a working plan if it had the capacity. But an unaligned AI, almost by definition, does not want the same goal we do. If we ask for Plan X, it might produce Plan X as asked, provided the plan is totally orthogonal to its own goals (i.e., the plan’s success or failure is irrelevant to the AI); but if it could do better by producing Plan Y instead, it would. So the question is: how large is the capability gap between “AI can produce a working plan for Y, but can’t fool us into thinking it’s a plan for X” and “AI can produce a working plan for Y that looks to us like a plan for X”?
The honest answer is “we don’t know.” Since failure could be catastrophic, this isn’t something I’d want to leave to chance, even though I wouldn’t go so far as to call that outcome inevitable.