I think the most likely outcome of actually trying this with an AI in real life is that you end up with a strategy that is convincing to humans but turns out to be ineffective or unhelpful in reality.
I agree this would be much easier. However, I’m wondering why you think an AI would prefer it, if it has the capability to do either. I can see some possible reasons (e.g., an AI may not want problems of alignment to be solved). Do you think that would be an inevitable characteristic of an unaligned AI with enough capability to do this?
I agree an AI would prefer to produce a working plan if it had the capacity. I think that an unaligned AI, almost by definition, does not want the same goal we do. If we ask for Plan X, it might choose to produce Plan X for us as asked if that plan were totally orthogonal to its goals (i.e., the plan's success or failure is irrelevant to the AI), but if it could do better by creating Plan Y instead, it would. So the question is: how large is the capability difference between "AI can produce a working plan for Y, but can't fool us into thinking it's a plan for X" and "AI can produce a working plan for Y that looks to us like a plan for X"?
The honest answer is "we don't know." Since failure could be catastrophic, this isn't something I'd like to leave to chance, even though I wouldn't go so far as to call the result inevitable.