The second case might not really make sense, because deception is a convergent instrumental goal, especially if the AI is trying to cause X and you’re trying to cause not-X, and more generally because an AI that smart probably has inner optimizers that don’t care about this “make a plan, don’t execute plans” thing you thought you’d set up.
I believe the second case is a subcase of the ELK problem. Maybe the AI isn’t trying to deceive you and actually does what you asked it to do (e.g., I want to see “the diamond” on the main detector), yet the plans it produces have a consequence X that you don’t want (in the ELK example, the diamond is stolen but you see something that looks like the diamond on the main detector). The problem is: how can you tell whether the proposed plans have consequence X? Especially if you don’t even know that X is a possible consequence of the plans?