A) Use some fancy AI system to create a rocket design, optimizing according to some specifications that we write down, and then sample rocket designs output by this system until you find one that a human approves of.
This does not seem like a realistic thing someone might do: if the AI doesn’t output a rocket design that a human approves of after a few tries, wouldn’t the AI user decide that the specification must be wrong or incomplete, look at the specification and the AI’s outputs to figure out what went wrong, tweak the specification, and then try again? (Or alternatively decide that the AI isn’t powerful enough or is otherwise unsuitable for the task.) Assuming this iterative process produces a human-approved rocket design within a reasonable amount of time, it seems obvious that it’s likely safer than B, since there is much less probability that the AI does something unintended like creating a message to the user.
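Concretely, the iterative process I have in mind is roughly the sketch below. This is a minimal illustration of the sample-until-approval loop with spec revision, not any real system; every function in it is a hypothetical placeholder I'm introducing for this example.

```python
# Minimal sketch of the sample-and-revise workflow described above.
# All of the functions below are hypothetical stand-ins, not a real API.

import random

def generate_design(spec):
    """Stand-in for the AI system: return a candidate rocket design for `spec`."""
    return {"spec": spec, "candidate_id": random.randrange(10**6)}

def human_approves(design):
    """Stand-in for human review of a single candidate design."""
    return random.random() < 0.1  # pretend roughly 10% of candidates pass review

def revise_spec(spec, rejected):
    """Stand-in for the human inspecting the spec and the rejected outputs, then tweaking the spec."""
    return spec + " (revised)"

def design_rocket(spec, tries_per_spec=5, max_revisions=10):
    """Sample designs until one is approved; after a few failures, revise the spec and retry."""
    for _ in range(max_revisions):
        rejected = []
        for _ in range(tries_per_spec):
            design = generate_design(spec)
            if human_approves(design):
                return design  # option A succeeds: a human-approved design
            rejected.append(design)
        # No approval after a few tries: treat the spec as wrong or incomplete and tweak it.
        spec = revise_spec(spec, rejected)
    return None  # give up: the AI isn't powerful or suitable enough for the task

if __name__ == "__main__":
    print(design_rocket("reach a 400 km orbit with a 1000 kg payload"))
```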
Given this, I’m having trouble understanding the motivation for the question you’re asking. Can you give another example that better illustrates what you’re trying to understand?