To steelman the parent argument a bit, a simple policy can be dangerous, but if an agent proposed a simple and dangerous policy to us, we probably would not implement it (since we could see that it was dangerous), and thus the agent itself would not be dangerous to us.
If the agent were to propose a policy that, as far as we could tell, appears safe, but was in fact dangerous, then simultaneously:
That doesn’t seem true. Simple policies can be dangerous and more powerful than I am.
To steelman the parent argument a bit, a simple policy can be dangerous, but if an agent proposed a simple and dangerous policy to us, we probably would not implement it (since we could see that it was dangerous), and thus the agent itself would not be dangerous to us.
If the agent were to propose a policy that, as far as we could tell, appears safe, but was in fact dangerous, then simultaneously:
We didn’t understand the policy.
The agent was dangerous to us.