I think it extremely probable that there exist policies which exploit adversarial inputs to J, such that they can do bad stuff while getting J to say "all's fine."
For most Js I agree, but the existence of any adversarial examples for J would be an outer alignment problem (you get what you measure). (For outer alignment, it seems necessary that there exist—and that humans discover—natural abstractions relative to formal world models that robustly pick out at least the worst stuff.)