I read your critique as roughly: "Our prior on systems more powerful than us should be that they are neither controllable nor foreseeable. So if we try to use one such system as a tool for another system's safety, we cannot even know all the failure modes."
I think this is true if the systems are general enough that we cannot predict their behavior. However, my impression of, e.g., debate or AI assistants for alignment research is that these would be narrow, e.g., doing only next-token prediction. The Godzilla analogy implies something whose design we have no say in and whose decisions we cannot reason about, and both of those seem off given what current language models can do.