I think Nate’s original argument holds, but might need a bit of elaboration:
Roughly speaking, I think that alignment approaches with a heavy reliance on output evaluation are doomed, [both] on the grounds that humans can’t evaluate the effectiveness of a plan capable of ending the acute risk period, and [...].
I think there are classes of pivotal plans + their descriptions such that, if the AI gave us one of them, we could tell whether it is pivotal good or pivotal bad. But there are also other classes of pivotal plans + their descriptions such that you would think you could tell the difference, but you can’t.
So the observation “this seems good and I am super-convinced I could tell if it wasn’t”—by itself—isn’t enough. To trust the AI’s plan, you need some additional argument for why it is the kind of AI that wouldn’t try to deceive you, or why it wouldn’t search through dangerous plans, yada yada. But that essentially means you aren’t relying on the plan verification step anymore.
I’d guess we can reliably identify some classes of pivotal acts where we cannot easily be fooled, and would only accept suggestions from those classes, and I’d still intuitively expect that there are doable pivotal acts in those classes.