This is cute, but I have strong qualms about your third prediction; I don’t disagree, per se, but:
Either “variants of this approach” is too broad to be useful, including things like safety by debate and training a weak AI to check the input;
Or, if I take “variants” narrowly to mean using an AI to check its own inputs, my estimate is “basically zero.”
So I want to double check: what counts as a variant and what doesn’t?
I was using it rather broadly, covering situations where a smart AI is used to oversee another AI and that oversight is a key part of the approach. I wouldn’t usually include safety by debate or input checking, though I might count safety by debate if there were a smart AI overseer of the process that was making important interventions.
In that case, I don’t see why the problem of “system alignment” or “supervisor alignment” is any easier than the problem of “supervisee alignment.”