You define robustness to scaling down as “a solution to alignment keeps working if the AI is not optimal or perfect,” but for interpretability you talk about “our interpretability is merely good or great, but doesn’t capture everything relevant to alignment,” which seems to be about the alignment approach/our understanding being flawed, not the AI. I can imagine techniques being robust to an imperfect AI, but I find it harder to imagine how any alignment approach could be robust if the approach/our implementation of the approach itself is flawed. Do you have any example of this?
That’s a great point!
There’s definitely one big difference between how Scott defined it and how I’m using it, which you highlighted well. I think a better way of explaining my change is that in Scott’s original example, the AI being flawed results, in some sense, in the alignment scheme (predict human values and do that) being flawed too.
I hadn’t made the explicit claim in my head or in the post, but thanks to your comment, I think I’m claiming that the version I’m proposing generalizes one of the interesting parts of the original definition and lets it be applied to more settings.
As for your question, there is a difference between a flawed solution and a solution that isn’t the strongest version. What I’m saying about interpretability and single-single is not that a flawed implementation of them would not work (which is obviously trivial), but that for the reductions to function, you need to solve a particularly ambitious form of the problem. And we don’t currently have a good reason to expect to solve this ambitious problem with enough probability to warrant trusting the reduction and not working on anything else.
So an example of a plausible solution (of course I don’t have a good solution at hand) would be to create interpretability techniques powerful enough that, when combined with conceptual and mathematical characterizations of problematic behaviours like deception, they let us see whether a model will end up exhibiting these problematic behaviours. Notice that this possible solution requires working on conceptual alignment, which the reduction to interpretability would strongly discourage.
To summarize, I’m not claiming that interpretability (or single-single) won’t be enough if it’s flawed. I’m claiming that reducing the alignment problem (or multi-multi) to them is actually a reduction to an incredibly strong and ambitious version of the problem, that no one is currently tackling this strong version, and that we have no reason to expect to solve the strong version with such high probability that we should shun alternative approaches.
Does that clarify your confusion with my model?
Yep, that clarifies.