This is a useful idea; it complements the omnipotence test, where you ask whether the AI as a whole still does the right thing when scaled up to an absurd degree (while civilization outside the AI isn't scaled up, even though for alignment purposes it acts like a part of the AI). In particular, any reflectively stable maximizer that isn't aimed exactly, with no approximations, at CEV fails this test because of Goodhart. The traditional answer is to aim it exactly; the more recent answer is to rule out maximization at the level of decision theory, so that acting very well still isn't maximization.
Robustness to scaling down instead supposes that some parts of the system become ineffectual, even if that shouldn't plausibly happen, and asks what follows: would the other parts take advantage and cause trouble? What if civilization, seen for alignment purposes as the part of the AI that holds its values, doesn't work very well? Would the AI-except-civilization cause trouble?
I imagine the next step is to suppose that some part has been compromised by a capable adversary acting purposefully to undermine the system: robustness to catastrophic failure in a part of the design, rather than to scaling down. This seems related to inner alignment and corrigibility: making sure parts don't lose their purposes, while having the parts themselves cooperate with fixing their purposes and refrain from acting outside them.