Rereading this post while thinking about the approximations that we make in alignment, two points jump out at me:
I’m not convinced that robustness to relative scale is as fundamental as the other two, because there is no reason to expect that, in general, the subcomponents will differ significantly in power, especially in settings like adversarial training where both parts are trained with the same approach. That being said, I still agree that it is an interesting question to ask, and some proposals might indeed depend on a version of it.
Robustness to scaling up and robustness to scaling down sound like they can be summarized as: “does it break in the limit of optimality?” and “does it only work in the limit of optimality?”. The first gives us an approximation for studying and designing alignment proposals, and the second points out a potential issue with this approximation. (I’m not saying this captures all of your meaning, though.)