It would be very convenient if only undesired plans were difficult and precise, while only desired plans were error-tolerant. I don't think that's how the real world works: it's hard to avoid sifting desired from undesired plans based on their semantic content.
I don't think that's the claim. Reward chisels cognition. GAA is chiseling a preference for simple, reliable plans that achieve some objective in training. Such plans are more likely to be desirable in deployment, but that's a heuristic, not a proof. Independently, such plans are more likely to be interpretable, robust, legible, etc. I think of this along the same lines as Impact Regularization.
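Roughly, and this is just my own sketch rather than anything formal from the post: impact regularization swaps the raw objective for a penalized one, and I'm picturing the simplicity/reliability preference the same way. Here $d$, $\mathrm{Complexity}$, $\mathrm{Fragility}$, and the weights are all placeholder stand-ins, not quantities GAA actually defines:

$$
J_{\text{impact}}(\pi) = \mathbb{E}\left[R(\pi)\right] - \lambda\, d\!\left(s_\pi,\, s_{\text{baseline}}\right)
\qquad\longleftrightarrow\qquad
J_{\text{GAA-like}}(\pi) = \mathbb{E}\left[R(\pi)\right] - \mu_1\,\mathrm{Complexity}(\pi) - \mu_2\,\mathrm{Fragility}(\pi)
$$

In both cases the point is the shape of the tradeoff, not the particular penalty terms.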
Real-world problems are sometimes best addressed by complicated, unreliable plans, because no simple, reliable plan exists that makes them happen. And sometimes there are simple, reliable plans for seizing control of the reward button.
Legibility is an interesting point. Once alignment work even hints at tackling alignment in the sense of doing good things and not bad things, I tend to get tunnel vision and evaluate it only through that lens.