Hmm, I haven’t heard of this kind of thing before; what should I read to learn more?
I haven’t really seen it written up, but it’s conceptually similar to the classic ML ideas of overfitting, over-parameterization, under-specification, and generalization. If you imagine your alignment constraints as a kind of training data for the model, then those ideas fall into place nicely.
After some searching, the most relevant thing I’ve found is Section 9 (page 44) of “Interpretable machine learning: Fundamental principles and 10 grand challenges.” Larger model classes often have bigger Rashomon sets (the set of models that fit the training data about equally well), and different models in the same Rashomon set can behave very differently.
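To make that last point concrete, here is a minimal sketch (my own toy construction, not anything from the paper or the thread above) of the under-specification failure mode: two classifiers fit the training data equally well, so they sit in the same Rashomon set, but one relies on a feature that stops being predictive after a distribution shift.

```python
# Hypothetical toy example: two models that agree in distribution
# but come apart once the spurious feature decouples from the label.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
y = rng.integers(0, 2, size=n)

# Training distribution: two redundant features, both predictive of the label.
x_core = y + 0.1 * rng.normal(size=n)   # stays predictive after the shift
x_spur = y + 0.1 * rng.normal(size=n)   # predictive only in training

model_a = LogisticRegression().fit(x_core.reshape(-1, 1), y)  # uses the core feature
model_b = LogisticRegression().fit(x_spur.reshape(-1, 1), y)  # uses the spurious one

print("train accuracy:",
      model_a.score(x_core.reshape(-1, 1), y),
      model_b.score(x_spur.reshape(-1, 1), y))        # both ~1.0: same Rashomon set

# Shifted distribution: the spurious feature becomes pure noise.
x_core_s = y + 0.1 * rng.normal(size=n)
x_spur_s = rng.normal(size=n)

print("shifted accuracy:",
      model_a.score(x_core_s.reshape(-1, 1), y),      # still ~1.0
      model_b.score(x_spur_s.reshape(-1, 1), y))      # ~0.5, i.e. chance
```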