I’ve not really seen it written up, but it’s conceptually similar to the classic ML ideas of overfitting, over-parameterization, under-specification, and generalization. If you imagine your alignment constraints as a kind of training data for the model, then those ideas fall into place nicely.
After some searching, the most relevant thing I’ve found is Section 9 (page 44) of *Interpretable machine learning: Fundamental principles and 10 grand challenges*. A Rashomon set is the set of models that fit the training data about equally well; larger model classes often have bigger Rashomon sets, and different models in the same Rashomon set can behave very differently.
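Here’s a minimal sketch of that last point (the dataset, model class, and seeds are arbitrary illustrative choices, not anything from the paper): training the same over-parameterized model class from different random seeds tends to produce near-identical test accuracy but nontrivial disagreement on individual inputs.

```python
# Sketch: several models with near-identical accuracy (roughly, one Rashomon
# set) that still disagree on individual inputs. Illustrative choices only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Same over-parameterized model class, different random seeds.
models = [
    MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500,
                  random_state=seed).fit(X_train, y_train)
    for seed in range(5)
]

preds = np.array([m.predict(X_test) for m in models])  # shape: (5, n_test)
accs = [(p == y_test).mean() for p in preds]
print("test accuracies:", np.round(accs, 3))  # typically within ~1% of each other

# Fraction of test points where the models don't all agree (binary labels,
# so min != max along the model axis means at least one disagreement):
disagreement = (preds.min(axis=0) != preds.max(axis=0)).mean()
print("fraction of inputs with some disagreement:", round(disagreement, 3))
```

Swap in your alignment constraints for the training set and the analogy goes through: many models satisfy the constraints equally well, and the constraints alone underdetermine how the model behaves everywhere else.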