Exploring the Platonic Representation Hypothesis Beyond In-Distribution Data

The Platonic Representation Hypothesis (PRH) suggests that models trained with different objectives and on various modalities can converge to a shared statistical understanding of reality. While this is an intriguing idea, the initial experiments in the paper focused on image-based models (such as ViT) that share the same pretraining dataset (ImageNet) and are evaluated on in-distribution data. This raises an important question:

Does PRH hold only when models are evaluated on data drawn from their shared training distribution?

To explore this question, the experiment was extended to ImageNet-O, a dataset specifically curated so that its images are out-of-distribution (OOD) with respect to ImageNet. Using ImageNet-O, the correlation analysis of alignment scores across various metrics for image classification models was re-evaluated.
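For concreteness, below is a minimal sketch of one alignment metric in the spirit of the PRH paper: a mutual k-nearest-neighbor score computed between two models' features on the same ImageNet-O images. The feature-matrix names, the cosine-similarity neighborhoods, and the default k are illustrative assumptions rather than the exact setup used in these experiments.

```python
# Minimal sketch: mutual k-NN alignment between two representations of the
# same N images. `feats_a` / `feats_b` are assumed (N, D) feature matrices.
import numpy as np

def knn_indices(feats: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k most cosine-similar samples for each row (self excluded)."""
    normed = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)            # a point is not its own neighbor
    return np.argsort(-sims, axis=1)[:, :k]    # top-k neighbors per sample

def mutual_knn_alignment(feats_a: np.ndarray, feats_b: np.ndarray, k: int = 10) -> float:
    """Average overlap between the k-NN sets induced by the two feature spaces."""
    nn_a, nn_b = knn_indices(feats_a, k), knn_indices(feats_b, k)
    overlaps = [len(set(a) & set(b)) / k for a, b in zip(nn_a, nn_b)]
    return float(np.mean(overlaps))
```

The higher this score, the more the two models agree on which images resemble each other, regardless of how their individual feature spaces are parameterized.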

The outcome? PRH holds true in the OOD setting as well, which challenges the notion that a shared data distribution is a prerequisite for this convergence. This observation carries significant implications for AI alignment research, suggesting that a deeper underlying structure may govern how models develop representations of reality, even when the training data differs.

Below are the correlation plots comparing results from the original in-distribution experiment to those with OOD data:
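The numbers behind such plots reduce to a rank correlation between two lists of alignment scores, one score per model pair, whether the lists come from two different metrics or from the in-distribution versus OOD evaluations. A hedged sketch, assuming SciPy is available and the scores have already been collected:

```python
# Sketch of the correlation step: Spearman rank correlation between two lists
# of alignment scores (one entry per model pair). The function and argument
# names are placeholders, not the notebook's actual variables.
from scipy.stats import spearmanr

def score_correlation(scores_x: list[float], scores_y: list[float]) -> tuple[float, float]:
    """Return (rho, p_value) for how consistently the two score lists rank model pairs."""
    rho, p_value = spearmanr(scores_x, scores_y)
    return float(rho), float(p_value)
```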

But does this mean that the models align even on purely randomly generated data?

The answer is NO.

This plot shows the correlation of the alignment scores for the models on purely randomly generated images.
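A hypothetical way to construct such a control set is to replace natural images with uniform noise; the sampling distribution, image shape, and seed below are assumptions rather than the notebook's exact setup.

```python
# Hypothetical random-image control set: uniform noise in place of natural images.
import numpy as np

rng = np.random.default_rng(seed=0)
num_images, image_size = 1000, 224
random_images = rng.random((num_images, 3, image_size, image_size), dtype=np.float32)

# Features extracted from each model on `random_images` can then be scored with
# the same alignment metrics (e.g. mutual_knn_alignment above) as for ImageNet-O.
```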

The notebook documenting these experiments is available here.

What’s Next?

The next step in this research is to identify datasets or conditions under which PRH fails to hold, so that the hypothesis can be meaningfully falsified.