(I might expand on this comment later but for now) I’ll point out that there are some pretty large literatures out there which seem at least somewhat relevant to these questions, including on causal models, identifiability and contrastive learning, and on neuroAI—for some references and thoughts see e.g.:
(I might expand on this comment later but for now) I’ll point out that there are some pretty large literatures out there which seem at least somewhat relevant to these questions, including on causal models, identifiability and contrastive learning, and on neuroAI—for some references and thoughts see e.g.:
https://www.lesswrong.com/posts/Nwgdq6kHke5LY692J/alignment-by-default?commentId=8CngPZyjr5XydW4sC
https://www.lesswrong.com/posts/wr2SxQuRvcXeDBbNZ/bogdan-ionut-cirstea-s-shortform?commentId=A8muL55dYxR3tv5wp
And for some very recent potentially relevant work, using SAEs:
Towards Universality: Studying Mechanistic Similarity Across Language Model Architectures
Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models