The setup here implies an empirical (but conceptually tricky) research direction: take two different AIs trained on the same prediction task (e.g. predicting the next tokens of webtext) and try to find a correspondence between their internal structures in some way.
It’s a bit unclear to me what the desiderata for this research should be. Ideally, I think we want something like a “mechanistic correspondence”: roughly, a heuristic argument that the two models produce the same output distribution when given the same input.
Back when Redwood was working on model internals and interp, we were somewhat excited about trying something along these lines, probably using automated methods to find a correspondence that looks accurate when checked with causal scrubbing.
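As a very rough illustration of the kind of thing I mean (this is a toy sketch under my own assumptions, not anything Redwood actually ran; a real version would use two language models and causal-scrubbing-style resampling rather than the untrained stand-in models and single linear map below):

```python
# Illustrative sketch only: learn a linear "correspondence" between the hidden
# activations of two stand-in models on shared inputs, then check whether
# patching model B's hidden layer with the mapped activations from model A
# roughly preserves model B's output distribution.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_in, d_hidden, d_out, n = 32, 64, 10, 2048

def make_model():
    return torch.nn.Sequential(
        torch.nn.Linear(d_in, d_hidden),
        torch.nn.ReLU(),
        torch.nn.Linear(d_hidden, d_out),
    )

# Untrained toy models, purely to keep the sketch self-contained; stand-ins
# for two models independently trained on the same prediction task.
model_a, model_b = make_model(), make_model()
x = torch.randn(n, d_in)  # stand-in for a shared input distribution

with torch.no_grad():
    h_a = model_a[1](model_a[0](x))  # model A's hidden activations
    h_b = model_b[1](model_b[0](x))  # model B's hidden activations

    # Fit a linear map W from A's hidden space to B's hidden space by least squares.
    W = torch.linalg.lstsq(h_a, h_b).solution

    # Patch model B's hidden layer with mapped activations from model A and
    # compare the resulting output distribution to unpatched model B.
    logits_b = model_b(x)
    logits_patched = model_b[2](h_a @ W)
    kl = F.kl_div(
        F.log_softmax(logits_patched, dim=-1),
        F.log_softmax(logits_b, dim=-1),
        log_target=True,
        reduction="batchmean",
    )
    print(f"KL(original || patched) = {kl.item():.4f}")
```

A low KL here would only be weak evidence that the learned map “explains” model B’s hidden layer in terms of model A’s; causal scrubbing proper would demand a much more careful, hypothesis-driven version of this check.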
(I haven’t engaged much with this post overall; I just thought this connection might be interesting.)
(I might expand on this comment later, but for now) I’ll point out that there are some pretty large literatures out there that seem at least somewhat relevant to these questions, including on causal models, identifiability and contrastive learning, and neuroAI; for some references and thoughts see e.g.:
https://www.lesswrong.com/posts/Nwgdq6kHke5LY692J/alignment-by-default?commentId=8CngPZyjr5XydW4sC
https://www.lesswrong.com/posts/wr2SxQuRvcXeDBbNZ/bogdan-ionut-cirstea-s-shortform?commentId=A8muL55dYxR3tv5wp
And for some very recent, potentially relevant work using SAEs:
Towards Universality: Studying Mechanistic Similarity Across Language Model Architectures
Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models
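For concreteness, here’s a minimal sketch (mine, under my own assumptions about the setup, not taken from either paper) of one crude way to quantify how well two models’ SAE feature spaces line up: record both SAEs’ feature activations on the same tokens and match features by activation correlation.

```python
# Minimal sketch: the random arrays below are placeholders for real SAE feature
# activations recorded on a shared token dataset.
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
n_tokens, n_feats_a, n_feats_b = 5000, 128, 128

acts_a = rng.standard_normal((n_tokens, n_feats_a))  # placeholder: SAE-A feature activations
acts_b = rng.standard_normal((n_tokens, n_feats_b))  # placeholder: SAE-B feature activations

# Pearson correlation between every pair of features across the two SAEs.
za = (acts_a - acts_a.mean(0)) / acts_a.std(0)
zb = (acts_b - acts_b.mean(0)) / acts_b.std(0)
corr = za.T @ zb / n_tokens  # shape (n_feats_a, n_feats_b)

# One-to-one matching of features that maximizes total |correlation|.
rows, cols = linear_sum_assignment(-np.abs(corr))
overlap = np.abs(corr[rows, cols]).mean()
print(f"mean matched |correlation|: {overlap:.3f}")  # ~0 on random data; higher would suggest shared features
```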