Contrastive methods could be used both to detect common latent structure across animals, recording sessions, and multiple species (https://twitter.com/LecoqJerome/status/1673870441591750656) and to, e.g., look for which parts of an artificial neural network do what a specific brain area does during a task, assuming shared inputs (https://twitter.com/BogdanIonutCir2/status/1679563056454549504).
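A minimal sketch of what this could look like in code (PyTorch; the encoder architectures, dimensions, and variable names are hypothetical, not from any of the cited work): a symmetric InfoNCE-style contrastive loss that pulls paired samples from two ‘views’ recorded under shared inputs (two sessions, two animals, or brain recordings vs. ANN activations) into a common latent space.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ViewEncoder(nn.Module):
    """Maps one 'view' (e.g. one session's neural activity, or ANN activations) into a shared latent space."""
    def __init__(self, input_dim: int, latent_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)  # unit-norm embeddings

def symmetric_info_nce(z_a, z_b, temperature: float = 0.1):
    """Symmetric InfoNCE: row i of z_a and row i of z_b (same stimulus / time bin) are positives;
    all other pairings in the batch serve as negatives."""
    logits = z_a @ z_b.T / temperature      # (batch, batch) similarity matrix
    targets = torch.arange(z_a.shape[0])    # i-th row of view A matches i-th row of view B
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

# Hypothetical usage: x_brain and x_ann are activity matrices collected under *shared* stimuli,
# aligned row by row (same stimulus / time bin in each row).
enc_brain, enc_ann = ViewEncoder(input_dim=300), ViewEncoder(input_dim=768)
x_brain, x_ann = torch.randn(64, 300), torch.randn(64, 768)
loss = symmetric_info_nce(enc_brain(x_brain), enc_ann(x_ann))
loss.backward()
```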
And there are theoretical results suggesting that some latent factors can be identified using multimodality (all of the following could be interpreted as different modalities: multiple brain recording modalities, animals, sessions, species, brain-ANN pairs), while being provably unidentifiable from a single modality—e.g. Identifiability Results for Multimodal Contrastive Learning (and the results on nonlinear ICA in single-modal vs. multimodal settings reviewed in its section 2.1). This might be a way to bypass single-model interpretability difficulties, e.g. by ‘comparing’ to brains or to other models.
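Roughly, the kind of generative setup such multimodal identifiability results consider (my simplified notation and phrasing, not the paper's exact assumptions): a shared latent plus modality-specific latents, nonlinearly mixed into each observed modality.

```latex
% Simplified sketch of a multimodal generative model with a shared latent (notation assumed):
% z_s: shared latent; z_1, z_2: modality-specific latents; f_1, f_2: unknown nonlinear mixing functions.
\begin{gather*}
  z_s \sim p(z_s), \qquad z_1 \sim p(z_1 \mid z_s), \qquad z_2 \sim p(z_2 \mid z_s), \\
  x^{(1)} = f_1(z_s, z_1), \qquad x^{(2)} = f_2(z_s, z_2).
\end{gather*}
% Rough claim: contrastive learning on paired observations (x^{(1)}, x^{(2)}) can recover the shared
% latent z_s up to a benign (e.g. blockwise invertible) transformation, whereas from a single modality
% x^{(1)} alone, standard nonlinear-ICA non-identifiability arguments apply and z_s is not recoverable
% in general. See the cited paper for the precise assumptions and statement.
```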
Example of a potential cross-species application: empathy mechanisms seem conserved across species (Empathy as a driver of prosocial behaviour: highly conserved neurobehavioural mechanisms across species). Examples of brain-ANN applications: ‘matching’ to modular brain networks, e.g. the language network—ontology-relevant, non-agentic (e.g. The universal language network: A cross-linguistic investigation spanning 45 languages and 12 language families)—or the Theory of Mind network, which could be very useful for detecting lying-relevant circuits (e.g. Single-neuronal predictions of others’ beliefs in humans).
Examples of related interpretability work: across models—Cross-GAN Auditing: Unsupervised Identification of Attribute Level Similarities and Differences between Pretrained Generative Models; across brain measurement modalities—Learnable latent embeddings for joint behavioural and neural analysis; across animals and brain-ANN—Quantifying stimulus-relevant representational drift using cross-modality contrastive learning.
Potentially also relevant—Contrastive Preference Learning: Learning from Human Feedback without RL, TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space, Bridging Associative Memory and Probabilistic Modeling.
Some related cool-seeming recent results: All or None: Identifiable Linear Properties of Next-token Predictors in Language Modeling, Cross-Entropy Is All You Need To Invert the Data Generating Process.
Fragment of a chat with Claude on All or None: Identifiable Linear Properties of Next-token Predictors in Language Modeling and Shared computational principles for language processing in humans and deep language models (my prompt first):
‘in ‘all or none...’ they mention ‘the general form of next-token predictors used in our analysis.‘; if we assume that the language network (in biological brains) is indeed a next-token predictor (as e.g. the next-word prediction results in ‘shared principles...’ suggest), shouldn’t this suggest that the linear properties should then also apply to biological brains, not just LLMs?
This is a very insightful observation! Let’s analyze this:
The key logical chain would be:
1. The “All or None” paper’s results apply to next-token predictors of the form:
```
p(x_{t+1} | x_{1:t}) = exp(f(x_{1:t})ᵀ g(x_{t+1})) / Z(x_{1:t})
```
where f maps sequences to embeddings and g maps tokens to unembeddings
2. The “Shared Computational Principles” paper provides evidence that the human brain:
- Engages in next-word prediction before word onset
- Uses context to predict next tokens
- Shows surprise signals consistent with prediction error
- Represents words in ways that can be mapped to vector spaces
3. If the brain’s language network is implementing a next-token predictor of this general form, then theoretically:
- The same mathematical framework should apply
- The same identifiability results should hold
- The same “all or none” properties of linear features should be present
However, there are some important considerations:
1. Implementation Level:
- While the brain might implement this type of predictor at a computational level, the neural implementation could add constraints or properties not captured by the mathematical framework
- The brain might use a more complex or different form of next-token prediction
2. Verification Challenges:
- It’s harder to verify if the brain’s implementation exactly matches this mathematical form
- We can’t directly observe all components like f and g in biological systems
3. Scale/Scope:
- The brain might implement multiple prediction systems operating at different scales
- The mathematical framework might apply to some but not all aspects of neural language processing
So while your reasoning seems sound—if the brain implements this type of next-token predictor, the mathematical results should apply—verifying this empirically would be challenging. It’s a fascinating theoretical bridge between the mathematical analysis and biological implementation.’
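For concreteness, here is a minimal runnable sketch (PyTorch; the architecture and names are my own, not the paper's construction) of the general next-token predictor form discussed in the chat: a context encoder f producing embeddings, token unembeddings g, and a softmax over f(x_{1:t})ᵀ g(x'). The linear-property question is then about directions in these embedding/unembedding spaces.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeneralNextTokenPredictor(nn.Module):
    """p(x_{t+1} | x_{1:t}) = softmax over the vocabulary of f(x_{1:t})^T g(x_{t+1}).
    Here f is a stand-in context encoder (a GRU) and g is an unembedding table; any architecture
    whose output has this form fits the 'general form' referenced above."""
    def __init__(self, vocab_size: int, d_model: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.f = nn.GRU(d_model, d_model, batch_first=True)  # f: sequences -> context embeddings
        self.g = nn.Embedding(vocab_size, d_model)            # g: tokens -> unembeddings

    def forward(self, x_1_to_t):                              # (batch, t) token ids
        h, _ = self.f(self.embed(x_1_to_t))
        context = h[:, -1, :]                                  # f(x_{1:t}), shape (batch, d_model)
        logits = context @ self.g.weight.T                     # f(x_{1:t})^T g(x') for every token x'
        return F.log_softmax(logits, dim=-1)                   # log p(x_{t+1} | x_{1:t})

# A binary concept (e.g. plurality) is *linearly represented* in the unembedding space if some
# direction separates the unembeddings g(x') of tokens that carry the concept from those that don't;
# the paper's results concern when such linear properties must hold (roughly, for all distribution-
# equivalent predictors of this form or for none; see the paper for the precise statement).
# No probe is fitted here; this just runs the forward pass on random token ids.
model = GeneralNextTokenPredictor(vocab_size=1000)
log_probs = model(torch.randint(0, 1000, (8, 12)))             # predict token 13 from 12-token contexts
```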
OTOH, Challenges in Explaining Representational Similarity through Identifiability presents good counter-points. Also, Position: Understanding LLMs Requires More Than Statistical Generalization.
Examples of reasons to expect (approximate) convergence to the same causal world models in various setups: Theorem 2 in Robust agents learn causal world models; from Deep de Finetti: Recovering Topic Distributions from Large Language Models: ‘In particular, given the central role of exchangeability in our analysis, this analysis would most naturally be extended to other latent variables that do not depend heavily on word order, such as the author of the document [Andreas, 2022] or the author’s sentiment’ (this assumption might be expected to be approximately true for quite a few alignment-relevant concepts); and results from Victor Veitch: Linear Structure of (Causal) Concepts in Generative AI.
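For reference, the de Finetti-style argument behind the quoted passage, in its standard form (notation mine): if tokens are approximately exchangeable given a document-level latent (topic, author, sentiment), the predictive distribution factors through a posterior over that latent.

```latex
% De Finetti / Bayesian-mixture view (standard result; notation mine):
% theta = document-level latent (topic, author, sentiment); x_1, ..., x_t = tokens.
\begin{gather*}
  p(x_1, \dots, x_t) = \int \Big[ \prod_{i=1}^{t} p(x_i \mid \theta) \Big] p(\theta)\, d\theta, \\
  p(x_{t+1} \mid x_{1:t}) = \int p(x_{t+1} \mid \theta)\, p(\theta \mid x_{1:t})\, d\theta.
\end{gather*}
% So a next-token predictor that matches the data distribution on (approximately) exchangeable
% documents must implicitly carry the information in the posterior p(theta | x_{1:t}), which is the
% sense in which order-insensitive latents (topic, author, sentiment) are expected to be recoverable.
```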