Selected fragments (though not really cherry-picked, no reruns) of a conversation with Claude Opus on operationalizing something like Activation vector steering with BCI by applying the methodology of Concept Algebra for (Score-Based) Text-Controlled Generative Models to the model from High-resolution image reconstruction with latent diffusion models from human brain activity (website with nice illustrations of the model).
My prompts are the first line of each quoted fragment:
‘Could we do concept algebra directly on the fMRI of the higher visual cortex?
Yes, in principle, it should be possible to perform concept algebra directly on the fMRI signals from the higher visual cortex. The key idea would be to identify subspaces in the fMRI signal space that correspond to specific semantic concepts, and then manipulate these subspaces to control the reconstructed image.
Here’s a hypothetical workflow:
Collect fMRI data while the subject views a large set of images that vary along certain semantic dimensions (e.g., gender, emotion, object category, scene type, etc.).
For each semantic dimension, identify a set of fMRI samples that differ only along that dimension. For example, for the gender dimension, you might compare the fMRI signals for “man” vs “woman”, keeping other aspects of the image content constant.
Use these samples to estimate a “semantic subspace” for each dimension in the fMRI signal space, similar to how concept algebra estimates these subspaces in the latent space of the diffusion model. You could use techniques like PCA or linear discriminant analysis to identify the directions in fMRI space that best capture the variation along each semantic dimension.
To reconstruct an image, first predict the full fMRI signal from the early visual cortex (similar to predicting z in Takagi et al.).
Then, to manipulate a specific semantic concept, project the predicted fMRI signal onto the orthogonal complement of the corresponding semantic subspace (to remove the original semantic information), and add the desired semantic component from a reference fMRI signal (similar to the concept algebra manipulation of z_c).
Finally, use this modified fMRI signal from the higher visual cortex to condition the diffusion model, reconstructing an image with the desired semantic manipulation.
The main challenges would be:
Collecting a sufficiently large and diverse fMRI dataset to robustly estimate the semantic subspaces.
Ensuring that the estimated subspaces are truly orthogonal and capture the desired semantic variation.
Translating the manipulated fMRI signals into a format that can effectively condition the diffusion model.
Nonetheless, this is an intriguing direction that could potentially unite the biological understanding from Takagi et al.’s work with the interpretable semantic manipulation of concept algebra. It suggests a way to manipulate images directly from brain-space, without the need for text prompts. This could provide insights into how the brain itself represents and manipulates semantic concepts.’
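To make the projection step of that workflow concrete, here is a minimal NumPy sketch. It assumes you already have paired response vectors (voxels from the higher visual cortex) for stimuli that differ only in one concept, and it estimates the concept subspace with an uncentered SVD of the paired differences, a PCA-like choice (LDA would be an alternative, not shown). All function and variable names are hypothetical, and the demo data are synthetic, not real fMRI.

```python
import numpy as np


def estimate_concept_subspace(diffs: np.ndarray, k: int) -> np.ndarray:
    """Return an orthonormal basis (n_voxels x k) for a concept subspace.

    diffs: (n_pairs, n_voxels) paired fMRI differences, e.g. the response to a
    "woman" stimulus minus the response to a matched "man" stimulus. The SVD is
    uncentered, so a consistent shift shared by the pairs tends to dominate the
    leading direction (a PCA-like choice, assumed here for illustration).
    """
    # Rows of vt are orthonormal right singular vectors, ordered by singular value.
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[:k].T


def swap_concept(x: np.ndarray, x_ref: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Remove x's component in span(basis) and insert x_ref's component instead."""
    coords_x = basis.T @ x        # coordinates of x inside the concept subspace
    coords_ref = basis.T @ x_ref  # coordinates of the reference signal
    return x - basis @ coords_x + basis @ coords_ref


# Synthetic demo: 200 matched pairs over 50 "voxels", one true concept direction.
rng = np.random.default_rng(0)
n_pairs, n_voxels = 200, 50
true_dir = rng.normal(size=n_voxels)
true_dir /= np.linalg.norm(true_dir)
shift = rng.normal(loc=2.0, scale=0.5, size=n_pairs)
diffs = np.outer(shift, true_dir) + 0.05 * rng.normal(size=(n_pairs, n_voxels))

basis = estimate_concept_subspace(diffs, k=1)
x = rng.normal(size=n_voxels)      # stand-in for a predicted higher-visual-cortex signal
x_ref = rng.normal(size=n_voxels)  # stand-in for a reference signal carrying the target concept
x_edit = swap_concept(x, x_ref, basis)
print(abs(basis[:, 0] @ true_dir))  # close to 1: direction recovered up to sign
```

The edited vector would then go through whatever fMRI-to-latent decoder is used, which is not shown. Handling several concepts at once would mean orthogonalizing their bases jointly, one of the challenges the answer lists.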
‘How might the above be relevant to the AI alignment problem?
The hypothetical approach of performing concept algebra directly on fMRI signals from the higher visual cortex could potentially offer some interesting insights relevant to the AI alignment problem:
Understanding AI systems: By comparing the semantic subspaces identified in fMRI signals with those found in the latent spaces of AI models (like the diffusion model in concept algebra), we could gain insights into how AI systems represent and manipulate semantic concepts, and how this compares to the human brain. This could help in designing AI systems that reason in ways more analogous to humans, which could be beneficial for alignment.
Interpretability and control: The ability to identify and manipulate specific semantic concepts directly in the space of brain activity could provide a more direct and interpretable way of controlling the output of AI systems. If we can align the semantic subspaces of the AI with those of the human brain, it might become possible to control the AI’s outputs by manipulating these subspaces, leading to more interpretable and controllable AI systems.
Value alignment: By understanding how human values and preferences are represented in the brain, and how they relate to the semantic concepts identified through concept algebra, we might gain insights into how to align AI systems with human values. If we can identify the neural subspaces corresponding to human values and ethics, we could potentially use this information to constrain the AI’s outputs to align with these values.
Cognitive modeling: This approach could also contribute to the development of more accurate cognitive models of human reasoning and decision-making. By understanding how semantic concepts are represented and manipulated in the brain, we can build AI systems that more closely mimic human cognition, which could be useful for predicting and understanding human behavior in the context of AI alignment.
Neuro-symbolic AI: The idea of manipulating semantic concepts directly in the space of brain activity is reminiscent of neuro-symbolic AI approaches, which aim to combine the strengths of neural networks and symbolic reasoning. Insights from this fMRI-based concept algebra could potentially inform the development of neuro-symbolic AI systems that are more interpretable and aligned with human cognition.
However, it’s important to note that these are speculative connections, and there would be significant technical and ethical challenges in realizing this approach. Nonetheless, exploring the parallels between the structure of human thought and the structure of AI representations could be a valuable direction for informing AI alignment research.’
‘How would this be relevant in the context of this proposal? [note: the proposal being Activation vector steering with BCI]
The proposed concept algebra on fMRI signals could be highly relevant to the goals outlined in this funding proposal. Specifically:
The proposal aims to find mappings between human brain states and the latent representations of language models. Applying concept algebra directly on fMRI signals, as discussed, could provide a more direct and interpretable way to establish these mappings. By identifying semantic subspaces in the fMRI signals and aligning them with subspaces in the language model’s latent space, we could potentially “steer” the language model to exhibit brain states similar to specific people.
The proposal suggests that reward models could be trained to reproduce humanlike brain states during evaluation, to improve their generalization. If concept algebra on fMRI can successfully manipulate semantic concepts in the reconstructed outputs (e.g., text or images), this could provide a way to directly optimize the reward models to produce outputs that match desired brain states.
The proposal aims to advance scientific understanding of analogies between LLM behavior and human behavior. Comparing the semantic subspaces identified in fMRI with those in LLMs, as enabled by concept algebra, could provide insights into these analogies at a more fundamental, representational level.
The proposed fMRI experiments could potentially provide the necessary data for attempting concept algebra on brain signals. The proposal includes plans for designing and implementing fMRI data-collection protocols, recruiting subjects, and administering the experiments. This data could be used to estimate the semantic subspaces needed for concept algebra.
The ML experiments proposed, including the fMRI feature extraction pipeline and architecture modifications, could potentially be adapted to implement concept algebra on the fMRI signals, in addition to the original goal of mapping brain states to LLM representations.
In summary, while the original proposal does not explicitly mention concept algebra, the ideas discussed around applying concept algebra to fMRI signals could potentially enhance and extend the proposal’s goals of connecting human brain states with AI representations, improving reward model generalization, and advancing scientific understanding of human-AI analogies. The proposed fMRI experiments could provide the necessary data, and the planned ML experiments could be extended to include concept algebra on the brain signals.’
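One way to make “aligning semantic subspaces in fMRI with subspaces in the model’s latent space” measurable is to push an fMRI-space concept basis through a linear map into the latent space, then compute principal angles against a concept subspace estimated there. The sketch below is a hypothetical NumPy illustration: the map `W` is assumed to be given (e.g. fitted by ridge regression), both bases are assumed to come from something like the paired-difference procedure discussed above, and the demo data are random.

```python
import numpy as np


def principal_angle_cosines(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Cosines of the principal angles between span(A) and span(B), largest first.

    A is (d, k_a) and B is (d, k_b); columns only need to be linearly independent.
    A value of 1 means the two subspaces share a direction; 0 means orthogonal.
    """
    qa, _ = np.linalg.qr(A)  # orthonormal basis for span(A)
    qb, _ = np.linalg.qr(B)  # orthonormal basis for span(B)
    # Singular values of qa^T qb are the cosines of the principal angles.
    s = np.linalg.svd(qa.T @ qb, compute_uv=False)
    return np.clip(s, 0.0, 1.0)  # guard against tiny floating-point overshoot


def map_fmri_basis(basis_fmri: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Push an fMRI-space basis (n_voxels, k) through a linear map x -> x @ W.

    W is (n_voxels, d_latent); the result is (d_latent, k).
    """
    return W.T @ basis_fmri


# Synthetic demo with made-up sizes.
rng = np.random.default_rng(1)
n_voxels, d_latent = 60, 20
W = rng.normal(size=(n_voxels, d_latent))                      # hypothetical fMRI -> latent map
basis_fmri = np.linalg.qr(rng.normal(size=(n_voxels, 2)))[0]  # hypothetical 2-D brain concept subspace
mapped = map_fmri_basis(basis_fmri, W)

# Sanity check: a latent basis spanning the same subspace as the mapped one.
same_span = mapped @ rng.normal(size=(2, 2))
# Contrast: an unrelated random 2-D latent subspace.
unrelated = rng.normal(size=(d_latent, 2))

print(principal_angle_cosines(mapped, same_span))  # both close to 1
print(principal_angle_cosines(mapped, unrelated))  # typically well below 1
```

Principal angles do not depend on which basis is chosen within each subspace, which is why the sanity-check case comes out as 1 even though its basis is a random mix of the mapped columns. Whether a real brain-to-latent map preserves concept subspaces like this is the empirical question such a comparison would test.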
‘Above you wrote: ‘Use these samples to estimate a “semantic subspace” for each dimension in the fMRI signal space, similar to how concept algebra estimates these subspaces in the latent space of the diffusion model. You could use techniques like PCA or linear discriminant analysis to identify the directions in fMRI space that best capture the variation along each semantic dimension.’ What if instead I just took a large diverse dataset of fMRI of images of men and fMRI of images of women, took the means of the two, subtracted them, and used that as the sex direction in concept algebra?
Your proposed approach of taking the difference between the mean fMRI signals for men and women to define the “sex direction” is a great idea and could indeed work as a simple alternative to more complex methods like PCA or LDA.
Here’s why this approach makes sense:
If the fMRI signals for men and women differ primarily along a single direction in the high-dimensional fMRI space, then the difference between their means will point along this direction.
By using a large and diverse dataset, you’re essentially averaging out other sources of variation (e.g., individual differences, noise), leaving only the consistent difference between men and women.
This is analogous to how concept algebra identifies concept directions in the latent space of the diffusion model: by taking the difference between the latent representations of prompts that differ only in the target concept.
To implement this:
Collect a large dataset of fMRI signals for images of men and women, ensuring diversity in other aspects (age, race, emotion, background, etc.).
Compute the mean fMRI signal for men (m_men) and women (m_women) separately.
Compute the “sex direction” as: v_sex = m_women - m_men.
To manipulate the sex of a reconstructed image, you could then project the fMRI signal onto the orthogonal complement of v_sex (to remove the original sex information), and then add or subtract a scaled version of v_sex to introduce the desired sex information.
The main advantages of this approach over PCA or LDA are simplicity and interpretability. It doesn’t require any complex eigendecompositions or model fitting, and the resulting direction has a clear interpretation as the difference between men and women.
Potential limitations to keep in mind:
If there are multiple independent directions along which men and women differ, this approach will only capture their average, not the individual directions.
If there are significant interactions between sex and other factors (e.g., if the sex difference is itself different for different ages or emotions), this simple difference may not fully capture the sex effect.
The approach assumes linearity—that the sex effect is a simple additive shift in fMRI space. If there are nonlinear interactions, this linear shift may not fully capture the effect.
Despite these potential limitations, I think this is a promising and elegantly simple approach to try for implementing concept algebra directly on fMRI signals. It’s a great starting point that could be refined with more sophisticated methods if needed.’
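Here is a minimal NumPy sketch of that mean-difference recipe, plus a quick diagnostic for the second limitation in that answer (whether the direction is stable across subgroups such as age bins). It assumes trials-by-voxels response matrices with per-trial labels. The label and grouping arrays (`is_woman`, `group`) and all function names are hypothetical, and the demo data are synthetic. Setting the coordinate along the unit direction to a chosen target is one way to implement “project out, then add a scaled v_sex”; using the women’s mean coordinate as the target is an assumption, not something the conversation specifies.

```python
import numpy as np


def mean_difference_direction(X_a: np.ndarray, X_b: np.ndarray) -> np.ndarray:
    """v = mean(X_a) - mean(X_b) for (n_trials, n_voxels) response matrices."""
    return X_a.mean(axis=0) - X_b.mean(axis=0)


def set_concept_coordinate(x: np.ndarray, v: np.ndarray, target: float) -> np.ndarray:
    """Project x onto the orthogonal complement of v, then add target * unit(v)."""
    u = v / np.linalg.norm(v)
    x_perp = x - (u @ x) * u  # removes the original component along v
    return x_perp + target * u


def subgroup_cosines(X: np.ndarray, is_woman: np.ndarray, group: np.ndarray) -> dict:
    """Cosine between each subgroup's mean-difference direction and the pooled one.

    Values well below 1 would hint at sex-by-group interactions (the second
    limitation listed above).
    """
    v_all = mean_difference_direction(X[is_woman], X[~is_woman])
    out = {}
    for g in np.unique(group):
        in_g = group == g
        v_g = mean_difference_direction(X[in_g & is_woman], X[in_g & ~is_woman])
        out[int(g)] = float(v_g @ v_all / (np.linalg.norm(v_g) * np.linalg.norm(v_all)))
    return out


# Synthetic demo: 400 trials, 30 "voxels", 4 subgroups balanced by sex.
rng = np.random.default_rng(2)
n_trials, n_voxels = 400, 30
group = np.arange(n_trials) % 4
is_woman = (np.arange(n_trials) // 4) % 2 == 1
v_true = rng.normal(size=n_voxels)
v_true /= np.linalg.norm(v_true)
offsets = rng.normal(size=(4, n_voxels))  # subgroup-specific baseline responses
X = offsets[group] + 2.0 * is_woman[:, None] * v_true + 0.3 * rng.normal(size=(n_trials, n_voxels))

v_sex = mean_difference_direction(X[is_woman], X[~is_woman])
u = v_sex / np.linalg.norm(v_sex)
x_man = X[~is_woman][0]
x_as_woman = set_concept_coordinate(x_man, v_sex, target=float(u @ X[is_woman].mean(axis=0)))
print(subgroup_cosines(X, is_woman, group))  # all close to 1 here, by construction
```

With real data, subgroup cosines noticeably below 1 would suggest that a single pooled v_sex is too coarse, and per-subgroup directions or one of the more elaborate methods might be needed.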
More reasons to think something like the above should work: High-resolution image reconstruction with latent diffusion models from human brain activity literally steers diffusion models using linearly-decoded fMRI signals (see fig. 2), and linear encoding in the opposite direction, from the text latents to fMRI, also works well (see fig. 6; similar results appear in Natural language supervision with a large and diverse dataset builds better models of human high-level visual cortex, e.g. fig. 2). Furthermore, they use the same model (Stable Diffusion with CLIP) as Concept Algebra for (Score-Based) Text-Controlled Generative Models, which both provides theory for and empirically demonstrates activation-engineering-style linear manipulations. All this suggests that manipulations like those in Concept Algebra for (Score-Based) Text-Controlled Generative Models would also work when applied directly to the fMRI representations used to decode the text latents c in High-resolution image reconstruction with latent diffusion models from human brain activity.
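The argument rests on linearity. If the map from fMRI to the text latents c is linear, an edit x + αv in fMRI space becomes exactly the edit c + α(vW) in latent space, so a concept direction estimated in the brain lands on some direction in c-space. Below is a minimal NumPy sketch. It assumes a closed-form ridge decoder as a stand-in for the linear decoder (this is not that paper’s code) and a purely synthetic linear encoding model, and all names and sizes are hypothetical.

```python
import numpy as np


def fit_ridge(X: np.ndarray, Y: np.ndarray, lam: float):
    """Closed-form ridge regression with an intercept: Y ≈ X @ W + b."""
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - x_mean, Y - y_mean
    W = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ Yc)
    b = y_mean - x_mean @ W
    return W, b


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Synthetic setup: latents c (16-D) drive fMRI x (40 "voxels") through a linear encoding model.
rng = np.random.default_rng(3)
n_trials, d_latent, n_voxels = 1000, 16, 40
d_concept = rng.normal(size=d_latent)
d_concept /= np.linalg.norm(d_concept)    # hypothetical concept direction in c-space
has_concept = np.arange(n_trials) % 2 == 1
C = rng.normal(size=(n_trials, d_latent)) + 2.0 * has_concept[:, None] * d_concept
A = rng.normal(size=(d_latent, n_voxels))  # hypothetical encoding weights
X = C @ A + 0.5 * rng.normal(size=(n_trials, n_voxels))

W, b = fit_ridge(X, C, lam=1.0)  # linear decoder: fMRI -> c
v_fmri = X[has_concept].mean(axis=0) - X[~has_concept].mean(axis=0)

x = X[0]
alpha = 1.5
shift = ((x + alpha * v_fmri) @ W + b) - (x @ W + b)  # change in decoded c
print(np.allclose(shift, alpha * (v_fmri @ W)))  # True: the edit passes straight through
print(cosine(v_fmri @ W, d_concept))             # close to 1 in this synthetic setup
```

The first check holds for any linear decoder by construction. The second, how well the decoded brain direction lines up with a concept direction in c-space, is the part that real data would have to confirm.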
It turns out someone has already done a similar experiment (vector arithmetic in neural space; latent traversals too) in a restricted domain (face processing) with another model (a GAN), and it seemed to work: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1012058 https://github.com/neuralcodinglab/brain2gan/blob/main/figs_manuscript/Fig12.png https://openreview.net/pdf?id=hT1S68yza7
This is also a positive update for me on interdisciplinary conceptual alignment being automatable differentially soon. That has long seemed plausible to me: LLMs have ‘read the whole internet’, and interdisciplinary insights often seem (to me) to require relatively few inferential hops (plausibly because it’s hard for humans to have [especially deep] expertise in many different domains), which makes them potentially feasible for LLMs differentially early (reliably making long inferential chains still seems among the harder things for LLMs).