Inducing human-like biases in moral reasoning LMs

Meta. This post is less polished than we would ideally prefer. However, we still think publishing it as is is reasonable, to avoid further delays. We are open to answering questions and to feedback in the comments.
TL;DR. This post presents an inconclusive attempt to create a proof of concept that fMRI data from human brains can help improve moral reasoning in large language models.

Code is available at https://github.com/ajmeek/Inducing-human-like-biases-in-moral-reasoning-LLMs.
Introduction

Our initial motivation was to create a proof of concept of applying an alignment research agenda we are particularly interested in, based on neuroconnectionism and brain-LM similarities (and their relevance for alignment): ‘neuroconnectionism as a general research programme centered around ANNs as a computational language for expressing falsifiable theories about brain computation’. Moral reasoning is an interesting application area, both for its relevance to AI alignment and because of the availability of public neuroimaging data, as well as of publicly-available LMs fine-tuned for moral reasoning.

During the last few years, a series of high-profile papers have shown that LMs partially converge towards brain-like solutions and share fundamental computational principles with humans, making them a ‘biologically feasible computational framework for studying the neural basis of language’. For more (recent) evidence of apparent convergence towards human-like representations, see also Scaling laws for language encoding models in fMRI and Large Language Models Converge on Brain-Like Word Representations.

To the best of our awareness, though, the potential LM-brain similarities for linguistic inputs rich in morally-relevant content (e.g. moral scenarios) have not been explored previously. Nor has anyone tried to improve LM moral reasoning using moral reasoning neuroimaging datasets (though similar ideas have been explored for LMs more broadly and e.g. for Convolutional Neural Networks performing object recognition). Some other examples of potential areas of investigation (alignment-relevant human capacities) which also seem very neglected could include instruction following and moral emotions (e.g. compassion, empathy).
Additional conceptual motivations. In a Simulators-like framework, supervised fine-tuning on brain data could both: make (some) simulacra more human-like, as the brain data can provide additional information about what the human distribution of solutions looks like; and condition for more human-like simulacra (among the existing mixture) to be selected—e.g. supervised fine-tuning as Bayesian inference. If the fMRI fine-tuning were successful at making human features more accessible to the LM, it could also potentially lead to solutions using human[-like] features being favored by inductive biases. Theoretical and empirical results suggest that AI representational alignment (with humans) can help with e.g. few shot learning, adversarial examples, anomaly detection.
Our project goal: to show some transfer, when fine-tuning LMs, between a moral reasoning neuroimaging (fMRI) dataset and ETHICS (a behavioral moral reasoning dataset). More precisely: some additional transfer from using the fMRI dataset rather than just the behavioral permissibility moral scores part of the fMRI dataset. In the long run, the story for how this could be useful for reducing x-risk from AGI/TAI could also route through potentially leading to insights about how human moral reasoning functions mechanistically, and through providing evidence about the feasibility of a new, process-based kind of supervision based on neuroimaging data.
Processing fMRI Data and Dataset Particulars
Neuroimaging data is notoriously fickle to work with and was a fairly consistent source of headaches for our group in the beginning and middle of the project. None of us were experts in programmatically dealing with this type of data, except for Seong (who joined later in the summer). So it was a learning experience, and we tried some different things which in hindsight did not work out well. Hopefully writing them out here can be helpful to other teams who might want to ramp up in this sort of research. And of course, let any of us know if you have questions!
To start with, fMRI itself is a proxy signal. It doesn’t actually tell us what specific computations are being performed inside the brain, instead only telling us where blood flows to certain regions of interest, or ROIs. This is why it’s called the BOLD signal, or the blood oxygen level dependent signal. So before any processing takes place, it’s important to keep in mind that there is an upper ceiling on the information content due to the inherent limitations of fMRI.
With that in mind, our initial thought was to preserve as much information as possible from the data we had, since collecting new data was out of the question (collecting this data is expensive due to the need for scanner time, IRB approval, etc.; I (Austin) am working on this as part of an RAship in my PhD right now, so am happy to answer further questions in the comments). We wanted to use either raw data or minimally preprocessed data to accomplish this. It quickly became obvious that working with the raw data was not going to be feasible, for a few reasons. The number of voxels in the raw data could easily be over 200k. Within that, there were many artifacts from subjects’ eyes / noses / skulls / etc., and since people have differently sized heads, not all data was the same length, so a lot of padding or truncating would have been required. So working with the raw data was abandoned soon after the project’s start. Still, we think the raw data could be handled by reducing its size with convolutions, by training a neural net to extract the most useful features, or by truncating or padding the data. The biggest obstacle with the raw data is that fine-tuning an LLM to predict those 10K-200K voxels from the output of a single token (the CLS token, with ~1K dimensions for bert-large) seems too narrow a bottleneck to convey a meaningful signal.
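As a rough illustration of the kind of dimensionality reduction we have in mind (not something we actually ran; all shapes below are made-up examples), a raw 4D BOLD run could be pooled down spatially, or compressed with a small learned 3D convolutional encoder matched to the LM's hidden size:

```python
import torch
import torch.nn.functional as F

# Hypothetical raw BOLD run: (time, x, y, z) voxel grid, e.g. ~150 TRs of a 64x64x48 volume.
bold = torch.randn(150, 64, 64, 48)

# Treat time as the batch dimension and pool spatially (factor 4 per axis),
# reducing ~196K voxels per TR to ~3K values per TR.
pooled = F.avg_pool3d(bold.unsqueeze(1), kernel_size=4)  # (150, 1, 16, 16, 12)
features = pooled.flatten(start_dim=1)                   # (150, 3072)

# A learned alternative: a small 3D conv "encoder" whose output is matched
# to the LM's hidden size (1024 for bert-large) instead of fixed pooling.
encoder = torch.nn.Sequential(
    torch.nn.Conv3d(1, 8, kernel_size=5, stride=4),
    torch.nn.ReLU(),
    torch.nn.Flatten(),
    torch.nn.LazyLinear(1024),
)
compressed = encoder(bold.unsqueeze(1))                   # (150, 1024)
print(features.shape, compressed.shape)
```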
We then moved on to considering how to minimally preprocess the data. Typically this involves some skull stripping and fitting everyone to the same coordinate system (there are several common ones in neuroscience, such as Talairach and MNI152; different datasets may be registered to different coordinate systems, so the most important thing is to maintain consistency in your data handling). This does require some time and computational resources, which we successfully used GCP for. Unfortunately, the remaining problem of reducing the dimensionality of the data down to a reasonable size ended up ruling this option out too.
In the end, we needed some set of ROIs, with a higher number of dimensions being better in our opinion. We were lucky to find that the exact moral scenario dataset we wanted had already been pre-processed by another computational neuroscience group for a different study (Self-supervised learning of brain dynamics from broad neuroimaging data), and that they had made the processed data available for download. The atlas they had fit to the data was DiFuMo with 1,024 ROI dimensions. The DiFuMo atlas has some advantages over non-probabilistic atlases, namely a sharper correspondence between functional activity and its anatomical localization. This worked out well for us, but it took us a fair amount of time and trial and error (and luck) to find this solution. For other groups attempting to use neuroimaging data in their research, we’d recommend not attempting the pre-processing yourselves if you can find data from an academic lab or a database that has already taken care of many of those issues.
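For groups that do want to run the parcellation themselves, a minimal nilearn sketch looks roughly like this; the preprocessed BOLD file path is hypothetical, and this is the standard DiFuMo workflow rather than the exact pipeline of Thomas et al.:

```python
from nilearn.datasets import fetch_atlas_difumo
from nilearn.maskers import NiftiMapsMasker

# Fetch the probabilistic DiFuMo atlas with 1,024 functional modes.
difumo = fetch_atlas_difumo(dimension=1024, resolution_mm=2)

# Project a (hypothetical) preprocessed BOLD run onto the 1,024 modes,
# yielding one 1,024-dimensional vector per TR.
masker = NiftiMapsMasker(maps_img=difumo.maps, standardize=True, detrend=True)
roi_timeseries = masker.fit_transform("sub-01_task-moral_bold_preproc.nii.gz")
print(roi_timeseries.shape)  # (n_timepoints, 1024)
```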
That being said, choosing an atlas requires some careful consideration. Not all atlases are equal, and most don’t map cleanly onto each other. Fitting an atlas also takes some computational resources, which means you’d likely not want to just try many different ones to see what works best. Mapping between atlases is difficult, but there has been some work on it; as far as we know, though, there has not been similar work for probabilistic atlases, which DiFuMo is. Since specific ROIs were of interest to us for their potential relevance to theory of mind or morality tasks, such as the right temporo-parietal junction (rTPJ), we had some difficulties mapping the 1,024 DiFuMo ROIs to this more traditional anatomical parcellation. We used NeuroSynth, a meta-analysis conducted over thousands of fMRI studies that isolates the regions consistently activated during fMRI experiments. These activations are mapped to a term, producing term-based activation maps. We conducted our analyses on 4 regions related to theory of mind (ToM), moral reasoning, language, and vision. Vision was the control, because we expected scores not to increase in vision areas. We visualized the relationship between the fMRI data and the LLM on the cortical surface using the coefficient of determination (CoD). We refer to these as CoD scores, as opposed to the scores generally calculated using Pearson’s r. We negative-log transformed the CoD scores and took the weighted average of the parcel scores at each vertex; this was required because the DiFuMo atlas is probabilistic, with overlapping boundaries.
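To make the last step concrete, here is a small numpy sketch of projecting per-parcel scores to surface vertices when the atlas is probabilistic (overlapping, weighted parcels). The array names, shapes, and the clipping inside the log are illustrative assumptions, not our exact code:

```python
import numpy as np

n_parcels, n_vertices = 1024, 10242     # e.g. one fsaverage5 hemisphere
rng = np.random.default_rng(0)

cod = rng.uniform(0.01, 0.4, size=n_parcels)     # per-parcel CoD (R^2) scores
weights = rng.random((n_parcels, n_vertices))    # probabilistic parcel loadings per vertex

# Negative-log transform of the CoD scores (clipping at a small positive value is an
# assumption; the post does not specify how non-positive CoD values were handled),
# then a weights-normalized average over the overlapping parcels at each vertex.
transformed = -np.log(np.clip(cod, 1e-6, None))
vertex_scores = (weights * transformed[:, None]).sum(axis=0) / weights.sum(axis=0)
print(vertex_scores.shape)  # (10242,)
```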
Overall there are a lot of things to consider when handling fMRI data, and it’s certainly non-trivial. You can learn how to do it, but it will eat up quite a bit of time at the beginning of your project and take some resources if you decide to do everything yourself. For any particular project, it’s best to think about exactly what you need, and it’s likely best if the dataset you want has already been pre-processed by others. As a TL;DR for this section:
Raw neuroimaging data is high dimensional and noisy.
Preprocessing this and getting it ready for integration into an ML pipeline (example: preparing it for HF transformers) is tedious and requires compute.
fMRI time points are limited—once every 2 seconds. Sampling the neuroimaging data over some text shown to the subject can be difficult to time exactly (last TR, average, matching ends of sentences plus hemodynamic lag, etc.).
Model selection and choice of fine-tuning technique for fMRI data is complicated and pretty data dependent.
Hyperparameter selection, compute usage / infrastructure / parallel processing for large datasets are also things that can be a source of difficulty.
In short, you’re likely to run into all the usual sources of problems in ML engineering, as well as some particular problems due to the peculiarities of neuroimaging data. Hopefully we’ve covered some common pitfalls here, but as always please reach out if you’d like to know more.
Moral Reasoning Benchmarks
In order to quantitatively measure the moral reasoning performance of our LLMs, we used the ETHICS dataset (Hendrycks et al., 2020). It contains questions split across different ethical frameworks, such as deontology and utilitarianism, as well as a commonsense morality split. We used the commonsense split of the data.

ETHICS consists of multiple choice questions, rather than free form responses. We used the [CLS] head to predict an answer to each question.

Our fMRI dataset of choice was “Moral judgments of intentional and accidental moral violations across Harm and Purity domains”, collected approximately 13 years ago at MIT and available on openneuro.org with the dataset id ds000212. Subjects were given a series of scenarios describing moral, immoral, and neutral actions across a wide variety of rights and wrongs, and then answered on a scale of 1-4 how moral each scenario was.

As an example:
Credit: Koster-Hale J, Saxe R, Dungan J, Young LL. Decoding moral judgments from neural representations of intentions. Proc Natl Acad Sci U S A. 2013 Apr 2;110(14):5648-53. doi: 10.1073/pnas.1207992110. Epub 2013 Mar 11. PMID: 23479657; PMCID: PMC3619352.
This was helpful for us, not necessarily because any one subject was always correct in their judgments of morality (a problem many thousands of years old…), but because we had a source of routine neural activations in response to ethical scenarios.
Fine-tuning Process
We fine-tuned our models on the fMRI (Thomas, Ré and Poldrack, 2023) and ETHICS (Hendrycks et al., 2020) datasets, with text as input and either the fMRI vector or a class (commonsense label, behavior key, etc.) as labels. We used one model per run, with different additional heads attached to the model output. We divided our datasets into train, validation, and test subsets and tracked several metrics (MSE, cosine similarity, AUROC, etc.), but we report only accuracy. The loss function combines MSE and cross entropy (binary for Commonsense). We used sweeps, automatic finders (learning rate and batch size), and recommendations from the original papers (learning rates) to determine hyperparameters.
We only targeted encoder models (BERT-based) because we had limited resources and our task was classification (on the Commonsense Morality subset). Encoders showed better results for this as per (Hendrycks et al., 2020), e.g. GPT-3 few-shot is 15-30% behind. Overall we used four models: bert-base-cased, bert-large-cased (Devlin et al., 2019), roberta-large (Liu et al., 2019), and deberta-v2-xlarge (He et al., 2021).
To fine-tune the models, we used extra heads (linear layers) on top of the output for the first token, the classification token [CLS], which has the model’s hidden dimensionality (768 for bert-base, 1024 for bert-large and roberta-large, 1536 for deberta-v2-xlarge). We haven’t tested our models without fine-tuning, i.e. using only the logits of that token. The fMRI head output dimension was 1024 for the DiFuMo format (Thomas, Ré and Poldrack, 2023), or thousands for other formats. We ended up not using those other formats because the data was noisier and less clean (e.g. varied dimensionality).
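A minimal sketch of this head and loss setup (our actual code differs in details and is wrapped in Lightning, as sketched further below): one linear head for the binary Commonsense label and one for the 1,024-dimensional DiFuMo targets, both reading the [CLS] output.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class MultiHeadMoralModel(nn.Module):
    """BERT encoder with a classification head (ETHICS) and a regression head (fMRI)."""

    def __init__(self, base_model="bert-base-cased", fmri_dim=1024):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(base_model)
        hidden = self.encoder.config.hidden_size
        self.cls_head = nn.Linear(hidden, 1)          # binary Commonsense label
        self.fmri_head = nn.Linear(hidden, fmri_dim)  # DiFuMo ROI targets

    def forward(self, **inputs):
        cls = self.encoder(**inputs).last_hidden_state[:, 0]  # [CLS] token output
        return self.cls_head(cls).squeeze(-1), self.fmri_head(cls)

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = MultiHeadMoralModel()
batch = tokenizer(["I returned the wallet I found."], return_tensors="pt")
logits, fmri_pred = model(**batch)

# Binary cross-entropy for the ETHICS label, MSE for the fMRI targets (placeholder labels here).
ethics_loss = nn.functional.binary_cross_entropy_with_logits(logits, torch.tensor([0.0]))
fmri_loss = nn.functional.mse_loss(fmri_pred, torch.zeros(1, 1024))
loss = ethics_loss + fmri_loss
```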
We converted our fMRI dataset into a format suitable for use with HuggingFace models and APIs, i.e. we used the datasets.load_dataset method and the associated API for filtering, preprocessing, conversion, etc. For the raw data (without conversion to the DiFuMo format), we tried several approaches but ended up using a makefile with bash and Python scripts. The make tool gave us the ability to process data in parallel: on a machine with dozens of CPUs, we could process the ~10 GB ds000212 dataset within a few minutes, which allowed us to experiment with different data processing scenarios.
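As a sketch of what the converted dataset looks like on the HuggingFace side (the field names and example values here are illustrative; the real ds000212 processing is in our repo):

```python
from datasets import Dataset

# One example per (scenario, fMRI sampling point): the scenario text plus a
# 1,024-dim DiFuMo vector and the behavioral key the participant pressed.
examples = {
    "text": ["A man sees a child drowning and walks away.",
             "She returned the extra change to the cashier."],
    "fmri": [[0.0] * 1024, [0.0] * 1024],   # placeholder DiFuMo vectors
    "behavior_key": [1, 4],                 # 1-4 permissibility rating
}
ds = Dataset.from_dict(examples)
ds = ds.filter(lambda ex: ex["behavior_key"] is not None)
ds = ds.train_test_split(test_size=0.2, seed=42)
print(ds)
```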
The fine-tuning itself was done in different environments (laptop or cloud, CPU or GPU/TPU, parallel or not). The difficulty was that we wanted to fine-tune a model on two datasets, simultaneously or in different orders, and each dataset could have one or several heads; e.g. for the ds000212 fMRI dataset, there is an extra head for behavior keys. We solved this using abstractions provided by the Lightning framework. We used machines with 8 GB to 48 GB of video RAM per GPU, with up to 4 GPUs in parallel (distributed data-parallel). In total, we did 450 fine-tuning runs (not all are shown in our results section), amounting to 292 hours of training across 1,082 model instances of the same or different types.
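Roughly, the Lightning setup dispatches on which labels a batch carries, so the same module can be trained on ETHICS batches, fMRI batches, or both. This is a simplified sketch of that pattern (batch structure and hyperparameters are assumptions), not our exact module:

```python
import lightning as L
import torch

class MoralFinetuner(L.LightningModule):
    def __init__(self, model):  # e.g. the MultiHeadMoralModel sketched above
        super().__init__()
        self.model = model

    def training_step(self, batch, batch_idx):
        logits, fmri_pred = self.model(**batch["inputs"])
        loss = 0.0
        if "ethics_label" in batch:   # ETHICS/Commonsense batch
            loss = loss + torch.nn.functional.binary_cross_entropy_with_logits(
                logits, batch["ethics_label"].float())
        if "fmri_target" in batch:    # ds000212 batch (DiFuMo vectors)
            loss = loss + torch.nn.functional.mse_loss(fmri_pred, batch["fmri_target"])
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=2e-5)

# Distributed data-parallel training across several GPUs looks roughly like:
# trainer = L.Trainer(accelerator="gpu", devices=4, strategy="ddp", precision="16-mixed")
# trainer.fit(MoralFinetuner(model), train_dataloaders=...)
```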
Experimental Results
| Model | Params, mln | On ETHICS only | Runs (count) | Commonsense Hard Set (mean) | Commonsense Hard Set (max) | Commonsense Test Set (mean) | Commonsense Test Set (max) |
|---|---|---|---|---|---|---|---|
| Random baseline | | | | 50% | | 50% | |
| bert-base-cased | 108 | | 35 | 47.3% ± 14.7% | *55.5% | 57.0% ± 7.3% | 73.7% |
| bert-base-cased | 108 | ✓ | 7 | 52.3% ± 2.3% | 55.4% | 58.3% ± 8.8% | 71.7% |
| bert-large-cased | 333 | | 27 | 53.9% ± 3.2% | *61.8% | 62.5% ± 11.4% | 85.4% |
| bert-large-cased | 333 | ✓ | 19 | 52.7% ± 3.0% | 58.8% | 59.9% ± 10.4% | 79.3% |
| roberta-large | 355 | | 13 | 61.2% ± 11.0% | *73.3% | 71.3% ± 20.9% | *91.8% |
| roberta-large | 355 | ✓ | 14 | 65.9% ± 10.5% | *74.1% | 78.6% ± 18.8% | *91.3% |
| deberta-v2-xlarge | 884 | | 10 | 54.3% ± 10.6% | 76.6% | 56.6% ± 13.2% | 89.9% |
| deberta-v2-xlarge | 884 | ✓ | 3 | 59.5% ± 16.7% | 78.8% | 64.1% ± 24.4% | 92.3% |
Table 1. Results of fine-tuning 4 different models (bert-base-cased, bert-large-cased, roberta-large, deberta-v2-xlarge) on the ETHICS/Commonsense dataset and on the ds000212 dataset. Values marked with (*) are higher than those reported by Hendrycks et al. (2020).
Table 1 presents our results from fine-tuning on fMRI and ETHICS vs. fine-tuning only on ETHICS. The idea was that fine-tuning on brain data from moral reasoning processes might improve the ETHICS score. The results suggest no consistent improvement in accuracy on the ETHICS/Commonsense dataset; we couldn’t achieve a consistent improvement over fine-tuning only on ETHICS for any of the models we tried. A notable empirical exception is the bert-large-cased model, where we see a small improvement of about 1% on the Hard Set and 2.5% on the Test Set, but this is within the error range and based on only 46 runs. We currently can’t say exactly what change in the fine-tuning process, data, or elsewhere produced it. Larger models perform better overall, as was also shown by Hendrycks et al. (2020). To the best of our knowledge, there are no previously reported results with higher accuracy on the Hard and Test sets than our best runs, i.e. 78.8% and 92.3% respectively with deberta-v2-xlarge.
Figure 1. Accuracy for the Commonsense dataset. Data from Table 1.
Table 2 shows results for the different sampling methods, i.e. how we sampled the fMRI data from a particular run and scenario. AVG: the average of all time points. LAST: the time point at the hemodynamic-lag distance before the last time point. MIDDLE: the middle time point. SENTENCES: four time points in a scenario’s sequence that match the ends of the four sentences read by participants. LAST gave the best accuracy for 3 out of 4 models.
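A sketch of the four sampling strategies on a single scenario's ROI time series (the lag of 3 TRs and the exact indexing are illustrative assumptions; our pipeline may differ in details):

```python
import numpy as np

TR = 2.0                   # seconds between fMRI time points
HEMODYNAMIC_LAG_TRS = 3    # assumed lag of ~6 s expressed in TRs

def sample_scenario(roi_ts, sentence_end_trs, method):
    """roi_ts: (n_timepoints, 1024) DiFuMo time series for one scenario."""
    if method == "AVG":
        return roi_ts.mean(axis=0)
    if method == "LAST":
        return roi_ts[-1 - HEMODYNAMIC_LAG_TRS]
    if method == "MIDDLE":
        return roi_ts[len(roi_ts) // 2]
    if method == "SENTENCES":
        idx = [min(t + HEMODYNAMIC_LAG_TRS, len(roi_ts) - 1) for t in sentence_end_trs]
        return roi_ts[idx]        # four vectors, one per sentence
    raise ValueError(method)

roi_ts = np.random.randn(12, 1024)   # e.g. a 24 s scenario at TR = 2 s
print(sample_scenario(roi_ts, sentence_end_trs=[2, 4, 7, 9], method="LAST").shape)
```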
| Model | Params, mln | Sampling | Runs (count) | Commonsense Hard Set (mean) | Commonsense Hard Set (max) | Commonsense Test Set (mean) | Commonsense Test Set (max) |
|---|---|---|---|---|---|---|---|
| bert-base-cased | 108 | AVG | 2 | 52.8% ± 3.9% | 55.5% | 61.8% ± 16.7% | 73.7% |
| bert-base-cased | 108 | LAST | 28 | 46.0% ± 16.3% | 53.6% | 57.4% ± 7.2% | 70.0% |
| bert-large-cased | 333 | AVG | 13 | 53.5% ± 3.4% | 59.6% | 61.1% ± 11.0% | 77.7% |
| bert-large-cased | 333 | LAST | 7 | 54.5% ± 4.8% | 61.8% | 65.1% ± 14.8% | 85.4% |
| bert-large-cased | 333 | MIDDLE | 3 | 53.4% ± 3.1% | 57.0% | 60.2% ± 12.4% | 74.5% |
| bert-large-cased | 333 | SENTENCES | 1 | 51.6% | 51.6% | 53.4% | 53.4% |
| roberta-large | 355 | AVG | 12 | 53.4% ± 8.0% | 72.5% | 56.6% ± 15.7% | 90.9% |
| roberta-large | 355 | LAST | 8 | 60.6% ± 11.6% | 73.3% | 69.2% ± 20.7% | 91.4% |
| roberta-large | 355 | SENTENCES | 4 | 54.9% ± 9.8% | 69.6% | 60.2% ± 19.8% | 89.9% |
| deberta-v2-xlarge | 884 | AVG | 1 | 48.4% | 48.4% | 49.9% | 49.9% |
| deberta-v2-xlarge | 884 | LAST | 4 | 56.3% ± 13.6% | 76.6% | 65.0% ± 19.1% | 89.9% |
Table 2. Comparison of different sampling methods for fine-tuning on the ds000212 dataset. Best accuracies per model in bold.
Figure 2. Comparison of different methods of sampling, data from Table 2.
| Model | Params, mln | Sampling | Fine-tuning | Brain Score (μ) | Brain Score (σ) |
|---|---|---|---|---|---|
| bert-large-cased | 108 | LAST | None | 0.217 | 0.096 |
| bert-large-cased | 108 | LAST | ETHICS ↔ fMRI | 0.213 | 0.095 |
| roberta-large | 355 | LAST | None | 0.173 | 0.117 |
| roberta-large | 355 | LAST | ETHICS | 0.145 | 0.097 |
| roberta-large | 355 | LAST | ETHICS ↔ fMRI | 0.156 | 0.112 |
| roberta-large | 355 | LAST | ETHICS → fMRI | 0.144 | 0.113 |
| deberta-v2-xlarge | 884 | LAST | None | 0.271 | 0.094 |
| deberta-v2-xlarge | 884 | LAST | ETHICS | 0.266 | 0.095 |
| deberta-v2-xlarge | 884 | LAST | ETHICS ↔ fMRI | 0.273 | 0.096 |
| deberta-v2-xlarge | 884 | LAST | ETHICS → fMRI | 0.264 | 0.097 |
| deberta-v2-xlarge | 884 | LAST | fMRI → ETHICS | 0.237 | 0.097 |
Table 3. Brain scores across models and different fine-tuning methods. Fine-tuned scores that are significantly higher than the pre-trained scores are indicated in bold.
Figure 3. Brain scores across the hidden layers from bert-large-cased, roberta-large, and deberta-v2-xlarge across the various fine-tuning protocols.
Our experiments found that brain scores for the various combinations of ETHICS- and fMRI-fine-tuned models did not significantly improve over the pre-trained models (Table 3), contrary to what we had hoped. This is also consistent in the layer-wise scores for each of the 3 models, where the pre-trained models remained similar to or higher than the various fine-tuned models across all layers.

Score differences between the pre-trained models and the ETHICS- and fMRI-fine-tuned models did not show any noticeable qualitative improvement for regions associated with Theory of Mind and morality (Figures A1-A8). A layer-wise analysis gave similar findings: except for bert-large-cased, none of the subject-averaged layer scores were greater for the fine-tuned model than for the pre-trained model (Figures A10-A17). For bert-large-cased, the layers where the fine-tuned model scored higher do not exhibit a clear pattern.
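For readers unfamiliar with brain scores, the usual encoding-model recipe is to fit a regularized linear map from the LM's activations for each scenario to the corresponding DiFuMo responses, and to score held-out predictions with the coefficient of determination per parcel. The sketch below is a generic version of that recipe (ridge regression, a single split, random placeholder data), not necessarily our exact cross-validation setup:

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

# X: one LM activation vector per scenario (e.g. hidden states of one layer).
# Y: the corresponding 1,024-dim DiFuMo responses (e.g. the LAST sampling point).
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 1024))
Y = rng.standard_normal((200, 1024))

X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.25, random_state=0)
ridge = RidgeCV(alphas=np.logspace(-1, 4, 10)).fit(X_tr, Y_tr)
pred = ridge.predict(X_te)

# Coefficient of determination (CoD / R^2) per parcel; a reported brain score
# is a summary (e.g. mean over parcels and subjects) of these values.
ss_res = ((Y_te - pred) ** 2).sum(axis=0)
ss_tot = ((Y_te - Y_te.mean(axis=0)) ** 2).sum(axis=0)
cod_per_parcel = 1.0 - ss_res / ss_tot
print(cod_per_parcel.shape, cod_per_parcel.mean())
```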
Related Work
There has been a fair amount of previous brain-model alignment work, with the earliest paper (to the best of our knowledge) that uses neuroimaging data to directly bias a model being Fong et al. Other early work has often focused on correspondences between vision models and the human visual system. As a one-stop shop for checking out different vision models, ranked by their brain score, visit brain-score.org, and read the paper by the same team detailing their approach.
Large language models have also attracted a great deal of attention from those interested in brain-model alignment. Schwartz et al. were the first to fine-tune a transformer architecture (BERT) to predict fMRI data. They found that the predictive effect transferred across different human subjects and was aided by the addition of MEG data, suggesting that the model was capturing information about brain activity beyond the particularities of a specific imaging modality.
Other studies have shown increased neural correspondence of models after fine-tuning on different datasets: Aw and Toneva fine-tuned on the BookSum dataset and showed improved brain alignment against a narrative fMRI dataset.
Dapello et al. maximized similarity to rhesus macaque inferior temporal (IT) cortex data during training, and showed that their new models not only had improved alignment with human neural data but were also more robust to adversarial attacks.
Future work
We would be excited to see future work that continues or expands what we did so far. This might include further investigation of fMRI data preprocessing for fine-tuning. The raw data from the ds000212 dataset and its derivatives (see the ds000212-fmriprep dataset) could be processed into a form better suited for model training. The size of the data should be reduced to fit the hidden size of a model using convolutions, padding, truncation, or other methods; convolutions in particular could use the spatial information in fMRI to retain a better training signal. Various other datasets could also be tried for fine-tuning, e.g. ds000109 (Moran, Jolly and Mitchell, 2012) or the Narratives dataset (Nastase et al., 2021). In addition, ds000212 has a theory-of-mind part which we didn’t use for our fine-tuning; those fMRI runs could be used for further experiments. ds000212 also records the behavior keys which participants pressed; we modified our code to use this field, but we didn’t run experiments to test whether it correlates with a better ETHICS score or brain score. Finally, we didn’t use the ASD participants’ data from the ds000212 dataset for fine-tuning.
Another direction is leveraging other language models. For example, autoregressive models showed better brain score values as per (Schrimpf et al., 2021). The head for fMRI data could then be placed on top of the residual stream and the model fine-tuned further.
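As a sketch of that direction (hypothetical; not something we implemented), one could read out the residual stream of an autoregressive model at a chosen layer and final token, and regress it onto the DiFuMo vector:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True)
fmri_head = torch.nn.Linear(model.config.hidden_size, 1024)

batch = tokenizer(["He kept the lost wallet and told no one."], return_tensors="pt")
out = model(**batch)
# Residual stream after a chosen block (here the middle layer), at the final token.
resid = out.hidden_states[len(out.hidden_states) // 2][:, -1]
fmri_pred = fmri_head(resid)   # (1, 1024); would be trained with MSE against DiFuMo targets
```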
We didn’t try to fine-tune on specific regions of interest (ROIs) from the fMRI data. If particular areas convey a better signal for our task, this could be a promising direction for experiments.
Our code, as well as the fine-tuning process, requires a substantial amount of compute; e.g. it is not possible to use larger batches on common GPUs. Work to optimize the process might therefore include parameter-efficient fine-tuning methods such as IA3, AdaMix (Wang et al., 2022), and others.
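A sketch of what parameter-efficient fine-tuning could look like with the peft library's IA3 implementation (the target module names below are assumptions for a BERT-style encoder and would need checking against the actual model):

```python
from peft import IA3Config, get_peft_model
from transformers import AutoModelForSequenceClassification

base = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

# IA3 learns small rescaling vectors instead of updating all weights.
config = IA3Config(
    task_type="SEQ_CLS",
    target_modules=["key", "value", "output.dense"],   # assumed BERT module names
    feedforward_modules=["output.dense"],
)
peft_model = get_peft_model(base, config)
peft_model.print_trainable_parameters()  # typically well under 1% of the full model
```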
We saw small improvements in accuracy on ETHICS/Commonsense for the bert-large-cased model. It is unclear what conditions led to this and whether it might improve with more fine-tuning.
Conclusion
Unfortunately, despite several months of effort to try to (differentially) induce moral reasoning in large language models through fMRI data, we weren’t able to obtain any significant improvements in our setup. We don’t think this is very strong evidence that it is impossible to use neuroimaging data for alignment[-adjacent] purposes, but it is an update (including on the difficulty of the task). We still think this project might at least help future related work, as we make public the code for fine-tuning, data processing, etc. (see the link at the top of this post). We also tried our best in this post to provide information which might be useful for tracking down any mistakes / suboptimalities in our data processing and fine-tuning pipelines. We hope this work will support further investigation of the neuroconnectionism research direction, and we are open to answering questions and to feedback in the comments.
Appendix

Please see https://docs.google.com/document/d/1SoCd-T6WZvczwmbx_NHO8bjQerk1da0qDuXZBNZTVD4/edit?usp=sharing

Acknowledgements

This work was done with support from AI Safety Camp, especially Remmelt Ellen and Linda Linsefors. Thanks to Linda Linsefors, Paul Bricman, Koen Holtman and Jeremy Gillen for discussions and feedback which helped significantly improve the proposal draft, and to Eleni Angelou for inspiration for the draft format. Bogdan was funded by the Center on Long-Term Risk. Earlier versions of the proposal benefited from Bogdan’s previous [postdoc] appointment, funded by the Leverhulme Trust grant RPG-2019-243, and from discussions with and feedback from collaborators Fabio Cuzzolin, Christelle Langley, and Barbara Sahakian. Artem Karpov was funded by the Good Ventures Foundation through Open Philanthropy’s program “Early-career funding for individuals interested in improving the long-term future”. We thank the team at wandb.ai for giving us free tracking hours to log results from fine-tuning, especially the support from Artsiom Skarakhod. Thanks to Hugo Berg for Google Cloud Platform (GCP) compute credits, and to GCP for free access to TPUs. Thanks to the people and organizations that generously publish the tools and libraries we used for free, especially PyTorch and Lightning.
Contributions (alphabetically by first name, reversed)

Seong Hah Cho – brain score, data analysis, conceptual work.
Raymond Koopmanschap – brain score, most of the potential parameter-efficient fine-tuning extensions, fine-tuning.
Lucy Farnik – potential extensions for better transfer learning / dealing with catastrophic forgetting.
Bogdan Ionut Cirstea – most of the conceptual work, supervision, post editing.
Austin Meek – some of the initial work on data processing, work on brain scores, some infra and experiments, code reviews, wrote parts of this post.
Artem Karpov – implementation of the fine-tuning, experiments for fine-tuning, reports for the fine-tuning experiments, most of the data processing, setting up infrastructure for experiments, code reviews, wrote some parts of this post.
References
Bohland, J.W., Bokil, H., Allen, C.B. and Mitra, P.P. (2009) ‘The brain atlas concordance problem: quantitative comparison of anatomical parcellations’, PLoS ONE, 4(9), e7200.

Dadi, K., Varoquaux, G., Machlouzarides-Shalit, A., Gorgolewski, K.J., Wassermann, D., Thirion, B. and Mensch, A. (2020) ‘Fine-grain atlases of functional modes for fMRI analysis’, NeuroImage, 221, 117126.

Devlin, J. et al. (2019) ‘BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding’. arXiv. Available at: http://arxiv.org/abs/1810.04805 (Accessed: 6 October 2023).

Esteban, O. et al. (2019) ‘fMRIPrep: a robust preprocessing pipeline for functional MRI’, Nature Methods, 16(1), pp. 111–116. Available at: https://doi.org/10.1038/s41592-018-0235-4.

He, P. et al. (2021) ‘DeBERTa: Decoding-enhanced BERT with Disentangled Attention’. arXiv. Available at: http://arxiv.org/abs/2006.03654 (Accessed: 21 October 2023).

Hendrycks, D. et al. (2020) ‘Aligning AI With Shared Human Values’. arXiv. Available at: http://arxiv.org/abs/2008.02275 (Accessed: 5 October 2023).

Liu, Y. et al. (2019) ‘RoBERTa: A Robustly Optimized BERT Pretraining Approach’. arXiv. Available at: https://doi.org/10.48550/arXiv.1907.11692.

Moran, J.M., Jolly, E. and Mitchell, J.P. (2012) ‘Social-cognitive deficits in normal aging’, The Journal of Neuroscience, 32(16), pp. 5553–5561. Available at: https://doi.org/10.1523/JNEUROSCI.5511-11.2012.

Nastase, S.A. et al. (2021) ‘The “Narratives” fMRI dataset for evaluating models of naturalistic language comprehension’, Scientific Data, 8(1), p. 250. Available at: https://doi.org/10.1038/s41597-021-01033-3.

Schrimpf, M. et al. (2021) ‘The neural architecture of language: Integrative modeling converges on predictive processing’, Proceedings of the National Academy of Sciences, 118(45), p. e2105646118. Available at: https://doi.org/10.1073/pnas.2105646118.

Thomas, A.W., Ré, C. and Poldrack, R.A. (2023) ‘Self-Supervised Learning of Brain Dynamics from Broad Neuroimaging Data’. arXiv. Available at: http://arxiv.org/abs/2206.11417 (Accessed: 10 October 2023).

Wang, Y. et al. (2022) ‘AdaMix: Mixture-of-Adaptations for Parameter-efficient Model Tuning’. arXiv. Available at: https://doi.org/10.48550/arXiv.2205.12410.