Austin Meek

Karma: 109

Auditing language models for hidden objectives

Sam Marks, Johannes Treutlein, dmz, Sam Bowman, Hoagy, Carson Denison, Kei, 7vik, Akbir Khan, Austin Meek, Euan Ong, Christopher Olah, Fabien Roger, jeanne_, Meg, Drake Thomas, Adam Jermyn, Monte M and evhub

13 Mar 2025 19:18 UTC

141 points

15 comments, 13 min read

Austin Meek 7 Mar 2024 20:47 UTC
2 points
in reply to: Charlie Steiner’s comment on: Inducing human-like biases in moral reasoning LMs
Hi Charlie, thanks for your comment, and apologies for the late reply. To echo what Artyom said, we didn’t observe a significant difference in brain-score between the models we fine-tuned and the base models. Our fine-tuned models did not become more correlated with the neuroimaging data of subjects taking these moral reasoning tests. It would be neat if this approach eventually works and we can start using more neuroimaging data for alignment, but this initial stab didn’t establish those concrete connections.
To go into a bit more detail on the brain-score metric: we use Pearson’s correlation coefficient (PCC) to measure the correlation. For a moral scenario given to a subject, we take the neuroimaging data at different time points (it’s measured every 2 seconds here, and we experiment with different sampling strategies), after accounting for the hemodynamic delay. This data is then fit to 1,024 dimensions, where each value is the BOLD response at that point. We do this over a small set of similar examples, then fit a regression model to predict the 1,024-dimensional BOLD response vector for a scenario held out from that set. Finally, we take the PCC between the predicted response and the activations at some layer, which gives us the brain-score. We can look at this layer by layer and also aggregate it into a brain-score for the entire model, which is what we report.
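To make the steps above concrete, here is a minimal sketch of a per-layer brain-score on synthetic data. It is not our actual code, and it rests on several assumptions: the regression is taken to map a layer's activations to the 1,024-dimensional BOLD vector, the PCC is computed between the predicted and the measured BOLD response for each held-out scenario, the regression is ridge with an arbitrary `alpha`, and the model-level aggregate is a plain mean over layers. The function names (`fit_ridge`, `pearson`, `layer_brain_score`, `model_brain_score`) are hypothetical.

```python
import numpy as np


def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ Y)


def pearson(a, b):
    """Pearson's correlation coefficient between two 1-D vectors."""
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def layer_brain_score(acts_train, bold_train, acts_test, bold_test, alpha=1.0):
    """Fit activations -> BOLD on training scenarios, then average the PCC
    between predicted and measured BOLD vectors over held-out scenarios."""
    W = fit_ridge(acts_train, bold_train, alpha)
    pred = acts_test @ W
    return float(np.mean([pearson(p, t) for p, t in zip(pred, bold_test)]))


def model_brain_score(per_layer_scores):
    """Aggregate layer scores into one number (mean is an assumption here)."""
    return float(np.mean(per_layer_scores))


# Synthetic stand-ins: 40 scenarios, 16-dim layer activations, 1,024-dim BOLD.
rng = np.random.default_rng(0)
acts = rng.normal(size=(40, 16))
true_W = rng.normal(size=(16, 1024))
bold = acts @ true_W + 0.1 * rng.normal(size=(40, 1024))

# Hold out the last 8 scenarios for scoring.
score = layer_brain_score(acts[:32], bold[:32], acts[32:], bold[32:])
print(f"layer brain-score on synthetic data: {score:.3f}")
```

With real data, `acts` would hold one layer's activations for each scenario and `bold` the delay-corrected BOLD vectors; running `layer_brain_score` once per layer and passing the results to `model_brain_score` gives a single number for the model.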
Hope that helps! Let me know if there’s anything more I can help clarify.