Admittedly I just skimmed the abstracts you linked, so maybe I was too hasty there, but I’d want to see evidence that (a) a language model is representing concept C really well and (b) that concept is really relevant for alignment. I think those papers show something like “you can sort of model brain activations with language model activations” or “there’s some embedding space for what brains are sort of doing in conversation,” which seems like a different thing (unless the fit is good enough that you can reconstruct one from the other without loss of functionality, in which case I’m interested).
Makes sense. Just to clarify, the papers I shared for 1 were mostly meant as methodological examples of how one might go about quantifying brain-LLM alignment. I agree about (b), that they’re not that relevant to alignment (though some other, similar papers make some progress on that front by addressing somewhat more relevant domains/tasks, e.g. emotion understanding, and I have/had an AI Safety Camp ’23 project trying to make similar progress on moral reasoning). W.r.t. (a), you can (also) do decoding (predicting LLM embeddings from brain measurements), the inverse of encoding; this survey, for example, covers both encoding and decoding.
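For concreteness, here’s a minimal sketch of what that encoding/decoding setup typically looks like in code. This is just an illustration of the two mapping directions, assuming ridge regression on synthetic data with made-up dimensions (scikit-learn), not any particular paper’s pipeline:

```python
# Sketch of encoding (LLM embeddings -> brain responses) and decoding
# (brain responses -> LLM embeddings) with ridge regression.
# Shapes, regularization strength, and synthetic data are assumptions for illustration.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_stimuli, emb_dim, n_voxels = 500, 768, 1000

llm_embeddings = rng.standard_normal((n_stimuli, emb_dim))    # per-stimulus LLM features
brain_responses = rng.standard_normal((n_stimuli, n_voxels))  # per-stimulus fMRI-like responses

X_train, X_test, Y_train, Y_test = train_test_split(
    llm_embeddings, brain_responses, test_size=0.2, random_state=0
)

# Encoding model: predict brain responses from LLM embeddings.
encoder = Ridge(alpha=1.0).fit(X_train, Y_train)
encoding_r2 = encoder.score(X_test, Y_test)

# Decoding model: predict LLM embeddings from brain responses (the inverse direction).
decoder = Ridge(alpha=1.0).fit(Y_train, X_train)
decoding_r2 = decoder.score(Y_test, X_test)

print(f"encoding R^2: {encoding_r2:.3f}, decoding R^2: {decoding_r2:.3f}")
```

(On real data the fit quality of these held-out predictions is what the alignment claim rests on; here the data is random, so the scores are meaningless and only the structure matters.)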
For 1 it would depend on how alignment-relevant the concepts and values are. Also I wouldn’t think of the papers you linked as much evidence here.
For 2, that would for sure do it, but it doesn’t feel like much of a reduction.
3 sounds like it’s maybe definitionally true? At the very least, I don’t doubt it much.
Interesting, I’m genuinely curious what you’d expect better evidence to look like for 1.