In an fMRI scan you can see which parts of the brain light up in response to a stimulus. This has proven invaluable for understanding brains.
Is there an equivalent thing you can do with deep learning models, where you can see which parts light up in response to stimuli? And do good UIs exist to explore this? Such a technique seems like it would be invaluable for understanding deep learning models, and possibly for alignment.
Most neural networks don’t have anything comparable to specialised brain areas, at least structurally, so you can’t see which areas light up given some stimulus to determine what each part does. You can do it with individual neurons or channels, though. The best UI I know of for exploring this is the “Dataset Samples” option in the OpenAI Microscope, which shows which inputs most strongly activate each unit.
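To make the per-unit version concrete, here’s a minimal sketch of the idea in NumPy: record a hidden layer’s activations over a batch of stimuli, then ask which stimulus most strongly activates each unit. The network, weights, and stimuli are all made-up stand-ins, not anything from Microscope itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny two-layer network; random weights are stand-ins.
W1 = rng.normal(size=(8, 4))  # hidden layer with 4 units
W2 = rng.normal(size=(4, 2))

def forward(x, record):
    # ReLU activations: the analogue of brain areas "lighting up".
    h = np.maximum(x @ W1, 0.0)
    record.append(h)
    return h @ W2

# A batch of 16 stimuli (random vectors standing in for dataset samples).
stimuli = rng.normal(size=(16, 8))
record = []
forward(stimuli, record)
h = record[0]  # shape (16, 4): activation of each hidden unit per stimulus

# For each hidden unit, find its maximally-activating stimulus --
# a toy version of Microscope's "Dataset Samples" view.
top_stimulus = h.argmax(axis=0)  # one stimulus index per unit
```

In a real framework like PyTorch you’d capture `h` with a forward hook rather than threading a `record` list through the model, but the analysis is the same: rank inputs by how strongly they drive each unit.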
The most similar analysis tool I’m aware of is called an activation atlas (https://distill.pub/2019/activation-atlas/), though I’ve only seen it applied to visual networks. Would love to see it used on language models!