“Scalability of this approach—can we do this on large models? Scalability of analysis—can we turn a microscopic understanding of large models into a macroscopic story that answers questions we care about?”
“Make this work for real models. Find out what features exist in large models. Understand new, more complex circuits.”
When it comes to manipulation, another recent paper seems more promising IMO! It's like fMRI for LLMs.
“This might be the biggest alignment paper of the year. Everyone has been complaining that mechanistic interpretability is like doing LLM cell microbiology, when what we really need is LLM neuro-imaging. Well now we have it: “representation engineering” Similar to an fMRI scan, CAIS creates the LAT (Linear Artificial Tomography) scan. They also do a form of LLM neuro-modulation, getting the model to be honest or deceptive by just adding in a vector to its activations. imo this could be the winning alignment agenda”
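A minimal sketch of the activation-addition idea the quote describes, not the paper's actual pipeline (the representation-engineering work builds its reading vectors with its own LAT procedure; see the links under "More notes"). Everything concrete here is an assumption for illustration: GPT-2 as the model, layer 6 as the injection point, a single hand-picked contrast pair, and the scaling factor.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder choices: the model, layer, prompts, and scale all need real tuning.
MODEL_NAME = "gpt2"
LAYER = 6
ALPHA = 4.0

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

def last_token_activation(text: str) -> torch.Tensor:
    """Residual-stream activation at LAYER for the final token of `text`."""
    ids = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1]

# One hand-picked contrast pair; the real method averages over many stimuli.
steer = (last_token_activation("Pretend you are an honest person.")
         - last_token_activation("Pretend you are a deceptive person."))

def add_vector(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states;
    # add the steering direction at every position.
    hidden = output[0] + ALPHA * steer.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(add_vector)
prompt = tokenizer("Did you copy your classmate's homework?", return_tensors="pt")
print(tokenizer.decode(model.generate(**prompt, max_new_tokens=30)[0]))
handle.remove()
```

The striking part is how cheap the intervention is: no fine-tuning, just one vector added into one layer's residual stream at generation time.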
We’re at the start of interpretability, but the progress is lovely! Superposition was such a bottleneck even in small models.
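On the superposition point: the linked Olah threads are, as I read them, about resolving it with dictionary learning, i.e. training a sparse autoencoder on a model's activations so that dense, polysemantic activations decompose into many sparse, more interpretable features. A minimal sketch of that idea; the dictionary size, L1 coefficient, and the random stand-in activations are all made up for illustration.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete dictionary: d_model activations -> n_features sparse codes."""
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, x):
        feats = torch.relu(self.encoder(x))   # sparse feature activations
        recon = self.decoder(feats)           # reconstruction of the input
        return recon, feats

def train_sae(acts: torch.Tensor, n_features: int = 4096,
              l1_coeff: float = 1e-3, steps: int = 1000) -> SparseAutoencoder:
    """acts: (n_samples, d_model) activations collected from the model elsewhere."""
    sae = SparseAutoencoder(acts.shape[1], n_features)
    opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
    for _ in range(steps):
        batch = acts[torch.randint(0, acts.shape[0], (256,))]
        recon, feats = sae(batch)
        # Reconstruction loss plus an L1 sparsity penalty on the features.
        loss = ((recon - batch) ** 2).mean() + l1_coeff * feats.abs().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return sae

# Random data standing in for real activations, just to show the shapes.
sae = train_sae(torch.randn(10_000, 512))
```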
More notes:
https://twitter.com/ch402/status/1710004685560750153
https://twitter.com/ch402/status/1710004416148058535
https://twitter.com/mezaoptimizer/status/1709292930416910499
https://arxiv.org/abs/2310.01405
https://ai-transparency.org