Cool project. I’m not sure if it’s interesting to me for alignment, since it’s such a toy model. What do you think would change when trying to do similar interpretability on less-toy models? What would change about finding adversarial examples? Directly intervening on features seems like it might stay the same though.
I’m not sure if it’s interesting to me for alignment, since it’s such a toy model.
Cruxes here include whether you think toy models are governed by the same rules as larger models, whether studying them helps you understand those general principles, and whether understanding those principles is valuable. This model in particular shares many similarities in architecture and training with LLMs, and at over a million parameters it’s not nearly as much of a toy model as others; we also have particular reasons to expect insights to transfer (both are transformers / next-token predictors).
What do you think would change when trying to do similar interpretability on less-toy models?
The recipe stays mostly the same, but the scale increases and you know less about the training distribution.
Features: Feature detection in LLMs via sparse autoencoders seems highly tractable (a rough sketch follows these points). There may be more features and you might have less of a sense of the overall training distribution, but once you collapse the latent space into features, that goes a long way toward dealing with the curse of dimensionality in these systems.
Training Data: We know much less about the training distribution of larger models (i.e., what the ground-truth features are and how they correlate or anti-correlate).
Circuits: This investigation treats circuits as a black box, but larger models will likely solve more complex tasks with more complicated circuitry. The cool thing about knowing the features is that you can get fairly deep insights even without understanding the circuits (like showing which observations are effectively equivalent to the model).
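To make the Features and Circuits points a bit more concrete, here is a minimal sketch (in PyTorch) of the kind of sparse-autoencoder feature detection I have in mind, plus a check for whether two observations look effectively equivalent to the model. Everything here is illustrative: the dimensions, the training loop, and the assumption that you've already collected residual-stream activations are placeholders, not details of the project above.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy sparse autoencoder: reconstruct model activations through an
    overcomplete ReLU bottleneck, with an L1 penalty to keep features sparse."""

    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))  # sparse feature activations
        recon = self.decoder(features)             # reconstruction of the activations
        return features, recon


def train_sae(activations: torch.Tensor, d_features: int,
              l1_coeff: float = 1e-3, steps: int = 1000) -> SparseAutoencoder:
    """activations: (n_samples, d_model) hidden states collected from the model."""
    sae = SparseAutoencoder(activations.shape[1], d_features)
    opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
    for _ in range(steps):
        features, recon = sae(activations)
        loss = ((recon - activations) ** 2).mean() + l1_coeff * features.abs().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return sae


def effectively_equivalent(sae: SparseAutoencoder, acts_a: torch.Tensor,
                           acts_b: torch.Tensor, atol: float = 1e-2) -> bool:
    """Two observations are 'equivalent to the model' (at this layer) if their
    sparse feature activations match, even when the raw inputs look different."""
    f_a, _ = sae(acts_a)
    f_b, _ = sae(acts_b)
    return torch.allclose(f_a, f_b, atol=atol)
```

The point is just that once features exist, questions like "does the model treat these two inputs the same?" become direct computations on feature activations rather than behavioural probes.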
What would change about finding adversarial examples?
This is a very complicated/broad question, and there are a number of ways you could approach it. I’d probably look at identifying critical features in the language model and see whether we can develop automatic techniques for flipping them. This could be done recursively if you are able to find the features most important for those features (and so on). Understanding why existing adversaries like jail-breaking techniques / initial affirmative responses work (mechanistically) might tell us a lot about how to automate a more general search for adversaries. My guess is that finding adversaries with white-box approaches may be fairly tractable: the search space is much smaller once you know the features, and there are many search strategies that might work to flip them (possibly working recursively through the features in each layer, guided by some regularization designed to keep adversaries naturalistic/plausible).
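As a rough illustration of what that white-box search could look like (reusing the SAE sketch above): optimise a small perturbation of the input embeddings to drive a chosen feature toward a target value, with a distance penalty as the "keep it naturalistic" regularizer. The `activations_at` helper and every hyperparameter here are hypothetical placeholders, not an existing API.

```python
import torch

def flip_feature(model, sae, x_embed: torch.Tensor, layer: int, feature_idx: int,
                 target: float = 0.0, reg: float = 0.1,
                 lr: float = 1e-2, steps: int = 200) -> torch.Tensor:
    """Gradient search for a perturbation that pushes one SAE feature toward
    `target` while staying close to the original input (naturalistic adversary).

    Assumes `model.activations_at(x, layer)` returns the residual stream at
    `layer` for embedded input `x` -- a hypothetical hook-based helper.
    """
    delta = torch.zeros_like(x_embed, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        acts = model.activations_at(x_embed + delta, layer)
        features, _ = sae(acts)
        feature_loss = (features[..., feature_idx] - target).pow(2).mean()
        naturalness = delta.pow(2).mean()  # regularizer: stay close to the original input
        loss = feature_loss + reg * naturalness
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (x_embed + delta).detach()
```

The recursive version would rerun the same loop on whichever upstream features most influence the one you just flipped.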
Directly intervening on features seems like it might stay the same though.
This doesn’t seem super obvious if features aren’t orthogonal, or if they exist in subspaces or manifolds rather than as individual directions. The fact that this transition isn’t trivial is one reason it would be better to understand some simple models very well (so that when we go to larger models, we’re on surer scientific footing).
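To spell out why it isn’t automatic, here is roughly what "intervening on a feature" looks like as activation steering along one SAE decoder direction (again reusing the sketch above). This is a sketch under the assumption that features are decoder directions; the `overlaps` value is the caveat made concrete: if decoder directions aren’t orthogonal, steering one feature also moves the readout of its correlated neighbours.

```python
import torch

def steer_feature(acts: torch.Tensor, sae, feature_idx: int, scale: float):
    """Add `scale` units of one feature's decoder direction to the activations,
    and report how much every other feature direction overlaps with it."""
    direction = sae.decoder.weight[:, feature_idx]  # (d_model,) decoder column for this feature
    direction = direction / direction.norm()
    steered = acts + scale * direction

    # Cosine overlap with every decoder direction: nonzero entries (other than
    # feature_idx itself) are the features you nudge as a side effect.
    overlaps = (sae.decoder.weight.T @ direction) / sae.decoder.weight.norm(dim=0)
    return steered, overlaps
```

In a toy model with near-orthogonal features those side effects are negligible; in a large model with heavy superposition they may not be, which is exactly the kind of thing you’d rather learn on the simple model first.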