Having already played with this a little, I think it’s pretty amazing. The range of concepts you can find in the SAE, how clearly the autointerp has labelled them, how easy they are to find, and how effective they are (as long as you don’t turn them up too much) are all really impressive. I can’t wait to try this on a production model, where you could set up sensors and alarms on features, clip or ablate them, wire them together at various layers, and so forth. It will also be really interesting to see how larger models compare.
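To make the "sensors and alarms, clip or ablate" idea concrete, here is a toy sketch in plain Python. It is not how any particular tool does it, and every name in it (`feature_activation`, `alarm`, `clamp_feature`, `ablate_feature`) is made up for illustration. The big simplifying assumption is that a feature's activation is just the ReLU of the residual vector's projection onto a unit-length decoder direction; a real SAE has a learned encoder with its own weights and bias.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def feature_activation(resid, direction):
    """Toy 'sensor': ReLU of the projection onto the feature direction.
    ASSUMPTION: stands in for a real SAE encoder, which is learned."""
    return max(0.0, dot(resid, direction))

def alarm(resid, direction, threshold):
    """Toy 'alarm': fires when the feature exceeds a threshold."""
    return feature_activation(resid, direction) > threshold

def clamp_feature(resid, direction, max_act):
    """Clip the feature to at most max_act by subtracting the excess
    along the (unit) feature direction; leaves everything else alone."""
    excess = max(0.0, feature_activation(resid, direction) - max_act)
    return [r - excess * d for r, d in zip(resid, direction)]

def ablate_feature(resid, direction):
    """Ablation is just clipping the feature to zero."""
    return clamp_feature(resid, direction, 0.0)

# Hypothetical 2-d residual stream and feature direction.
direction = unit([3.0, 4.0])
resid = [5.0, 5.0]

if alarm(resid, direction, threshold=4.0):
    resid_clipped = clamp_feature(resid, direction, max_act=2.0)
    resid_ablated = ablate_feature(resid, direction)
```

In a real model these functions would run inside a hook on the residual stream at the chosen layer, and "wiring features together" would mean feeding one feature's activation into the clamp level or steering strength of another.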
I’d also love to start looking at jailbreaks with this, to see what features a jailbreak induces in the residual stream. I suspect finding the emotional/situational manipulation elements will be pretty easy. I’m more curious about jailbreaks that read like confusing nonsense to a human: will they show up as some form of confusion or noise, or are they also emotional/situational manipulation, just in a more noise-like adversarial format? That would be comparable to adversarial attacks on image classifiers, which just look like noise to the human eye but actually activate internal features of the vision model effectively.