Thanks for writing this! I think it’s great to spell out the ToI behind this research direction.
You touch on this, but I wanted to make it explicit: Activation Engineering can also be used for detecting when a system is “thinking” about some dangerous concept. If you have a steering vector for, e.g., honesty, you can measure its similarity (for instance, cosine similarity) to the activations during a forward pass to estimate whether the system is being dishonest.
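To make the detection idea concrete, here is a minimal sketch in plain Python. It assumes you have already extracted a steering vector and per-token activations from the same layer; the names `honesty_vector`, `concept_scores`, `flag_positions`, the toy numbers, and the threshold are all hypothetical and only for illustration. Cosine similarity is one possible choice of similarity measure, not necessarily the best one.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def concept_scores(activations, steering_vector):
    """Return one similarity score per token position."""
    return [cosine_similarity(act, steering_vector) for act in activations]


def flag_positions(scores, threshold):
    """Indices of positions whose score falls strictly below the threshold."""
    return [i for i, score in enumerate(scores) if score < threshold]


# Toy data (assumption): 3 token positions, hidden size 3.
honesty_vector = [1.0, 0.0, 0.0]
activations = [
    [2.0, 0.1, 0.0],   # points along the honesty direction
    [-1.5, 0.2, 0.1],  # points against it
    [0.0, 1.0, 0.0],   # orthogonal to it
]

scores = concept_scores(activations, honesty_vector)
flagged = flag_positions(scores, threshold=0.0)  # threshold is an arbitrary choice
print(scores, flagged)
```

In practice the activations would come from a hook on the model's residual stream, and the threshold would need calibrating on labeled examples. A low score is only a heuristic signal, not proof of dishonesty.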
You might also be interested in my (less thorough) summary of the ToIs for Activation Engineering.