Jan Wehner comments on Activation Engineering Theories of Impact

Jan Wehner 19 Jul 2024 9:08 UTC
Thanks for writing this, I think it’s great to spell out the ToI behind this research direction!
You touch on this, but I wanted to make it explicit: Activation Engineering can also be used to detect when a system is “thinking” about some dangerous concept. If you have a steering vector for e.g. honesty, you can measure its similarity (for instance, cosine similarity) with the activations during a forward pass to get a signal for whether the system is being dishonest or not.
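To make the detection idea concrete, here is a minimal sketch of what I have in mind. Everything in it is an assumption for illustration: the vectors are toy numbers rather than real model activations, the names `cosine_similarity`, `flag_tokens` and `honesty_vector` are hypothetical, and cosine similarity with a fixed threshold is just one possible choice of similarity measure and decision rule.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as having no alignment
    return dot / (norm_u * norm_v)


def flag_tokens(activations, steering_vector, threshold):
    """Score each token's activation against the steering vector.

    Returns (scores, flags), where a flag is True when the similarity
    falls below the threshold (an assumed, hand-picked cutoff).
    """
    scores = [cosine_similarity(act, steering_vector) for act in activations]
    flags = [score < threshold for score in scores]
    return scores, flags


# Hypothetical "honesty" direction and toy per-token activations.
honesty_vector = [1.0, 0.0, 0.0]
activations = [
    [2.0, 0.0, 0.0],   # points along the honesty direction
    [0.0, 3.0, 0.0],   # orthogonal to it
    [-1.0, 0.0, 0.0],  # points against it
]

scores, flags = flag_tokens(activations, honesty_vector, threshold=0.0)
print(scores, flags)
```

In a real setting the activations would come from a chosen layer of the model, and the threshold would need to be calibrated on labelled examples rather than set by hand.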
You might also be interested in my (less thorough) summary of the ToIs for Activation Engineering.