[I was] skeptical of interpretability tools being something that could be directly used in a training process without the resulting optimisation pressure breaking the tool, while other people had the reverse view.
I’ve been thinking about this in the context of M.I. approaches to lie detection (some of my thoughts on that are captured in this comment thread), and I share your intuition. It seems to me that you could straightforwardly include a loss term based on the presence or absence of certain features in the activation patterns, creating optimization pressure both away from whatever characteristic those features (initially) represent and toward representing that characteristic differently. Would you agree?
If that’s true, then to the extent that some researchers (irresponsibly?) incorporate MI-based lie detection during training, it’ll create pressure both away from lying and also toward lying in illegible ways. Over time, that seems likely to cause problems for any such lie detection approach.
In any case, the idea that you can train for the presence or absence of certain activation patterns seems very testable, and I’ll try testing it unless you’re aware of existing research that’s done so.
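Concretely, the kind of auxiliary loss I have in mind would look something like the sketch below (PyTorch-ish; `probe_direction`, `lambda_feat`, and the choice of layer are all illustrative assumptions on my part, not anything taken from existing work):

```python
import torch

def feature_suppression_loss(hidden_states: torch.Tensor,
                             probe_direction: torch.Tensor) -> torch.Tensor:
    # hidden_states: (batch, seq, d_model) activations from the monitored layer.
    # probe_direction: (d_model,) unit vector thought (e.g. via a linear probe)
    # to encode the feature we want to train away.
    projection = hidden_states @ probe_direction  # (batch, seq)
    # Penalizing the squared projection pushes the model either to stop
    # representing the feature, or to represent it somewhere the probe can't see.
    return projection.pow(2).mean()

# Inside the usual training step, something like:
#   total_loss = task_loss + lambda_feat * feature_suppression_loss(hidden_states, probe_direction)
```

The worry in the previous paragraph is exactly that gradient descent is free to satisfy this term via the second route (relocating the feature) rather than the first (removing the behavior).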
Stumbled across this while searching for comparable research: kind of an interesting paper on runtime monitoring from 2018, Runtime Monitoring Neuron Activation Patterns. Abstract:
For using neural networks in safety critical domains, it is important to know if a decision made by a neural network is supported by prior similarities in training. We propose runtime neuron activation pattern monitoring—after the standard training process, one creates a monitor by feeding the training data to the network again in order to store the neuron activation patterns in abstract form. In operation, a classification decision over an input is further supplemented by examining if a pattern similar (measured by Hamming distance) to the generated pattern is contained in the monitor. If the monitor does not contain any pattern similar to the generated pattern, it raises a warning that the decision is not based on the training data. Our experiments show that, by adjusting the similarity-threshold for activation patterns, the monitors can report a significant portion of misclassifications to be not supported by training with a small false-positive rate, when evaluated on a test set.
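For my own notes, the mechanism as I read it is roughly the following (a sketch of my understanding, not the authors’ code; `get_layer_acts`, the threshold value, and the use of a plain Python set in place of the paper’s compact BDD storage are all placeholders):

```python
import torch

def binarize(activations: torch.Tensor) -> tuple:
    # Abstract an activation vector of the monitored layer into an on/off pattern.
    return tuple((activations > 0).int().tolist())

def build_monitor(get_layer_acts, training_inputs) -> set:
    # After training, feed the training data back through the network and
    # record the activation patterns of the monitored layer.
    return {binarize(get_layer_acts(x)) for x in training_inputs}

def hamming(a: tuple, b: tuple) -> int:
    return sum(x != y for x, y in zip(a, b))

def supported_by_training(get_layer_acts, x, monitor: set, threshold: int = 2) -> bool:
    # At runtime, flag the decision (return False) if no stored pattern is
    # within the Hamming threshold, i.e. it isn't backed by the training data.
    pattern = binarize(get_layer_acts(x))
    return min(hamming(pattern, p) for p in monitor) <= threshold
```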
(my guess after a quick initial read is that this may only be useful for narrow AI; ‘significantly extrapolate from what it learns (or remembers) from the training data, as similar data has not appeared in the training process’ seems like a normal part of operations for more general systems like LLMs)
Oh, there’s also a good new paper on this topic from Samuel Marks & Max Tegmark, ‘The Geometry of Truth’!