Stumbled over this while doing some searching on comparable research: kind of an interesting paper on runtime monitoring from 2018, Runtime Monitoring Neuron Activation Patterns. Abstract:
For using neural networks in safety critical domains, it is important to know if a decision made by a neural network is supported by prior similarities in training. We propose runtime neuron activation pattern monitoring—after the standard training process, one creates a monitor by feeding the training data to the network again in order to store the neuron activation patterns in abstract form. In operation, a classification decision over an input is further supplemented by examining if a pattern similar (measured by Hamming distance) to the generated pattern is contained in the monitor. If the monitor does not contain any pattern similar to the generated pattern, it raises a warning that the decision is not based on the training data. Our experiments show that, by adjusting the similarity-threshold for activation patterns, the monitors can report a significant portion of misclassifications to be not supported by training with a small false-positive rate, when evaluated on a test set.
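For my own understanding, here's roughly how I read the mechanism (a minimal sketch, not the authors' code: I'm assuming the "pattern" is the on/off sign of one hidden layer's post-ReLU activations, and I'm using a plain set plus brute-force Hamming comparison where the paper stores patterns in a more compact abstract form):

```python
# Sketch of runtime neuron activation pattern monitoring, as I understand the
# abstract. Illustrative only; names and layer choice are my assumptions.
import numpy as np


def to_pattern(activations):
    """Abstract a real-valued activation vector into a binary on/off pattern."""
    return (activations > 0).astype(np.uint8)  # e.g. sign pattern after ReLU


class ActivationMonitor:
    def __init__(self, hamming_threshold=0):
        self.hamming_threshold = hamming_threshold
        self.patterns = set()          # exact patterns seen in training
        self._pattern_matrix = None    # same patterns as a matrix, for distances

    def build(self, training_activations):
        """Feed the training data through the net once and store its patterns."""
        pats = to_pattern(training_activations)            # shape (N, d)
        self.patterns = {p.tobytes() for p in pats}
        self._pattern_matrix = np.unique(pats, axis=0)

    def supported(self, activations):
        """True if some stored pattern is within the Hamming-distance threshold."""
        p = to_pattern(activations)                        # shape (d,)
        if p.tobytes() in self.patterns:                   # exact match, distance 0
            return True
        if self.hamming_threshold == 0:
            return False
        dists = np.sum(self._pattern_matrix != p, axis=1)  # Hamming distances
        return bool(dists.min() <= self.hamming_threshold)


# Usage (hypothetical): extract one hidden layer's activations yourself, then
#   monitor = ActivationMonitor(hamming_threshold=1)
#   monitor.build(hidden_activations_on_training_set)
#   if not monitor.supported(hidden_activations_for_new_input):
#       print("warning: decision not supported by training data")
```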
(my guess after a quick initial read is that this may only be useful to narrow AI; 'significantly extrapolate from what it learns (or remembers) from the training data, as similar data has not appeared in the training process' seems like a normal part of operations for more general systems like LLMs)