Train an autoregressive network to predict the monitored model's activations; if its predictions deviate too far from the actual activations, raise a warning.
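A minimal sketch of this monitor, assuming PyTorch; here "autoregressive" is taken to mean predicting each layer's activations from the previous layer's, `layer_acts` stands for per-layer activation tensors collected via hooks, and the threshold is assumed to be calibrated on a safe distribution (all names are illustrative):

```python
import torch
import torch.nn as nn

class ActivationPredictor(nn.Module):
    """Predicts layer t+1 activations from layer t activations."""
    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim)
        )

    def forward(self, act: torch.Tensor) -> torch.Tensor:
        return self.net(act)

def activation_warning(predictor: ActivationPredictor,
                       layer_acts: list[torch.Tensor],
                       threshold: float) -> bool:
    """Warn if any layer transition is predicted worse than on safe data."""
    with torch.no_grad():
        for act, act_next in zip(layer_acts[:-1], layer_acts[1:]):
            err = (predictor(act) - act_next).pow(2).mean().item()
            if err > threshold:
                return True  # activations left the regime the predictor fits
    return False
```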
Slice the network into sub-networks, distill each slice into a separate model, and send a warning if the slice's true behavior on some inputs deviates too far from its distillation.
The models the sub-networks are distilled into should be less expressive than, and have different inductive biases from, the original network. Obviously the distillation should also see nothing beyond the input-output behavior of those sub-networks.
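A minimal sketch of the distillation monitor, assuming PyTorch; `sub_network` stands for one slice of the monitored model treated as a black box, and the shallow-MLP student, sizes, and threshold are illustrative assumptions:

```python
import torch
import torch.nn as nn

def distill_slice(sub_network: nn.Module, safe_inputs: torch.Tensor,
                  dim: int, steps: int = 1000) -> nn.Module:
    """Distill one slice of the network into a shallow MLP student.

    The student is deliberately less expressive and has different
    inductive biases than the transformer slice it imitates; it only
    ever sees the slice's input-output behavior, never its internals.
    """
    student = nn.Sequential(nn.Linear(dim, 64), nn.Tanh(), nn.Linear(64, dim))
    opt = torch.optim.Adam(student.parameters(), lr=1e-3)
    with torch.no_grad():
        targets = sub_network(safe_inputs)  # black-box I/O access only
    for _ in range(steps):
        loss = nn.functional.mse_loss(student(safe_inputs), targets)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return student

def distillation_warning(sub_network: nn.Module, student: nn.Module,
                         x: torch.Tensor, threshold: float) -> bool:
    """Warn if the slice's true behavior deviates from its distillation."""
    with torch.no_grad():
        gap = nn.functional.mse_loss(student(x), sub_network(x)).item()
    return gap > threshold
```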
Train a model to just predict the word saliency of the original transformer on a safe distribution; then if the true word saliency deviates too much from the prediction, throw a warning.
This can be done at different levels too: for example, also try to predict the saliency of the first-layer residual stream with respect to the output.
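A minimal sketch of the saliency monitor, assuming PyTorch, a `model` that maps token embeddings to output scores, and a hypothetical `predictor` trained on safe-distribution (input, saliency) pairs; the same recipe applies at other levels by differentiating with respect to, e.g., the first-layer residual stream instead of the embeddings:

```python
import torch

def word_saliency(model, embeddings: torch.Tensor) -> torch.Tensor:
    """Gradient-norm saliency of the output w.r.t. each token embedding."""
    embeddings = embeddings.clone().detach().requires_grad_(True)
    model(embeddings).sum().backward()
    return embeddings.grad.norm(dim=-1)  # one score per token

def saliency_warning(predictor, model, embeddings: torch.Tensor,
                     threshold: float) -> bool:
    """Warn if true saliency drifts from the safe-distribution prediction."""
    true_sal = word_saliency(model, embeddings)
    with torch.no_grad():
        pred_sal = predictor(embeddings)
    return (true_sal - pred_sal).abs().mean().item() > threshold
```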
Instead of training a neural network as the saliency predictor, we can also use simple interpolation based on the backprop graph and the safe-distribution inputs.
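One way to read this, sketched below: store safe-distribution inputs together with their backprop-derived saliency maps, then predict the saliency of a new input as a distance-weighted average over its nearest safe neighbors (shapes, the distance metric, and the choice of k are all illustrative assumptions):

```python
import torch

def interpolated_saliency(x: torch.Tensor, safe_inputs: torch.Tensor,
                          safe_saliencies: torch.Tensor,
                          k: int = 5) -> torch.Tensor:
    """Distance-weighted average of the k nearest safe saliency maps.

    safe_inputs: (N, ...) safe-distribution inputs.
    safe_saliencies: (N, T) their backprop-derived saliency maps.
    """
    dists = (safe_inputs - x).flatten(1).norm(dim=1)   # (N,) distances to x
    nearest = dists.topk(k, largest=False)             # k closest safe inputs
    weights = torch.softmax(-nearest.values, dim=0)    # closer -> heavier
    return (weights[:, None] * safe_saliencies[nearest.indices]).sum(0)

# Usage: compare against the true saliency from the backprop graph,
# e.g. word_saliency(model, x) from the sketch above, and warn when
# the gap exceeds a threshold calibrated on held-out safe inputs.
```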