Train an autoregressive network to predict the monitored model's activations; if its predictions deviate too far from the actual activations, raise a warning.
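A minimal sketch of this monitor, assuming PyTorch; here "autoregressive" is taken to mean predicting each layer's activations from the previous layer's, `layer_acts` stands for per-layer activation tensors collected via hooks, and the threshold is assumed to be calibrated on a safe distribution (all names are illustrative):

```python
import torch
import torch.nn as nn

class ActivationPredictor(nn.Module):
    """Predicts layer t+1 activations from layer t activations."""
    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim)
        )

    def forward(self, act: torch.Tensor) -> torch.Tensor:
        return self.net(act)

def activation_warning(predictor: ActivationPredictor,
                       layer_acts: list[torch.Tensor],
                       threshold: float) -> bool:
    """Warn if any layer transition is predicted worse than on safe data."""
    with torch.no_grad():
        for act, act_next in zip(layer_acts[:-1], layer_acts[1:]):
            err = (predictor(act) - act_next).pow(2).mean().item()
            if err > threshold:
                return True  # activations left the regime the predictor fits
    return False
```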
Slice the network into sub-networks, distill each slice into a separate model, and send a warning if the slice's true behavior on some inputs deviates too far from its distillation.
The models the sub-networks are distilled into should be less expressive than, and have different inductive biases from, the original network. Obviously the distillation should also see nothing beyond the input-output behavior of those sub-networks.
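A minimal sketch of the distillation monitor, assuming PyTorch; `sub_network` stands for one slice of the monitored model treated as a black box, and the shallow-MLP student, sizes, and threshold are illustrative assumptions:

```python
import torch
import torch.nn as nn

def distill_slice(sub_network: nn.Module, safe_inputs: torch.Tensor,
                  dim: int, steps: int = 1000) -> nn.Module:
    """Distill one slice of the network into a shallow MLP student.

    The student is deliberately less expressive and has different
    inductive biases than the transformer slice it imitates; it only
    ever sees the slice's input-output behavior, never its internals.
    """
    student = nn.Sequential(nn.Linear(dim, 64), nn.Tanh(), nn.Linear(64, dim))
    opt = torch.optim.Adam(student.parameters(), lr=1e-3)
    with torch.no_grad():
        targets = sub_network(safe_inputs)  # black-box I/O access only
    for _ in range(steps):
        loss = nn.functional.mse_loss(student(safe_inputs), targets)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return student

def distillation_warning(sub_network: nn.Module, student: nn.Module,
                         x: torch.Tensor, threshold: float) -> bool:
    """Warn if the slice's true behavior deviates from its distillation."""
    with torch.no_grad():
        gap = nn.functional.mse_loss(student(x), sub_network(x)).item()
    return gap > threshold
```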
Train a model to just predict the word saliency of the original transformer on a safe distribution; then if the true word saliency deviates too much from the prediction, throw a warning.
This can be done at different levels too: for example, also try to predict the saliency of the first-layer residual stream with respect to the output.
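A minimal sketch of the saliency monitor, assuming PyTorch, a `model` that maps token embeddings to output scores, and a hypothetical `predictor` trained on safe-distribution (input, saliency) pairs; the same recipe applies at other levels by differentiating with respect to, e.g., the first-layer residual stream instead of the embeddings:

```python
import torch

def word_saliency(model, embeddings: torch.Tensor) -> torch.Tensor:
    """Gradient-norm saliency of the output w.r.t. each token embedding."""
    embeddings = embeddings.clone().detach().requires_grad_(True)
    model(embeddings).sum().backward()
    return embeddings.grad.norm(dim=-1)  # one score per token

def saliency_warning(predictor, model, embeddings: torch.Tensor,
                     threshold: float) -> bool:
    """Warn if true saliency drifts from the safe-distribution prediction."""
    true_sal = word_saliency(model, embeddings)
    with torch.no_grad():
        pred_sal = predictor(embeddings)
    return (true_sal - pred_sal).abs().mean().item() > threshold
```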
Instead of training a neural network as the saliency predictor, we can also use simple interpolation based on the backprop graph and the safe-distribution inputs.
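One way to read this, sketched below: store safe-distribution inputs together with their backprop-derived saliency maps, then predict the saliency of a new input as a distance-weighted average over its nearest safe neighbors (shapes, the distance metric, and the choice of k are all illustrative assumptions):

```python
import torch

def interpolated_saliency(x: torch.Tensor, safe_inputs: torch.Tensor,
                          safe_saliencies: torch.Tensor,
                          k: int = 5) -> torch.Tensor:
    """Distance-weighted average of the k nearest safe saliency maps.

    safe_inputs: (N, ...) safe-distribution inputs.
    safe_saliencies: (N, T) their backprop-derived saliency maps.
    """
    dists = (safe_inputs - x).flatten(1).norm(dim=1)   # (N,) distances to x
    nearest = dists.topk(k, largest=False)             # k closest safe inputs
    weights = torch.softmax(-nearest.values, dim=0)    # closer -> heavier
    return (weights[:, None] * safe_saliencies[nearest.indices]).sum(0)

# Usage: compare against the true saliency from the backprop graph,
# e.g. word_saliency(model, x) from the sketch above, and warn when
# the gap exceeds a threshold calibrated on held-out safe inputs.
```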