A key claim here is that if you are actually able to explain a high fraction of the loss in a human-understandable way, you must have done something genuinely impressive, at least on non-algorithmic tasks. So even if you haven’t solved everything, you must have made a bunch of progress.
Right, I agree. I didn’t realise the bolded statement was a poor/misleading summary of the non-bolded text below. I guess it would be more accurate to say something like “[% of loss explained] is a good metric for tracking intellectual progress in interpretability. However, it is somewhat misleading in that 100% loss explained does not mean you understand what is going on inside the system.”
I rephrased that now. Would be curious to hear whether you still have objections to the updated phrasing.
Agreed.
That said, if you train an AI on some IID training dataset and then explain 99.9% of the loss with an explanation validated as fully corresponding (via something like causal scrubbing), then you probably understand almost all of the interesting stuff that SGD put into the model.
You might still die because you didn’t understand the key 0.1%, or because some stuff was put into the model other than via SGD (e.g. gradient hacking, or someone putting in a backdoor).
Typical stories of deceptive alignment imply that to explain 99.9% of loss with a truly human-understandable explanation, you’d probably have to explain the key AI machinery to a sufficient extent that you could tell whether the AI is deceptively aligned (as the AI is probably doing reasoning about this on a reasonably large fraction of inputs).
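(For concreteness, here is a minimal sketch of one common convention for turning a scrubbed/ablated model’s loss into a “% of loss explained” number. The function, argument names, and choice of baseline are illustrative assumptions on my part, not something specified in this exchange.)

```python
# Illustrative sketch only: one common convention for "% of loss explained",
# where an interpretability hypothesis is tested by rewriting/ablating the model
# (e.g. with causal-scrubbing-style resampling ablations) so it can only use the
# structure the explanation claims matters.

def percent_loss_explained(loss_clean: float,
                           loss_scrubbed: float,
                           loss_baseline: float) -> float:
    """Fraction of the clean-vs-baseline loss gap preserved by the explanation.

    loss_clean:    loss of the unmodified model
    loss_scrubbed: loss of the model after scrubbing according to the hypothesis
    loss_baseline: loss of a "no explanation" baseline (e.g. the relevant parts fully ablated)

    Returns 1.0 if the scrubbed model matches the original loss and 0.0 if it does
    no better than the baseline. Note that 100% on this metric does not by itself
    mean you understand everything going on inside the model.
    """
    return (loss_baseline - loss_scrubbed) / (loss_baseline - loss_clean)


# Example with made-up numbers: clean loss 2.30, fully-ablated baseline 5.00,
# scrubbed loss 2.33 -> roughly 98.9% of the loss gap "explained".
print(percent_loss_explained(2.30, 2.33, 5.00))
```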