Agreed.
That said, if you train an AI on some IID training dataset and then explain 99.9% of the loss, with the explanation validated as fully corresponding to the model (via something like causal scrubbing), then you probably understand almost all the interesting stuff that SGD put into the model.
You still might die because you didn’t understand the key 0.1%, or because some stuff was put into the model other than via SGD (e.g. gradient hacking, or someone inserting a backdoor).
Typical stories of deceptive alignment imply that to explain 99.9% of the loss with a truly human-understandable explanation, you’d probably have to explain the key AI machinery to a sufficient extent that you can tell whether the AI is deceptively aligned (as the AI is probably doing reasoning about this on a reasonably large fraction of inputs).
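To be concrete about what I mean by "explain 99.9% of the loss", here's a rough sketch of the kind of metric I have in mind. The function name, signature, and numbers are all made up for illustration, and causal scrubbing's actual setup differs in its details; this is just the "fraction of loss recovered by the scrubbed model" idea, not Redwood's actual code or API.

```python
# Sketch of a "fraction of loss explained/recovered" metric, assuming we can
# measure three validation losses: the unmodified model, the model after the
# scrubbing interventions licensed by the explanation, and an uninformative
# baseline (e.g. fully resampled activations). Names here are hypothetical.

def fraction_of_loss_recovered(loss_model: float,
                               loss_scrubbed: float,
                               loss_baseline: float) -> float:
    """Fraction of the gap between the uninformative baseline and the full
    model that the scrubbed model still closes.

    loss_model:    loss of the unmodified model on the validation set
    loss_scrubbed: loss after interventions the explanation claims are harmless
    loss_baseline: loss of the uninformative baseline
    """
    return (loss_baseline - loss_scrubbed) / (loss_baseline - loss_model)


# Made-up numbers: if scrubbing barely moves the loss relative to the baseline
# gap, the explanation "explains" ~99.9% of the loss.
print(fraction_of_loss_recovered(loss_model=1.000,
                                 loss_scrubbed=1.002,
                                 loss_baseline=3.000))  # ~0.999
```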