Super thoughtful post!
I get the feeling that I’m more optimistic than you about post-hoc interpretability approaches working well for advanced AIs. I’m referring to the ability of one advanced AI, in the form of a very large neural-network-based agent, to take another very large neural-network-based agent and successfully verify its commitments. I think this is at least somewhat likely to work by default (i.e. scrutinizing an advanced neural-network-based AI may be easier than obfuscating its intentions), and it may not require much information about the training method or the training data.
I previously thought this didn’t matter in practice because of the possibility of self-modification and successor agents. But I now think that, at least in some range of situations, verifying the behavior of a neural network is enough for credible commitment, provided the agent pre-commits to using that specific network, e.g. via a blockchain.
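To make the pre-commitment step concrete, here is a minimal sketch (in Python, with hypothetical helper names; the post doesn’t specify any particular mechanism) of one way an agent could bind itself to a fixed set of weights: publish a hash of the serialized weights, e.g. in an on-chain transaction, so a counterparty can later check that the network the agent actually runs matches the commitment. The interpretability work would then be done on the committed weights themselves.

```python
import hashlib
import numpy as np

def commit_to_weights(weights: list[np.ndarray]) -> str:
    """Produce a commitment (SHA-256 hex digest) over serialized weights.

    The digest is what the agent would publish (e.g. on a blockchain)
    before interacting with the counterparty.
    """
    h = hashlib.sha256()
    for w in weights:
        # Fix dtype and memory layout so the hash is reproducible across machines.
        h.update(np.ascontiguousarray(w, dtype=np.float32).tobytes())
    return h.hexdigest()

def verify_commitment(weights: list[np.ndarray], published_digest: str) -> bool:
    """Check that the weights the agent actually runs match the published commitment."""
    return commit_to_weights(weights) == published_digest

# Toy usage: the committing agent publishes the digest; the verifier recomputes it.
weights = [np.random.randn(4, 4), np.random.randn(4)]
digest = commit_to_weights(weights)        # published on-chain
assert verify_commitment(weights, digest)  # counterparty's check
```

Of course, the hash only pins down *which* weights are being run; figuring out *what those weights will do* is still the interpretability problem discussed above.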
Also, are you sure that the fact that people can’t simulate nematodes fits well into this argument? I may well be mistaken, but my understanding is that we don’t actually have the synaptic weights for nematode nervous systems, only the wiring diagram (the connectome). If so, it seems natural that we can’t run forward passes.