it has to, because the large-scale physical systems being modeled by the learned system are not coherent enough to be formally verified at the moment.
but I think we’re going to find there are key kinds of interpretability that are synergistic with capability, not detractors from it. sparsity is critical for speedup, and sparsity is key to alignment. disentanglement of factors improves both the model itself and our ability to understand it; better models have larger internal margins between decision boundaries, and are therefore more verifiable.
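to make the margin-to-verifiability link concrete, here’s a minimal toy sketch, assuming the simplest possible case of a linear classifier (the function and variable names are mine, not from the comment): the distance from an input to the decision boundary is |w·x + b| / ||w||, so a larger margin directly certifies a larger ball of inputs on which the decision provably cannot flip.

```python
import numpy as np

def certified_radius(w: np.ndarray, b: float, x: np.ndarray) -> float:
    """L2 distance from x to the hyperplane w.x + b = 0.

    any perturbation of x smaller than this radius provably leaves
    the sign of the classifier's output unchanged."""
    return abs(w @ x + b) / np.linalg.norm(w)

# toy example: a wider margin at x yields a larger verified region
w = np.array([2.0, -1.0])
b = 0.5
x = np.array([1.0, 1.0])
print(certified_radius(w, b, x))
```

for deep networks the same idea survives in hedged form (e.g. margin divided by a Lipschitz bound on the network), which is why better-separated models tend to be cheaper to verify.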
I definitely agree with most of the thrust of the insights from the thought experiment where it’s just impossible to error-check efficiently, but I think we can actually expect to be able to do quite a lot. the hard part is how to incrementalize verification so that each increment tells us something about the thing we actually want to know: the distribution of { distances from outcomes harmful to the preferences of other beings } over all { actions a model outputs }.
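as a hedged sketch of what estimating that quantity could even look like: `model_actions` and `distance_to_harm` below are hypothetical placeholders (a sampler of actions and a distance-to-harm oracle, neither of which exists as stated), and the point is only that incremental verification would mean tightening bounds on the low quantiles of this distribution, not just its mean.

```python
import numpy as np

def harm_distance_distribution(model_actions, distance_to_harm, n_samples=1000):
    """sample actions and summarize the empirical distribution of their
    distances to the nearest harmful outcome.

    `model_actions(n)` is a hypothetical sampler of n actions;
    `distance_to_harm(a)` is a hypothetical oracle for the distance
    from action a to the nearest harmful outcome."""
    distances = np.array([distance_to_harm(a) for a in model_actions(n_samples)])
    return {
        "min": distances.min(),               # the tail is what safety cares about
        "p01": np.quantile(distances, 0.01),
        "p50": np.quantile(distances, 0.50),
    }

# toy usage with stand-in functions, purely to show the shape of the computation
rng = np.random.default_rng(0)
stats = harm_distance_distribution(
    model_actions=lambda n: rng.normal(size=(n, 4)),      # stand-in action sampler
    distance_to_harm=lambda a: float(np.linalg.norm(a)),  # stand-in distance oracle
)
print(stats)
```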