If aligning messy models turns out to be too hard, don’t build messy models.
One of the big advantages we (or Clippy) have when trying to figure out alignment is that we are not trying to align a fixed AI design, nor are we even trying to align it to a fixed representation of our goals. We’re just trying to make things go well in the broad sense.
It’s easy for there to be specific architectures that are too messy to align, or specific goals that are too hard to teach an AI. But it’s hugely implausible to me that all ways of making things go well are too hard.
“Messy models” is not a binary yes/no option, though—there’s a spectrum of how interpretable different possible successors are. If you are only willing to use highly interpretable models, that’s a sort of tax you have to pay, the exact value of which depends a lot on the details of the models and strategic situation. What I’m claiming is that this tax, as a fraction of total resources, might remain bounded away from zero for a long subjective time.
It has to, because the large-scale physical systems being modeled by the learned system are not coherent enough to be formally verified at the moment.
But I think we’re going to find there are key kinds of interpretability that are synergistic with capability, not detractors from it. Sparsity is incredibly critical for speedup, and sparsity is key to alignment. Disentanglement of factors improves both the model and our ability to understand it; better models have larger internal margins between decision boundaries, and are therefore more verifiable.
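To make the two concrete claims here more tangible, here is a minimal toy sketch, not taken from the comment itself: magnitude pruning as one way sparsity cuts compute while leaving fewer connections to inspect, and, for a plain linear classifier, the standard fact that the distance to the decision boundary |w·x + b| / ‖w‖ is a certified radius within which the decision provably cannot flip. The functions and numbers are illustrative assumptions, not anyone's actual proposal.

```python
# Toy illustration of two claimed synergies between capability and verifiability.
import numpy as np

def magnitude_prune(W: np.ndarray, keep_fraction: float) -> np.ndarray:
    """Zero out all but the largest-magnitude weights (simple sparsification)."""
    k = max(1, int(keep_fraction * W.size))
    threshold = np.sort(np.abs(W), axis=None)[-k]
    return np.where(np.abs(W) >= threshold, W, 0.0)

def certified_radius(w: np.ndarray, b: float, x: np.ndarray) -> float:
    """L2 distance from x to the hyperplane w.x + b = 0.
    No input perturbation smaller than this radius can change the sign,
    so a larger margin directly means a larger verifiable region."""
    return abs(float(w @ x) + b) / np.linalg.norm(w)

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))
W_sparse = magnitude_prune(W, keep_fraction=0.25)
print("nonzero weights:", np.count_nonzero(W_sparse), "of", W.size)

w, b = rng.normal(size=8), 0.1
x = rng.normal(size=8)
print("certified radius around x:", certified_radius(w, b, x))
```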
I definitely agree with most of the thrust of the insights from the thought experiment where it’s just impossible to error-check efficiently, but I think we can actually expect to be able to do quite a lot. The hard part is how to incrementalize verification in a way that still tells us, incrementally, about the thing we actually want to know: the distribution of { distance from outcomes harmful to the preferences of other beings } over { actions the model outputs }.
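As a purely hypothetical sketch of the quantity being pointed at, under heavy assumptions: `sample_action` stands in for whatever policy we are evaluating, and `harm_distance` stands in for some estimator, which does not currently exist, of how far an action’s outcome lies from outcomes that harm other beings’ preferences. The point is only that incremental verification would need to bound the low tail of this empirical distribution, not that this code measures anything real.

```python
# Hypothetical estimate of the distribution of "distance from a harmful outcome"
# across sampled actions. Both functions below are placeholders for illustration.
import numpy as np

def sample_action(rng: np.random.Generator) -> np.ndarray:
    """Placeholder: draw one action from the model under evaluation."""
    return rng.normal(size=4)

def harm_distance(action: np.ndarray) -> float:
    """Placeholder: estimated distance from this action's outcome to the
    nearest outcome harmful to others' preferences (larger = safer)."""
    return float(np.linalg.norm(action))  # stand-in metric only

def harm_distance_distribution(n_samples: int = 10_000, seed: int = 0) -> dict:
    rng = np.random.default_rng(seed)
    distances = np.array([harm_distance(sample_action(rng)) for _ in range(n_samples)])
    # The tail quantiles are what an incremental verifier would need to bound:
    # how close does the worst sampled action come to causing harm?
    return {
        "min": float(distances.min()),
        "p01": float(np.quantile(distances, 0.01)),
        "median": float(np.median(distances)),
    }

print(harm_distance_distribution())
```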