If, instead of using interpretability tools in the loss function, we merely use them as a ‘validation set’ rather than the training set (i.e. as a ‘mulligan’), we might have a better chance of catching dangerous cognition before it gets out of hand, so that we can terminate the model and start over. We’re therefore still using interpretability in model selection, but the feedback loop is much looser, so it would be harder to Goodhart.
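To make the contrast concrete, here is a minimal sketch of the ‘mulligan’ loop: the interpretability tool acts only as a post-hoc filter on fully trained models, never as a term in the training loss. All function names here are illustrative placeholders, not a real API.

```python
def train_model(seed):
    # Stand-in for a full training run: SGD on the ordinary task loss
    # only, with no interpretability term anywhere in the objective.
    return {"seed": seed}

def passes_interpretability_filter(model):
    # Stand-in for running interpretability tools over the trained model
    # and flagging dangerous cognition. For illustration, models with
    # seed >= 2 are deemed to pass.
    return model["seed"] >= 2

def train_with_mulligan(max_attempts=5):
    # Train, inspect, and discard-and-restart until a model passes.
    # Crucially, the filter's verdict never feeds back into the loss;
    # the only optimisation pressure it exerts is via which runs we keep.
    for seed in range(max_attempts):
        model = train_model(seed)
        if passes_interpretability_filter(model):
            return model
    return None  # every attempt was rejected

model = train_with_mulligan()
```

The point of the structure is that each restart applies one coarse bit of selection pressure per training run, rather than a gradient signal at every step, which is why Goodharting the filter is slower this way.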
While using the interpretability-tool-based filter only for model selection applies much weaker optimisation pressure than using it in the loss function, and hence makes Goodharting harder and slower, it’s not clear that this would solve the problem in the long run. If the filter captures everything we currently know to capture, and we gain no new insights during the iterated process of model training and model selection, then it’s possible we’ll eventually Goodhart the model-selection process in the same way that SGD would Goodhart the interpretability tool in the loss function.
I think it’s likely that we would gain more insights, or at least more time, by using the interpretability tool as a mulligan, and it’s possible that the way we as AI builders optimise for producing a model that passes the interpretability filters is qualitatively different from the way SGD (or whatever training algorithm is used) would optimise an interpretability-filter loss function. However, in the spirit of paranoia/security mindset, it’s worth pointing out that using the tool as a model-selection filter doesn’t guarantee that an AGI that passes the filter is safer than one trained with the interpretability tool as a training signal, in the limit of iterating to pass the model-selection filter.
I agree