Suppose I have a slow process I trust that I use to provide sparse ground truth for my system (like a very extensive human evaluation). But day-to-day I need to use my ML system because it’s much cheaper. I’m concerned that it may take some catastrophically bad actions at test time because it thinks that it can take over. But if I can flag those as anomalous, then I can invoke my slow oversight process, include the datapoint in training data, update my model to be less likely to try to take a treacherous turn, and then continue. If my model learns quickly then I won’t have to do this very many times before it stops trying to take a treacherous turn.
Suppose I have a slow process I trust that I use to provide sparse ground truth for my system (like a very extensive human evaluation). But day-to-day I need to use my ML system because it’s much cheaper. I’m concerned that it may take some catastrophically bad actions at test time because it thinks that it can take over. But if I can flag those as anomalous, then I can invoke my slow oversight process, include the datapoint in training data, update my model to be less likely to try to take a treacherous turn, and then continue. If my model learns quickly then I won’t have to do this very many times before it stops trying to take a treacherous turn.