If we are able to flag a treacherous turn as cognitively anomalous, then we can take that opportunity to shut down a system and retrain on the offending datapoint.
What do you mean by “retrain on the offending datapoint”? I would be worried about Goodharting on this by selecting for systems which don’t set off the anomaly detector, and thereby making it a less reliable safeguard.
Suppose I have a slow process I trust that I use to provide sparse ground truth for my system (like a very extensive human evaluation). But day-to-day I need to use my ML system because it’s much cheaper. I’m concerned that it may take some catastrophically bad actions at test time because it thinks that it can take over. But if I can flag those as anomalous, then I can invoke my slow oversight process, include the datapoint in training data, update my model to be less likely to try to take a treacherous turn, and then continue. If my model learns quickly then I won’t have to do this very many times before it stops trying to take a treacherous turn.
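To make that loop concrete, here is a minimal sketch of the kind of deployment procedure I have in mind. Everything in it is a hypothetical placeholder rather than an existing API: `model`, `anomaly_detector`, `slow_trusted_evaluation`, and the threshold are just stand-ins for the cheap ML system, the cognitive-anomaly detector, and the slow trusted oversight process described above.

```python
# Illustrative sketch only: model, anomaly_detector, and
# slow_trusted_evaluation are hypothetical placeholders, not a real API.

def run_with_anomaly_gate(model, anomaly_detector, slow_trusted_evaluation,
                          input_stream, anomaly_threshold=0.99):
    """Serve a cheap ML policy day-to-day, falling back to a slow trusted
    oversight process whenever a proposed action looks cognitively anomalous."""
    flagged_data = []  # datapoints that triggered the detector

    for x in input_stream:
        action = model.propose_action(x)

        # Score how anomalous the model's cognition was on this input.
        score = anomaly_detector.score(model, x, action)

        if score > anomaly_threshold:
            # Pause and invoke the expensive, trusted process
            # (e.g. an extensive human evaluation) to get ground truth.
            label = slow_trusted_evaluation(x, action)
            flagged_data.append((x, label))

            # Update the model to be less likely to try this again.
            model.train_step(flagged_data)

            # Act on the trusted label rather than the flagged action.
            yield label
        else:
            # Nothing anomalous: execute the cheap model's action.
            yield action
```

If the model learns quickly, the expensive branch fires only a handful of times before the system stops attempting a treacherous turn, so the average cost stays close to that of the cheap model alone.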