I think this is a neat idea worth experimenting with. If I’m understanding your proposal, there’d need to be some sort of ‘suspiciousness’ signal on the training data to train the ‘suspiciousness-detecting’ head on, and I think it could be hard to get such training data.
Whereas training a ‘confidence’ head seems like an easier problem: you can have a model make a bunch of short-term predictions, grade those predictions once they resolve, and then use the resulting labelled data to train a ‘confidence’ head (a rough sketch is below). Ideally, these would be more interesting predictions than simply ‘what token comes next’, but even that is better than nothing.
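For concreteness, here’s a minimal sketch of what that training loop could look like. This is PyTorch-style pseudocode under my own assumptions, not your proposal verbatim: `base_model`, `grade`, and the head architecture are all hypothetical stand-ins.

```python
# Minimal sketch (all names hypothetical): train a 'confidence' head on a
# frozen base model, using graded short-term predictions as binary labels.
import torch
import torch.nn as nn

class ConfidenceHead(nn.Module):
    def __init__(self, hidden_dim: int):
        super().__init__()
        # Small probe mapping the model's hidden state to a score in [0, 1].
        self.probe = nn.Sequential(nn.Linear(hidden_dim, 1), nn.Sigmoid())

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return self.probe(hidden_states).squeeze(-1)

# Assumed available: base_model(inputs) -> (predictions, hidden_states), and
# grade(prediction, outcome) -> 1.0 if the prediction resolved true, else 0.0.
def train_step(confidence_head, optimizer, base_model, inputs, outcomes, grade):
    with torch.no_grad():  # base model stays frozen; only the head is trained
        predictions, hidden_states = base_model(inputs)
    labels = torch.tensor([grade(p, o) for p, o in zip(predictions, outcomes)])
    confidence = confidence_head(hidden_states)
    loss = nn.functional.binary_cross_entropy(confidence, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The point of freezing the base model here is that the head only learns to read off whatever calibration signal is already present in the activations, rather than reshaping the model itself; whether that signal is rich enough is exactly the empirical question.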