Human raters make systematic errors—regular, compactly describable, predictable errors.
This implies that it is possible, through another set of human or automated raters, to rate better. If the errors are predictable, you could train a model to predict them by comparing rater answers against a heavily scrutinized ground truth. You could then add this model’s predicted correction to the rater’s answer and get a correct label.
The whole problem with “Human raters make systematic errors” is that the same thing is likely to happen to the heavily scrutinized ground truth. If you have a way of creating a correct ground truth that avoids this problem, you don’t need the second model; you can just use that ground truth as the dataset for the first model.
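
To make the argument concrete, here is a minimal sketch on synthetic data. Everything in it is an assumption for illustration: the linear form of the rater error, the single feature, the coefficients, and the helper names `fit_error_model` and `correct`. The error model is fit to (reference minus rater), so adding its prediction to the rater's answer yields the corrected label. With a truly correct reference the correction recovers the truth; with a reference that shares part of the bias, the "corrected" labels simply reproduce the biased reference.

```python
import numpy as np

def fit_error_model(features, rater_labels, reference_labels):
    """Least-squares fit of (reference - rater) as a linear function of the features."""
    X = np.column_stack([features, np.ones(len(features))])  # add intercept column
    residual = reference_labels - rater_labels
    coef, *_ = np.linalg.lstsq(X, residual, rcond=None)
    return coef

def correct(features, rater_labels, coef):
    """Add the predicted correction to the rater's answer."""
    X = np.column_stack([features, np.ones(len(features))])
    return rater_labels + X @ coef

rng = np.random.default_rng(0)
n = 500
features = rng.normal(size=(n, 1))
truth = rng.normal(size=n)

# Assumed systematic rater error: predictable from the feature, plus an offset.
rater = truth + 0.8 * features[:, 0] + 0.3

# Case 1: the scrutinized reference really is correct.
coef = fit_error_model(features, rater, truth)
corrected = correct(features, rater, coef)
err_clean = np.abs(corrected - truth).max()

# Case 2: the scrutinized reference shares part of the same systematic error.
biased_reference = truth + 0.5 * features[:, 0]
coef_b = fit_error_model(features, rater, biased_reference)
corrected_b = correct(features, rater, coef_b)
err_biased = np.abs(corrected_b - truth).mean()

print("max error with a correct reference:", err_clean)
print("mean error with a biased reference:", err_biased)
```

In the second case the error model is working exactly as designed; it just learns to map rater answers onto the reference, bias included, which is the objection raised above.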