This is a huge problem area in NLP. You raised quite a few issues, but to pick just two:
There is a large class of situations where the true model is just how a human would respond. For example, the answer to “Is this good art?” is only predictable with knowledge about the person answering (and the deictic ‘this’, but that’s a slightly different question). In these cases, I’d argue that the true model inherently needs to model the respondent. There’s a huge range, but even in the limit case where there is an absolute true answer (and the human is absolutely wrong), modeling the human’s response still seems valuable for any AI that has to interact with humans. In any case, here’s one slightly older link as an example of the literature:
https://www.aclweb.org/anthology/P15-1073/
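To make the idea concrete, here’s a minimal sketch of respondent-conditioned prediction: the classifier sees an annotator indicator alongside the text features, so it can learn rater-specific response patterns rather than averaging them away. This is purely illustrative (toy data, made-up names), not the model from the linked paper:

```python
# Sketch: condition a subjective-label classifier on who is answering.
# Toy data and names are hypothetical; not the linked paper's method.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

texts = ["bold use of color", "derivative and dull",
         "bold use of color", "derivative and dull"]
annotators = np.array([["alice"], ["alice"], ["bob"], ["bob"]])
labels = [1, 0, 0, 1]  # same text, different labels from different raters

text_feats = TfidfVectorizer().fit_transform(texts).toarray()
rater_feats = OneHotEncoder().fit_transform(annotators).toarray()

# Concatenating an annotator indicator lets the model represent
# "how would THIS person respond?" rather than a single true label.
X = np.hstack([text_feats, rater_feats])
clf = LogisticRegression().fit(X, labels)
```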
There’s a much larger literature on resolving issues of inter-rater reliability that may be of interest. Collected data are almost always noisy, and there is extensive research on measuring and handling that noise. Given the thrust of your article, a direction of particular relevance is active learning, where the system evaluates its own uncertainty and actively requests labels for the examples that would most improve its model. Another older example from which you can trace newer work:
https://dl.acm.org/doi/10.1109/ACII.2015.7344553
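A rough sketch of the simplest variant, pool-based uncertainty sampling: at each round the model is retrained, and the example it is least confident about is sent to the (here simulated) human rater for labeling. Again illustrative only, with an assumed synthetic oracle standing in for a human:

```python
# Sketch: pool-based active learning via uncertainty sampling.
# The oracle below simulates a human labeler; everything is toy data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_pool = rng.normal(size=(500, 5))               # unlabeled pool
true_w = rng.normal(size=5)
oracle = lambda X: (X @ true_w > 0).astype(int)  # stand-in for a human rater

labeled = list(rng.choice(500, size=10, replace=False))  # small seed set

for _ in range(5):
    clf = LogisticRegression().fit(X_pool[labeled], oracle(X_pool[labeled]))
    probs = clf.predict_proba(X_pool)[:, 1]
    uncertainty = -np.abs(probs - 0.5)  # closest to 0.5 = least confident
    # Query the most uncertain example we haven't labeled yet.
    for idx in np.argsort(uncertainty)[::-1]:
        if idx not in labeled:
            labeled.append(idx)
            break
```

The same loop structure carries over to the affect-annotation setting in the linked paper; the interesting research questions are in the uncertainty estimate and in deciding when querying a human is worth the cost.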