I was told that the error bars were somehow subsumed in the percentage.
They sort of are. In the long run, if your percentages aren’t calibrated, you’re doing something wrong. Issues of sensitivity and robustness are one form of error among others, and all of them fall under that one rubric: if you did your arithmetic wrong, you can expect to be uncalibrated; if you overestimated the quality of your data, you can expect to be uncalibrated; if you used rigid models requiring strong assumptions that are often violated in practice, you can expect to be uncalibrated. And so on.
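To make "calibrated" concrete, here is a minimal sketch of the usual check: group your forecasts by the probability you stated, then compare that with how often those things actually happened. The function name `calibration_table` and the forecast data are made up for illustration, not anyone's real track record.

```python
from collections import defaultdict


def calibration_table(predictions, n_bins=10):
    """Compare stated probabilities with observed frequencies.

    predictions: iterable of (stated_probability, outcome) pairs, where
    outcome is True if the predicted event happened.
    Returns a list of (mean_stated, observed_frequency, count) per bin.
    """
    bins = defaultdict(list)
    for prob, happened in predictions:
        # Clamp so that a stated probability of 1.0 lands in the top bin.
        index = min(int(prob * n_bins), n_bins - 1)
        bins[index].append((prob, happened))

    table = []
    for index in sorted(bins):
        rows = bins[index]
        mean_stated = sum(p for p, _ in rows) / len(rows)
        observed = sum(1 for _, h in rows if h) / len(rows)
        table.append((mean_stated, observed, len(rows)))
    return table


# Hypothetical forecasts: ten claims at 90% (nine came true) and
# ten claims at 60% (only three came true).
forecasts = (
    [(0.9, True)] * 9 + [(0.9, False)]
    + [(0.6, True)] * 3 + [(0.6, False)] * 7
)

for mean_stated, observed, count in calibration_table(forecasts):
    print(f"stated {mean_stated:.2f}  observed {observed:.2f}  n={count}")
```

In this invented data the 90% claims hold up and the 60% claims do not. The table only says that something went wrong with the 60% claims; it does not say whether the cause was arithmetic, data quality, or model assumptions.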
Maybe there should be more than one kind of error bar.
Every summary statistic loses something compared to a full posterior distribution. One error bar won’t cover all the questions one might want to ask, and it’s not clear in advance which error bars you will want. (Statistics and machine learning seem to be moving towards ensembles of models and hierarchical approaches, such as models over models, where one can vary all the knobs and see how the final answers perform; but ‘perform’ is going to be defined differently in different places.)