It wasn’t clear that this applied to the statement “we couldn’t improve on using these” (mainly because I forgot you weren’t considering interactions).
I excluded the rater and ratee from the averages.
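If it helps to see it spelled out, here's a minimal sketch of what that exclusion looks like, assuming a square ratings matrix with NaN for missing ratings (the layout and variable names here are hypothetical, not the actual data structure):

```python
import numpy as np

def others_average(ratings, rater, ratee):
    """Average of the ratings received by `ratee`, excluding both the
    rating given by `rater` and the ratee's self-rating.

    `ratings[i, j]` is person i's rating of person j, with NaN where a
    rating is missing (an assumed layout for illustration only).
    """
    col = ratings[:, ratee].astype(float)
    col[[rater, ratee]] = np.nan   # drop the rater's rating and the self-rating
    return np.nanmean(col)
```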
Okay, that gets rid of most of my worries. I'm not sure it accounts for the covariance between correlation estimates of different averages, though, so I'd be interested in seeing some bootstrapped confidence intervals. But perhaps I'm preempting future posts.
Also, thinking about it more, you point out a number of differences between correlations, and it's not clear to me whether those differences are significant or just noise.
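To make the bootstrap suggestion concrete, here's a rough sketch of the kind of thing I'd want to see (Python; the array names and setup are assumptions on my part, not your actual pipeline). Resampling rows jointly preserves the covariance between the two correlation estimates, and also gives an interval on their difference:

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_corr_cis(x1, x2, y, n_boot=10_000, alpha=0.05):
    """Percentile-bootstrap CIs for corr(x1, y), corr(x2, y), and their
    difference. Rows are resampled jointly, so the covariance between
    the two correlation estimates is preserved."""
    n = len(y)
    r1, r2 = np.empty(n_boot), np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)              # resample rows with replacement
        r1[b] = np.corrcoef(x1[idx], y[idx])[0, 1]
        r2[b] = np.corrcoef(x2[idx], y[idx])[0, 1]
    lo, hi = 100 * alpha / 2, 100 * (1 - alpha / 2)
    return {
        "corr(x1, y)": np.percentile(r1, [lo, hi]),
        "corr(x2, y)": np.percentile(r2, [lo, hi]),
        "difference":  np.percentile(r1 - r2, [lo, hi]),
    }
```

If the interval on the difference comfortably covers zero, I'd chalk the gap between the two correlations up to noise.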
I’m not sure whether this answers your question, but I used log loss as a measure of accuracy.
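Concretely, by log loss I mean the mean negative log-likelihood of the outcomes under the predicted probabilities; a minimal version of the same quantity that sklearn.metrics.log_loss computes:

```python
import numpy as np

def log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of binary outcomes under the
    predicted probabilities."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)  # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
```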
I was using “accuracy” in the technical sense, i.e., one minus what you call “Total Error” in your table. (It’s unfortunate that Wikipedia says scoring rules like log loss are a measure of the “accuracy” of predictions! I believe the technical usage, that is, the percentage properly classified by a binary classifier, is the more common one in machine learning.)
The total error of a model is in general not super informative, because it depends on the base rate of each class in your data as well as on the threshold you choose to convert your probabilistic classifier into a binary one. That’s why I generally prefer to see likelihood ratios, as you just reported, or ROC AUC scores (which integrate over a range of thresholds).
(Although apparently using AUC for model comparison is questionable too, because it’s noisy and incoherent in some circumstances and doesn’t penalize miscalibration, so you should use the H measure instead. I mostly like AUC as a relatively interpretable, utility-function-independent rough index of a model’s usefulness/discriminative ability, not as a model-comparison criterion.)
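To illustrate the threshold and base-rate point, here's a toy example with entirely made-up data (nothing to do with your dataset):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)

# Made-up imbalanced data: 10% positives, and a classifier whose
# predicted probabilities are somewhat informative.
n = 10_000
y = rng.binomial(1, 0.1, size=n)
p = np.clip(0.15 + 0.35 * y + rng.normal(0, 0.15, size=n), 0.01, 0.99)

# "Accuracy" (one minus total error) moves around with the threshold...
for threshold in (0.3, 0.5):
    accuracy = np.mean((p >= threshold) == (y == 1))
    print(f"threshold {threshold}: accuracy = {accuracy:.3f}")

# ...and the base rate alone already buys you ~90% accuracy if you
# always predict the majority class.
print(f"always predict negative: accuracy = {np.mean(y == 0):.3f}")

# ROC AUC sweeps over all thresholds and is insensitive to the base rate.
print(f"ROC AUC = {roc_auc_score(y, p):.3f}")
```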
More to follow (about to sleep), but regarding the above:
What do you have in mind specifically?