Nice writeup! A couple comments:

If the dataset contained information on a sufficiently large number of dates for each participant, we could not improve on using [the frequency with which members of the opposite sex expressed interest in seeing them again, and the frequency with which the participant expressed interest in seeing members of the opposite sex again].
I don’t think this is true. Consider the following model:
There is only one feature, eye color. The population is split 50-50 between brown and blue eyes. People want to date other people iff they are of the same eye color. Everyone’s ratings of eye color are perfect.
In this case, with only selectivity ratings, you can’t do better than 50% accuracy (any person wants to date any other person with 50% probability). But with eye-color ratings, you can get it perfect.
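For concreteness, here is a minimal simulation of this toy model (the setup, variable names and numbers are illustrative, not taken from the post's dataset):

    # Hypothetical sketch: pairs want to date iff their eye colors match.
    # Selectivity/desirability summaries carry no signal; trait ratings are fully informative.
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    n_pairs = 20000
    eye_a = rng.integers(0, 2, n_pairs)        # 0 = brown, 1 = blue
    eye_b = rng.integers(0, 2, n_pairs)
    wants_date = (eye_a == eye_b).astype(int)  # interest iff same eye color

    # Selectivity-style features: everyone says yes to half the population,
    # so these summaries are identical for every pair.
    X_selectivity = np.full((n_pairs, 2), 0.5)
    # Trait-rating features: the (perfect) reports of each person's eye color.
    X_traits = np.column_stack([eye_a, eye_b])

    for name, X in [("selectivity only", X_selectivity), ("eye-color ratings", X_traits)]:
        X_tr, X_te, y_tr, y_te = train_test_split(X, wants_date, random_state=0)
        acc = DecisionTreeClassifier().fit(X_tr, y_tr).score(X_te, y_te)
        print(name, "accuracy:", round(acc, 2))
    # selectivity only: ~0.50, eye-color ratings: ~1.00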
[correlation heatmap]
My impression is that there are significant structural correlations in your data that I don’t really understand the impact of. (For instance, at least if everyone rates everyone, I think the correlation of attr with attrAvg is guaranteed to be positive, even if attr is completely random.)
As a result, I’m having a hard time interpreting things like the fact that likeAvg is more strongly correlated with attr than with like. I’m also having a hard time verifying your interpretations of the observations that you make about this heatmap, because I’m not sure to what extent they are confounded by the structural correlations.
It seems implausible to me that each of the 25 correlations between the five traits of attractiveness, fun, ambition, intelligence and sincerity is positive.
Nitpick: There are only 10 distinct such correlations that are not 1 by definition.
The predictive power that we obtain
Model accuracy isn’t actually a great measure of predictive power, because it’s sensitive to base rates. (You at least mentioned the base rates, but it’s still hard to know how much to correct for the base rates when you’re interpreting the goodness of a classifier.)
As far as I know, if you don’t have a utility function, scoring classifiers in an interpretable way is still kind of an open problem, but you could look at ROC AUC as a still-interpretable but somewhat nicer summary statistic of model performance.
Model accuracy isn’t actually a great measure of predictive power, because it’s sensitive to base rates.
I was told that you only run into severe problems with model accuracy if the base rates are far from 50%. Accuracy feels pretty interpretable and meaningful here as the base rates are 30%-50%.
As far as I know, if you don’t have a utility function, scoring classifiers in an interpretable way is still kind of an open problem, but you could look at ROC AUC as a still-interpretable but somewhat nicer summary statistic of model performance.
ROC area under the curve does seem to have an awkward downside, though, in that it penalises you for poor predictions even when you set the sensitivity (the threshold) to a bad parameter. The F score is pretty simple and doesn’t have this drawback: it’s just a combination of precision and recall at some fixed threshold.
As you point out, there is ongoing research and discussion of this, which is confusing because as far as math goes, it doesn’t seem like that hard of a problem.
I was told that you only run into severe problems with model accuracy if the base rates are far from 50%. Accuracy feels pretty interpretable and meaningful here as the base rates are 30%-50%.
It depends on how much signal there is in your data. If the base rate is 60%, but there’s so little signal in the data that the Bayes-optimal predictions only vary between 55% and 65%, then even a perfect model isn’t going to do any better than chance on accuracy. Meanwhile the perfect model will have a poor AUC but at least one that is significantly different from baseline.
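A quick simulation of that scenario (the numbers are made up to match the example, not drawn from the dating data):

    # Base rate ~60%, Bayes-optimal probabilities confined to [0.55, 0.65]:
    # even the perfect model can't beat the majority-class baseline on accuracy,
    # but its ROC AUC is measurably above chance.
    import numpy as np
    from sklearn.metrics import accuracy_score, roc_auc_score

    rng = np.random.default_rng(0)
    p_true = rng.uniform(0.55, 0.65, size=100_000)  # Bayes-optimal predictions
    y = rng.binomial(1, p_true)                      # realised outcomes

    baseline_acc = max(y.mean(), 1 - y.mean())       # always predict the majority class
    model_acc = accuracy_score(y, (p_true > 0.5).astype(int))
    model_auc = roc_auc_score(y, p_true)

    print("baseline accuracy:", round(baseline_acc, 3))
    print("perfect-model accuracy:", round(model_acc, 3))
    print("perfect-model ROC AUC:", round(model_auc, 3), "(chance is 0.5)")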
[ROC AUC] penalises you for poor predictions even when you set the sensitivity (the threshold) to a bad parameter. The F score is pretty simple and doesn’t have this drawback: it’s just a combination of precision and recall at some fixed threshold.
I’m not really sure what you mean by this. There’s no such thing as an objectively “bad parameter” for sensitivity (well, unless your ROC curve is non-convex); it depends on the relative cost of type I and type II errors.
The F score isn’t comparable to AUC since the F score is defined for binary classifiers and the ROC AUC is only really meaningful for probabilistic classifiers (or I guess non-probabilistic score-based ones like uncalibrated SVMs). To get an F score out of a probabilistic classifier you have to pick a single threshold, which seems even worse to me than any supposed penalization for picking “bad sensitivities.”
there is ongoing research and discussion of this, which is confusing because as far as math goes, it doesn’t seem like that hard of a problem.
Because different utility functions can rank models differently, the problem “find a utility-function-independent model statistic that is good at ranking classifiers” is ill-posed. A lot of debates over model scoring statistics seem to cash out to debates over which statistics seem to produce model selection that works well robustly over common real-world utility functions.
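A toy illustration of how the ranking can flip (the sensitivities, specificities and costs below are invented for the example, not estimated from anything):

    # Two hypothetical classifiers with opposite strengths; which one has the
    # lower expected cost depends entirely on the relative cost of misses
    # (false negatives) versus false alarms (false positives).
    BASE_RATE = 0.5  # assumed prevalence of the positive class

    def expected_cost(sens, spec, cost_fn, cost_fp, p=BASE_RATE):
        # expected cost per case = P(pos)*P(miss)*cost_fn + P(neg)*P(false alarm)*cost_fp
        return p * (1 - sens) * cost_fn + (1 - p) * (1 - spec) * cost_fp

    for cost_fn, cost_fp in [(10, 1), (1, 10)]:
        a = expected_cost(0.90, 0.60, cost_fn, cost_fp)  # classifier A: sensitive
        b = expected_cost(0.60, 0.90, cost_fn, cost_fp)  # classifier B: specific
        print(f"FN cost {cost_fn}, FP cost {cost_fp}: A={a:.2f}, B={b:.2f}, prefer", "A" if a < b else "B")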
It depends on how much signal there is in your data. If the base rate is 60%, but there’s so little signal in the data that the Bayes-optimal predictions only vary between 55% and 65%, then even a perfect model isn’t going to do any better than chance on accuracy.
Makes sense.
I’m not really sure what you mean by this. There’s no such thing as an objectively “bad parameter” for sensitivity (well, unless your ROC curve is non-convex); it depends on the relative cost of type I and type II errors.
I think they both have their strengths and weaknesses. When you give your model to a non-statistician to use, you’ll set a decision threshold. If the ROC curve is non-convex, then yes, some regions are strictly dominated by others. Then area under the curve is a broken metric, because it gives some weight to completely useless regions. You could replace the dominated regions with the bits that dominate them, but that’s inelegant. And even where the second derivative is merely near zero, AUC still cares too much about regions that would only be used under an extreme utility function.
So in a way it’s better to take a balanced F1 score, and maximise it. Then, you’re ignoring the performance of the model at implausible decision thresholds. If you are implicitly using a very wrong utility function, then at least people can easily call you out on it.
For example, here the two models have similar AUC, but for the range of decision thresholds that you would plausibly set, the blue model is better: at least it’s clearly good at something.
Obviously, ROC has its advantages too and may be better overall, I’m just pointing out a couple of overlooked strengths of the simpler metric.
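As a rough sketch of the “maximise a balanced F1” idea (illustrative data and model, not the post’s code): scan the candidate thresholds, report the best F1, and compare with ROC AUC.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import precision_recall_curve, roc_auc_score
    from sklearn.model_selection import train_test_split

    # Synthetic data standing in for the real features.
    X, y = make_classification(n_samples=5000, weights=[0.65, 0.35], random_state=1)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)
    scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

    precision, recall, thresholds = precision_recall_curve(y_te, scores)
    f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
    best = np.argmax(f1[:-1])  # the last precision/recall pair has no threshold
    print(f"best F1 {f1[best]:.3f} at threshold {thresholds[best]:.2f}")
    print(f"ROC AUC {roc_auc_score(y_te, scores):.3f}")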
Because different utility functions can rank models differently, the problem “find a utility-function-independent model statistic that is good at ranking classifiers” is ill-posed. A lot of debates over model scoring statistics seem to cash out to debates over which statistics seem to produce model selection that works well robustly over common real-world utility functions.

Yes.
Thanks Ben!

In this case, with only selectivity ratings, you can’t do better than 50% accuracy (any person wants to date any other person with 50% probability). But with eye-color ratings, you can get it perfect.
Edit: I initially misread your remark. I tried to clarify the setup with:
In this blog post I’m restricting consideration to signals of the partners’ general selectivity and general desirability, without considering how their traits interact.
Is this ambiguous?
My impression is that there are significant structural correlations in your data that I don’t really understand the impact of. (For instance, at least if everyone rates everyone, I think the correlation of attr with attrAvg is guaranteed to be positive, even if attr is completely random.)
I may not fully parse what you have in mind, but I excluded the rater and ratee from the averages. This turns out not to be enough to avoid contamination for subtle reasons, so I made a further modification. I’ll be discussing this later, but if you’re wondering about this particular point, I’d be happy to now.
The relevant code is here. Your remark prompted me to check my code by replacing the ratings with random numbers drawn from a normal distribution. Using 7 ratings and 7 averages, the mean correlation is 0.003, with 23 negative and 26 positive.
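For reference, a rough re-creation of that kind of randomisation check (the matrix layout and sizes below are assumed, not taken from the linked code):

    # Fill the rating matrix with N(0,1) noise, build each ratee's leave-one-out
    # average (excluding the current rater), and confirm the rating/average
    # correlation hovers around zero.
    import numpy as np

    rng = np.random.default_rng(0)
    n_raters, n_ratees = 100, 100
    ratings = rng.normal(size=(n_raters, n_ratees))   # ratings[i, j]: rater i on ratee j

    col_sums = ratings.sum(axis=0)
    loo_avg = (col_sums[None, :] - ratings) / (n_raters - 1)  # average received, minus rater i

    r = np.corrcoef(ratings.ravel(), loo_avg.ravel())[0, 1]
    print(f"correlation between random ratings and leave-one-out averages: {r:+.4f}")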
Nitpick: There are only 10 distinct such correlations that are not 1 by definition.
Thanks, that was an oversight on my part. I’ve edited the text.
Model accuracy isn’t actually a great measure of predictive power, because it’s sensitive to base rates. (You at least mentioned the base rates, but it’s still hard to know how much to correct for the base rates when you’re interpreting the goodness of a classifier.)
I suppressed technical detail in this first post to make it more easily accessible to a general audience. I’m not sure whether this answers your question, but I used log loss as a measure of accuracy. The differentials were (approximately, the actual final figures are lower):
For Men: ~0.690 to ~0.500.
For Women: ~0.635 to ~0.567.
For Matches: ~0.432 to ~0.349
I’ll also be giving figures within the framework of recommendation systems in a later post.
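For readers unfamiliar with the metric, a generic sketch of how such a log-loss comparison is computed (made-up numbers, not the post’s code):

    import numpy as np
    from sklearn.metrics import log_loss

    def logloss_pair(y_true, y_prob):
        # Baseline: predict the base rate for every pair; model: per-pair probabilities.
        baseline = np.full_like(y_prob, y_true.mean())
        return log_loss(y_true, baseline), log_loss(y_true, y_prob)

    # Illustrative outcomes and predictions; a ~50% base rate gives a baseline near 0.693.
    y = np.array([1, 0, 1, 1, 0, 0, 1, 0])
    p = np.array([0.8, 0.3, 0.7, 0.6, 0.2, 0.4, 0.9, 0.1])
    print(logloss_pair(y, p))  # (baseline log loss, model log loss)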
As far as I know, if you don’t have a utility function, scoring classifiers in an interpretable way is still kind of an open problem, but you could look at ROC AUC as a still-interpretable but somewhat nicer summary statistic of model performance.

Thanks, I’ve been meaning to look into this.
It wasn’t clear that this applied to the statement “we couldn’t improve on using these” (mainly because I forgot you weren’t considering interactions).
I excluded the rater and ratee from the averages.
Okay, that gets rid of most of my worries. I’m not sure it accounts for covariance between correlation estimates of different averages, though, so I’d be interested in seeing some bootstrapped confidence intervals. But perhaps I’m preempting future posts.
Also, thinking about it more, you point out a number of differences between correlations, and it’s not clear to me that those differences are significant as opposed to just noise.
I’m not sure whether this answers your question, but I used log loss as a measure of accuracy.
I was using “accuracy” in the technical sense, i.e., one minus what you call “Total Error” in your table. (It’s unfortunate that Wikipedia says scoring rules like log-loss are a measure of the “accuracy” of predictions! I believe the technical usage, that is, percentage properly classified for a binary classifier, is a more common usage in machine learning.)
The total error of a model is in general not super informative because it depends on the base rate of each class in your data, as well as the threshold that you choose to convert your probabilistic classifier into a binary one. That’s why I generally prefer to see likelihood ratios, as you just reported, or ROC AUC scores (which integrate over a range of thresholds).
(Although apparently using AUC for model comparison is questionable too, because it’s noisy and incoherent in some circumstances and doesn’t penalize miscalibration, so you should use the H measure instead. I mostly like it as a relatively interpretable, utility-function-independent rough index of a model’s usefulness/discriminative ability, not a model comparison criterion.)
Also, thinking about it more, you point out a number of differences between correlations, and it’s not clear to me that those differences are significant as opposed to just noise.
More to follow (about to sleep), but regarding the above: what do you have in mind specifically?