Methodology for predicting speed dating participants’ decisions
In my last post I described phenomena that I used to predict speed dating participants’ decisions by estimating the participants’ general selectivity and perceived desirability. I was planning on following up with a discussion of the phenomena that I used to refine the model by taking into account differences between individuals. But since comments focused on methodology rather than the empirical phenomena, I decided to write about methodology first, so that readers wouldn’t have to suspend disbelief while reading my next post.
This post is more dense and technical than my last one. I wrote it for readers who want to check the details of the work, or who have a strong interest in statistics and/or machine learning. If you don’t fall into either category but are interested in the series, you can skip it without loss of continuity.
Here I’ll address three points:
The situation that I attempted to simulate and how faithful one should expect the simulation to be.
The exact definitions of rating averages that I referenced in my last post.
My criteria for including a feature in the model.
The underlying question that I attempted to address is “Suppose that a speed dating company wanted to organize events with more matches. How could machine learning help?”
As Ryan Carey pointed out, the model that I developed uses data about other speed dates that participants had been on to predict decisions on a given speed date. It is possible to make nontrivial predictions exclusively using information that was available before the participants attended the events, but I haven’t systematically explored how well one could do. So the model that I developed is potentially useful only in the special case where participants had attended similar past events.
In fact, the participants in the dataset attended only a single speed dating event, not multiple events, so it’s not possible to directly check whether the model would in fact predict behavior at future events based on past events. I instead simulated a situation where participants had attended similar events in the past, by imagining that for a given date, all other dates that the pair of people had been on had occurred in a past event.
It’s very likely that the simulation overstates the predictive power that the model would give in practice, if for no other reason than regression to the mean. For example, the most popular participants at an event are more likely than the other participants to have been at their best that day. So one can be less confident that someone who was chosen by most of their dates at one event will be chosen by partners at a different event than that the same person will be chosen by other partners at the same event.
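The regression-to-the-mean effect can be illustrated with a small simulation. This is only a sketch on made-up data: the uniform range of "true" yes-probabilities, the 15 dates per event, and the function name `simulate_regression_to_mean` are all assumptions for illustration, not quantities taken from the dataset.

```python
import random

def simulate_regression_to_mean(n_people=1000, n_dates=15, top_fraction=0.1, seed=0):
    """Return (same-event yes rate, different-event yes rate) for the top group."""
    rng = random.Random(seed)
    # Hypothetical stable desirability: each person's long-run chance of a 'yes'.
    true_p = [rng.uniform(0.2, 0.8) for _ in range(n_people)]

    def yes_rate(p):
        # Observed fraction of 'yes' decisions over one event's dates.
        return sum(rng.random() < p for _ in range(n_dates)) / n_dates

    event1 = [yes_rate(p) for p in true_p]
    event2 = [yes_rate(p) for p in true_p]

    # Select the most popular people using event 1 only.
    k = int(n_people * top_fraction)
    top = sorted(range(n_people), key=lambda i: event1[i], reverse=True)[:k]
    same = sum(event1[i] for i in top) / k
    new = sum(event2[i] for i in top) / k
    return same, new

same_event, new_event = simulate_regression_to_mean()
print(f"top group, same event: {same_event:.2f}; different event: {new_event:.2f}")
```

The people selected for being popular at the first event are still above average at the second, but less so, because part of what got them selected was luck on the day.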
If one were to apply the model in a real world setting, one would collect data that allowed one to quantify the expected regression to the mean, and also to improve the model.
Average ratings
Conceptually, the foundation of the model is the idea that you can infer a participant’s traits from:
Averages of the ratings that members of the opposite sex gave the participant (one average for each type of rating).
Averages of the ratings that the participant gave members of the opposite sex.
For the sake of limiting unnecessary verbiage, it’s useful to think of the decision that a participant makes on a partner as a “rating,” where a ‘no’ decision corresponds to a rating of 0 and a ‘yes’ decision corresponds to a rating of 1.
The first point to make is that given a rater / ratee pair, we need to exclude from consideration both the ratings that the rater gave the ratee and the ratings that the ratee gave the rater. This is because we’re trying to predict whether two people who have never been on a speed date would be interested in seeing each other again if they were to go on a speed date.
Excluding these ratings wouldn’t be crucial if the speed dating events involved each person going on thousands of speed dates: in that case, the ratings that the two people had given each other would correspond to slight perturbations of the averages. But when an event involves only ~15 people, the impact of a single rating on somebody’s average can be large enough so that failing to exclude the individuals’ ratings of one another would substantially overstate the predictive power of the model while simultaneously obscuring what was going on.
Given a rating type R, and two participants A and B whose decisions we’re trying to predict, let R(A,B) be the rating that A gave B, and let R(B) be the sum of the ratings that were given to B. Let N be the number of people who rated B. One might think that the right features to look at are
[R(B) - R(A,B)]/(N − 1) (**)
But these features are still contaminated with the decisions we’re trying to predict. To see this, consider the case of a dataset including only a single ratee B. In this case, R(B) is constant, so when the rating type R is ‘decision,’ the feature’s value depends only on R(A,B), so that one can solve for R(A,B) in terms of the feature.
Even though we have many more than one rater, the contamination is still an issue. Some machine learning algorithms are capable of learning the identities of individual raters, and if they do so, they can learn how to solve (**) for each individual rater.
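To make the contamination concrete, here is a minimal sketch of the single-ratee case. The decisions are made up, and the helper names `leave_one_out_average` and `recover_rating` are hypothetical; the point is only that once R(B) and N are fixed, (**) can be inverted to read off R(A,B).

```python
def leave_one_out_average(total, own_rating, n_raters):
    """Feature (**): average of B's ratings excluding the one that A gave."""
    return (total - own_rating) / (n_raters - 1)

def recover_rating(feature, total, n_raters):
    """Invert (**): with R(B) and N known, the feature reveals R(A,B)."""
    return total - feature * (n_raters - 1)

# Hypothetical decisions (1 = yes, 0 = no) that five raters gave one ratee B.
decisions = [1, 0, 1, 1, 0]
total, n = sum(decisions), len(decisions)

features = [leave_one_out_average(total, r, n) for r in decisions]
recovered = [round(recover_rating(f, total, n)) for f in features]
print(features)
print(recovered)
```

Every "feature" value here maps back to exactly the decision it was supposed to help predict, which is why a sufficiently flexible learner can cheat with (**).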
Rather than using (**), we imagine that at the event, B had been on a date with someone other than A, whom we call a “surrogate” of A. We model the surrogate of A using another participant A’ that B dated. Conceptually, A’ is a randomly selected participant amongst the participants whom B dated, but literally picking one at random would break the symmetry of the data in a way that could dilute the statistical power of the data, so I instead made a uniform choice to replace A by the participant whom B would have dated that round if the speed dating schedule had been slightly different. This gives the feature:
[R(B) - R(A,B) + R(A’, B)]/N
In the special case where the rating type is “decision,” the averages correspond to frequencies, and for ease of comparison with other features these are most naturally replaced by their log odds ratios, so I did this.
I normalized these averages by subtracting off the average of all ratings that participants of B’s gender would have received at the event had the surrogates of A and B attended the event in lieu of A and B. This washes out heterogeneity in raters’ rating scales from event to event.
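The surrogate-based average and the log odds transform can be sketched as follows. This is an illustration under assumptions, not the code I used: the surrogate is passed in explicitly rather than derived from the schedule, the clipping constant `eps` in `log_odds` is a hypothetical way of keeping frequencies of 0 or 1 finite, and the per-event normalization is left out.

```python
import math

def surrogate_average(ratings_of_b, rater_a, surrogate_a):
    """[R(B) - R(A,B) + R(A',B)] / N for one ratee B.

    ratings_of_b maps each rater's id to the rating that rater gave B.
    """
    total = sum(ratings_of_b.values())
    n = len(ratings_of_b)
    return (total - ratings_of_b[rater_a] + ratings_of_b[surrogate_a]) / n

def log_odds(p, eps=0.05):
    """Log odds of a frequency; clipping at eps is an assumption of this sketch."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))

# Hypothetical 'decision' ratings (1 = yes, 0 = no) that four men gave woman B.
ratings_of_b = {"m1": 1, "m2": 0, "m3": 1, "m4": 1}

# Predicting m2's decision on B, with m3 standing in as m2's surrogate.
freq = surrogate_average(ratings_of_b, rater_a="m2", surrogate_a="m3")
print(freq, log_odds(freq))
```

Because A's own rating is swapped out for the surrogate's, changing m2's decision in `ratings_of_b` leaves `freq` unchanged, so the feature carries no information about the decision being predicted.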
Distinguishing noise from signal: my criteria for including a feature
In order to avoid overfitting the dataset in a way that reduces the generalizability of the findings, I imposed a high threshold for features to meet to be included in the model. From the point of view of discovery, this was very helpful insofar as it helped me discover the core phenomena that I used.
One could argue that the filters are collectively too strict, but I’ve chosen to use them for several reasons:
The tendency to see signal in noise is so strong that it seems that it’s nearly always the case that when people make effort to avoid it, they’re not doing enough, so it seems better to err on the conservative side.
I wanted to make an unambiguous case for the features that I did include adding incremental predictive power. I’m fairly confident that to the extent that the factors that influenced the participants at the event reflect general human behavioral tendencies, the predictive power of the features that I identified also generalizes. My main source of uncertainty is that nobody’s checked my work in detail.
From an expository point of view, the effect sizes of the features that I excluded are arguably too small for them to warrant comment.
If I were strictly focused on optimizing for predictive power, I would have included features that improve predictive power by a tiny margin with 60% confidence, but I had no reason to do so: even in aggregate, the resulting difference in predictive power wouldn’t have been striking, it’s unlikely that anyone will actually use the model, and even if someone does, there will be opportunities to collect more data and make a better model.
What’s interesting is not so much exactly how predictive the model is, but what the main driving factors are and how they interact.
I’ve enumerated the criteria below. In practice, there’s a fair amount of redundancy between them: if a feature didn’t pass through one of them, it usually failed to pass through at least one other. But this fact only emerged gradually, and I used each individually at different times.
I tried to keep the number of features that I used small
The dataset that I’ve been working with is derived from 9 speed dating events involving ~160 people of each gender, for a total of ~3000 dates. The size is sufficiently large that we can hope to get a broad sense for what’s going on, but not sufficiently large that we can determine the influence of individual idiosyncrasies in great detail. If we hope for too much, we’re apt to base our model on patterns that don’t generalize, regardless of how much cross checking we do.
My final model uses only 5 features to predict men’s decisions and only 3 features to predict women’s decisions.
I only included a feature when the fact that it increased the model’s performance was in consonance with my intuitions
For example, I found that empirically, people who expressed a preference for people who share their interests were considered to be undesirable, but given the small size of the dataset and the absence of evidence for the phenomenon coming from other sources, using this to make predictions seemed ill-advised.
I restricted myself to using features that were derived from a relatively large number of examples, both of speed dates and of people.
The female engineering graduate students in the sample showed a very strong preference for male engineering graduate students over other men. They were also far more receptive to dating the male engineering students than other women were. The engineering/engineering cross feature passed through all other filters that I used aside from this one, but though there were 40 dates between engineering graduate students, they involved only 6 women, so I dropped the feature.
I used cross validation
Suppose there were 20 people who have some trait X, and that most of them were considered about as desirable as usual, but 2 of them were rejected by everyone. In this case it might look like people with trait X are a little less likely to be chosen. We don’t want to base our model on participants’ responses to only two people.
If we split the dataset into two subsets, train our model on one, and test it on the other, then if one of the unpopular people is in the train set and one is in the test set, including the feature could increase the model’s performance on the test set. With a dataset of this size, the boost in performance could be large enough so that one would be inclined to include the feature based on the increase in performance.
The standard method used to avoid this problem is cross-validation: instead of using a single train/test split, use many train/test splits. If including a feature in the model improves performance for a large fraction of train/test splits with sufficiently low redundancy between them, that can provide much stronger evidence that the predictive power of the feature will generalize.
For each event, I split the data into a test set consisting of the event, and a train set consisting of all other events. With this setup:
When both of the unpopular people are in the train set, including trait X as a feature makes the model’s predictions for the test set worse.
In the instances where one of the people is in the train set and the other is in the test set, including the feature may improve performance. But there are at most 2 such instances out of 9 train/test splits.
Should it happen that both people were at the same event, including the feature won’t improve performance for any of the events, because when the two people are in the test set, there’s no pattern in the train set for the model to pick up on.
The fact that the model never does better in this case is helpful, because flukish occurrences are more likely to be concentrated in a single event than they are to be split up over different events: for example, maybe the two unusual people are friends who have a lot in common and signed up for the same event together.
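The splitting scheme itself is simple enough to sketch. The record layout (dicts with an `event` key) and the function name `leave_one_event_out` are assumptions for illustration; the toy dates are made up.

```python
def leave_one_event_out(dates):
    """Yield (event, train, test) with one split per event.

    dates is a list of dicts with an 'event' key; other keys are untouched.
    """
    events = sorted({d["event"] for d in dates})
    for held_out in events:
        train = [d for d in dates if d["event"] != held_out]
        test = [d for d in dates if d["event"] == held_out]
        yield held_out, train, test

# Hypothetical toy data: three events with two dates each.
dates = [{"event": e, "rater": r, "ratee": t, "decision": y}
         for e, r, t, y in [(1, "a", "x", 1), (1, "b", "x", 0),
                            (2, "c", "y", 1), (2, "d", "y", 1),
                            (3, "e", "z", 0), (3, "f", "z", 1)]]

splits = list(leave_one_event_out(dates))
for event, train, test in splits:
    print(event, len(train), len(test))
```

Since nobody in the dataset attended more than one event, holding out a whole event also guarantees that no rater or ratee appears on both sides of a split.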
I required that when predictions are generated in this way (with one train/test split for each event), every feature that I include improve performance:
(1) When we average all predictions made across the whole dataset.
(2) For a majority of events when we look at the data by event.
(3) For a majority of raters when we look at the data by rater.
(4) For a majority of ratees when we look at the data by ratee.
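A check of these four criteria might look like the sketch below. It assumes that held-out predictions have already been scored, so that each date carries a loss with and without the candidate feature (the keys `loss_without` and `loss_with` are hypothetical), and it takes "improves performance for a group" to mean that the group's summed loss drops, which is one reasonable reading rather than a definition from the text.

```python
from collections import defaultdict

def fraction_improved(rows, key):
    """Fraction of groups (by event, rater, or ratee) whose total loss drops."""
    gain = defaultdict(float)
    for row in rows:
        gain[row[key]] += row["loss_without"] - row["loss_with"]
    return sum(g > 0 for g in gain.values()) / len(gain)

def passes_all_criteria(rows):
    """Apply the four criteria to held-out per-date losses."""
    avg_boost = sum(r["loss_without"] - r["loss_with"] for r in rows) / len(rows)
    report = {
        "avg_boost": avg_boost,
        "events": fraction_improved(rows, "event"),
        "raters": fraction_improved(rows, "rater"),
        "ratees": fraction_improved(rows, "ratee"),
    }
    ok = avg_boost > 0 and all(report[k] > 0.5 for k in ("events", "raters", "ratees"))
    return ok, report

# Hypothetical per-date losses from held-out predictions.
rows = [
    {"event": 1, "rater": "a", "ratee": "x", "loss_without": 0.70, "loss_with": 0.60},
    {"event": 1, "rater": "b", "ratee": "x", "loss_without": 0.50, "loss_with": 0.55},
    {"event": 2, "rater": "c", "ratee": "y", "loss_without": 0.90, "loss_with": 0.70},
    {"event": 2, "rater": "d", "ratee": "y", "loss_without": 0.40, "loss_with": 0.35},
]
ok, report = passes_all_criteria(rows)
print(ok, report)
```

In this toy example the feature helps on average, for both events, for both ratees, and for three of the four raters, so it passes.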
Having spent a long time with the dataset, it was more or less clear to me that the train/test splits that I used were enough, but I realized this may not be a priori clear, so I did a final check in which, before forming the train/test splits, I removed each individual from the dataset in turn, and each wave from the dataset in turn. This is in the spirit of leave-one-out cross validation. It turns out to be overkill: (1)-(4) are never violated for any feature that I used, except for one that occasionally fell short of meeting criterion (4) by a single ratee.
I measured performance using “log loss,” which is a technical measure of the quality of probabilistic predictions. I omit a description of it because I figure that readers either already know it or don’t have the time/energy to absorb an explanation, but I can write about it if someone would like.
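For readers who want the definition anyway, here is the standard binary log loss as a short sketch. The clipping constant `eps` and the example predictions are illustrative choices, not values from the model.

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of binary outcomes under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Hypothetical decisions and two sets of predicted 'yes' probabilities.
decisions = [1, 0, 0, 1]
confident = [0.9, 0.1, 0.2, 0.8]
uninformed = [0.5, 0.5, 0.5, 0.5]
print(log_loss(decisions, confident), log_loss(decisions, uninformed))
```

Lower is better: always predicting 50% gives a log loss of ln 2 ≈ 0.693, and predictions that are both confident and correct push the number toward 0, while confident wrong predictions are penalized heavily.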
The tables below show how much predictive power increases when we include a given feature, starting from a base consisting of all other features that we used. Here the columns correspond to criteria (1)-(4), and the numbers in the “Avg boost” column are drops in log loss. Since I haven’t defined the features, I’ve left them unlabeled, but I’ll label them once I’ve written my next post.
Women’s decisions:
| Feature | Avg boost | % of events | % of raters | % of ratees |
|---|---|---|---|---|
| 1 | 0.0874 | 100% | 63% | 83% |
| 2 | 0.0645 | 100% | 75% | 69% |
| 3 | 0.0035 | 100% | 58% | 55% |
Men’s decisions:
| Feature | Avg boost | % of events | % of raters | % of ratees |
|---|---|---|---|---|
| 1 | 0.1162 | 100% | 64% | 90% |
| 2 | 0.0874 | 100% | 84% | 72% |
| 3 | 0.0030 | 78% | 56% | 53% |
| 4 | 0.0024 | 89% | 55% | 55% |
| 5 | 0.0017 | 67% | 64% | 67% |
The tables and the criteria that I described don’t tell the whole story as far as overfitting goes: the features depend on numerical parameters, which are themselves overfit to the model, in the sense that to some extent I picked them with a view toward maximizing the numbers in the table.
But this sort of overfitting corresponds to optimizing the expected performance of the model on hypothetical future datasets, which is the opposite of picking features that are likely to be predictive only in the context of the dataset. It overstates the predictive power of the model in more general contexts, but it’s simultaneously the case that not doing it would produce a model that performs worse in general settings.
The choices that I made seem fairly natural, and to the extent that they overstate the model’s predictive power, the effect seems likely to be minor. If one had more data, one could obtain improved estimates for the numerical parameters. The more serious distortion in potential predictive power comes from the absence of data on participants across multiple events.
Thanks to Brian Tomasik for catching an error in an earlier version of this post.
I still don’t understand how you’ve built this model on a basic level. I’m not yet an expert at machine learning, so I’ll ask a couple of questions, some of which are nitpicky, to improve my understanding:
Ultimately, we’re trying to infer the participant’s rating of another participant, right? And then you mention traits. Here are you talking about how attractive, fun, ambitious, et cetera the person is? And then, are you also inferring that people care different amounts about these traits?
So you’re talking about the average of how all other individuals rated B. Are you averaging across fun, ambition, intelligence and sincerity here to find out B’s overall popularity, or are you trying to figure out how fun, ambitious or intelligent they are?
No problem, the post is a lot to take in at once. Thanks very much for your interest, more than anything else I’m happy that someone is reading my posts :-).
In my last post, I made reference to features like “average sincerity rating” without being precise about what I meant. Here I’m just giving precise definitions
Ultimately we’re trying to infer the participant’s decision on another partner.
The correlation matrixes from my last post showed that if you want to predict e.g. how a woman will rate a man’s ambition, you get predictive power by looking at the average of how other women rate his ambition, and moreover the predictive power is greater than the predictive power that one obtains by looking at the average of how women rate his attractiveness, or how fun he is, or how intelligent he is, or how sincere he is.
So the average of ratings of his ambition is picking up on some underlying trait that he possesses. The simplest guess is that the trait is what we think of when we think of ambition, but in practice it’s really whatever about him makes women perceive him to be ambitious – it could be that he seems very focused on work, it could be that he tends to wear business suits, etc.
Once we form the average, we have a measurement of the underlying trait and we can use it for any number of things. We can try to use it to estimate how desirable women find him on average. We can try to use it to estimate how selective he is on average. We can compare it with the corresponding metric for women and explore whether men tend to prefer women who are of a similar level of ambition in general.
There are two things that you might be asking here. If you’re asking about how much people tend to care about a trait in general, one can do logistic regression and look at regression coefficients.
But maybe you’re asking about whether I’m inferring how much individual people care about the different traits, preempting my next post. One could do logistic regression for each individual, but there are ~15 dates per person and ~5 traits, so one doesn’t really have enough data for this to be informative. My impression is that the right way to go about this is via multilevel modeling, but I haven’t yet figured out how to adapt the general methodology to my particular situation.
What I did find is that in the special cases of attractiveness and fun, one has enough statistical power to extract nontrivial information that yields incremental predictive power, just by looking at the correlations between a participant’s decisions and his or her partners’ attractiveness and fun averages.
R is a fixed rating type, so the latter. I do average across averages of ratings on different dimensions to determine overall popularity, as I discuss in the section “A composite index to closely approximate a ratee’s desirability” in my last post, but that’s a separate matter.
Thanks for the excellent answers. That all makes sense and clears up a lot in my mind about this post and the previous one. Just two quick questions/comments for now:
But maybe you’re asking about whether I’m inferring how much individual people care about the different traits, preempting my next post. One could do logistic regression for each individual, but there are ~15 dates per person and ~5 traits, so one doesn’t really have enough data for this to be informative. My impression is that the right way to go about this is via multilevel modeling, but I haven’t yet figured out how to adapt the general methodology to my particular situation.
Yes, the latter is the question. I imagine that you might be able to use a collaborative filtering algorithm as described here. In the video, Andrew Ng supposes that you’re matching films with individuals, using a sparsely populated matrix of match values, assuming that you know which genres different individuals like. Your problem seems identical, just you know the features of the people, rather than their tastes.
I don’t know about multilevel modelling.
So as I understand, you still used data that you wouldn’t’ve had in practice? Would it be a viable alternative to just take the average from dates that preceded the one you’re trying to predict? In general, predicting future from past seems simple and good if the data is time-labelled, though I might be missing the issue here.
Thanks again for your interest – your questions have helped me clarify my thinking.
I had tried this: the code that I used is here: for each date at an event, I looked at the rating matrix associated with the event but with a missing entry corresponding to the date, and then used an R library called recommenderlab to produce a guess for the missing rating.
The situation is that one gets almost nothing more than what one would get by just using the average rating that the rater gave and the average rating that the ratee received. A typical example of how the actual ratings and guesses compare is given here – here the raters are men, the ratees are women, and the rating type is attractiveness. The rows correspond to raters and the columns to ratees. The rows have been normalized, so that the sum of a given rater’s ratings is 0. You can see that after controlling for variation in rating scales, the guesses for a given ratee are virtually identical across raters.
Yet collaborative filtering appears to have been applied successfully in the context of online dating, for example, as reported on by Brozovsky and Petricek (2007) and the papers that cite it, even in contexts where the average number of ratings per person is not so large, so I don’t know why I didn’t have more success with this approach.
I’ve been exploring this over the past few days, and will probably write about it in my next post. For simplicity, say we want to model the probability of a decision by the k’th rater as
logit(P) = A(k) + B(k)*attrAvg
where P is the decision probability, A(k) and B(k) are constants specific to the k’th rater, and attrAvg is the average attractiveness of the rater’s partner. Rather than determining A(k) and B(k) by simply doing a logistic regression for each rater, we can instead fit A(k) and B(k) using a prior on the distributions of A(k) and B(k): for example, one can assume that they’re normally distributed. My first impression is that determining the means of the hypothesized normal distributions is simply a matter of fitting
logit(P) = A + B*attrAvg
where A and B are uniform over raters, and that the nontrivial part is determining the standard deviations of the hypothesized distributions while simultaneously estimating all of the A(k) and B(k).
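Here is a rough sketch of the easy half of that idea: given pooled values of A and B and given standard deviations for the normal priors, find the most probable A(k) and B(k) for one rater. Everything numerical below is made up, the function name `fit_rater` is hypothetical, and treating the standard deviations as known deliberately skips the nontrivial part; a real multilevel fit would estimate them jointly.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_rater(xs, ys, prior_mean, prior_sd, steps=5000):
    """MAP estimate of (A(k), B(k)) for one rater under independent normal priors.

    prior_mean = (A, B) are the pooled coefficients; prior_sd = (sd_A, sd_B)
    are treated as known here, which is an assumption of this sketch.
    """
    a, b = prior_mean
    prec_a, prec_b = 1 / prior_sd[0] ** 2, 1 / prior_sd[1] ** 2
    # Step size from a curvature bound, so plain gradient ascent converges.
    lr = 1.0 / (0.25 * sum(1 + x * x for x in xs) + max(prec_a, prec_b))
    for _ in range(steps):
        resid = [y - sigmoid(a + b * x) for x, y in zip(xs, ys)]
        # Likelihood gradient minus the pull of the prior toward the pooled values.
        grad_a = sum(resid) - prec_a * (a - prior_mean[0])
        grad_b = sum(r * x for r, x in zip(resid, xs)) - prec_b * (b - prior_mean[1])
        a, b = a + lr * grad_a, b + lr * grad_b
    return a, b

# Hypothetical rater who said yes on all six dates; xs are centered attrAvg values.
xs = [-1.0, -0.5, 0.0, 0.5, 1.0, 1.5]
ys = [1, 1, 1, 1, 1, 1]
pooled = (-0.7, 1.0)  # made-up pooled A and B

a_tight, b_tight = fit_rater(xs, ys, pooled, prior_sd=(0.5, 0.5))
a_loose, b_loose = fit_rater(xs, ys, pooled, prior_sd=(3.0, 3.0))
print(a_tight, a_loose)
```

For a rater like this, a separate per-rater logistic regression has no finite solution, since the data are perfectly separated. The prior keeps the estimate finite, and a tighter prior holds it closer to the pooled value.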
The reason why I hesitated to go in that direction is just that the sample sizes are already small: if one is talking about a speed dating event with 16 men and 16 women, one can use the rating averages from the first 8 rounds to predict what will happen in the last 8 rounds, but the loss in statistical power could be very large: given that women’s decisions were yes only ~33% of the time, the decision frequencies when the rater is a woman would be based on only 2-3 yes decisions per woman.
But I should do a cross check, and see whether predictive power diminishes as a function of how late a date occurred in the event. I’ll do this and get back to you.