Evaluating Multiple Metrics (where not all are required)
This is my first article, and I’m submitting it in the discussion forum, so hopefully I’ve done this correctly and we can discuss!
Anyway, I have a group of friends who are really interested in movies, and they feel very strongly about them. I find their convictions interesting, specifically the way they adamantly argue that, for instance, Midnight in Paris is a “better” movie than Bridesmaids or whatever. I got to thinking about how one would create metrics by which you could evaluate any movie.
First attempt: A simple scale by which you give rankings (1-10) to a list of movie attributes (the metrics), sum up the total, and the highest number is the best movie.
Some metrics might be:
Plot/Story
Acting
Effects, Costumes, Editing, etc.
Script/Dialogue
Humor
Drama/Passion
Suspense
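For concreteness, here is a minimal sketch of this first-attempt additive scale. The metric names follow the list above, but the weights and example scores are invented for illustration:

```python
# Sketch of the simple additive rating scale: sum 1-10 scores per metric,
# highest total wins. Optional weights make some metrics count for more.
# The example movie's scores below are made up, not real data.

METRICS = ["plot", "acting", "production", "script", "humor", "drama", "suspense"]

def total_score(ratings, weights=None):
    """Sum the 1-10 ratings across all metrics, optionally weighted."""
    weights = weights or {m: 1.0 for m in METRICS}
    return sum(weights[m] * ratings[m] for m in METRICS)

movie = {"plot": 8, "acting": 7, "production": 6, "script": 9,
         "humor": 3, "drama": 9, "suspense": 7}
print(total_score(movie))  # plain unweighted sum: 49
```

Swapping in a weights dict (say, doubling "plot") changes the ranking without changing the scores, which is the weighting component mentioned below.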
So we can argue about what the metrics should be and how many we need, but since we’re not worried about justifying our system objectively, we can include whatever criteria we want. We could even add a weighting component so some metrics are worth more than others. My system can even be different from yours. The problem, though, is that in reality movies don’t need to excel at all metrics to be perfect for what they are. Would Schindler’s List be a better movie if they were cracking jokes the whole time? Would 12 Angry Men be better if it had more special effects? And it’s a little weird to evaluate the acting in Up or Toy Story 3. (No offense to voice actors.)
The idea of ranking movies is really about the challenge of comparing things of the same class (movies) but of very different types (comedy, horror, drama, etc.) -- in content, goal, method, and so on. Is it possible to come up with metrics by which to compare anything in the class regardless of type? Assuming you can come up with the metrics you find valuable/relevant, some of them will apply to one type but not another. But you also can’t completely disregard metrics that aren’t common to all types, because you’ve just said you find them valuable/relevant (in this case, to your enjoyment of a movie).
These thoughts led me to the question which I will pose here: How do you evaluate items in a class based on multiple metrics when not all metrics are ALWAYS relevant?
Some brainstorming to try to answer that question (modifying the system proposed above):
Allow “N/A” for a metric and then divide the total points by the total possible based on applicable metrics. But this ignores, for example, humorless movies that could have used some humor.
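A minimal sketch of this N/A-and-renormalize idea, with `None` standing in for “N/A” and the scores invented for illustration:

```python
# Sketch: exclude "N/A" metrics (None), then rescale the remaining total
# so every movie lands on the same 0-10 scale regardless of how many
# metrics applied to it. Scores below are made up for illustration.

def normalized_score(ratings):
    applicable = [v for v in ratings.values() if v is not None]
    # fraction of points earned out of points possible, rescaled to 0-10
    return 10 * sum(applicable) / (10 * len(applicable))

schindlers_list = {"plot": 10, "acting": 10, "humor": None, "suspense": 8}
print(normalized_score(schindlers_list))  # averages only the 3 applicable metrics
```

Note this bakes in exactly the flaw described above: a humorless movie that needed humor and one that didn’t both get a free pass on that metric.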
Ok, so maybe give a movie with no humor a 10/10 in the humor metric IF it was perfect without it, or some other X/10 if it needed some humor. But that seems to inflate the movie’s rating by giving some amount of credit for an attribute that it didn’t actually have.
I briefly considered having flexible weightings assigned subjectively to the metrics for each movie rated. But the whole point of this is to have standard criteria for all movies—not different scales.
Anyway, any ideas? Are there already systems for this sort of thing in different arenas of which I’m not aware? Could you develop a system for this sort of evaluation that could also be used to evaluate businesses, school classes, marketing techniques, or just about anything else?
What is with the persistent bolding of your pet words?
Just thought I would try to make it easier to follow. An alternative would have been to declare my terms, I guess. I haven’t really developed a strategy for that—just thought I’d try this.
Bolding or italicizing each special term the first time it appears in the text and writing it in regular typeface afterwards would probably read better, while still drawing attention to the relevant special concept words. People can keep picking out the word better without the typeface once they’ve been primed by the first mention to assume the word denotes an important concept.
I liked this idea, which carried the added bonus of only taking a few seconds to implement. Better?
Looks good to me now.
Thanks. Do you think the vote downs have to do with the content? Is this not a relevant topic for this forum?
I guess the downvotes might be a combination of the post being a lot more in the idea stage than in a worked-out-solution stage, and of it being about rating movies, which by itself isn’t a very relevant topic.
The general idea of working out preferences using vectors instead of scalars does seem like a forum-relevant topic to me, but your post leaves the details up to the reader: making an actual working implementation, coming up with interesting use cases beyond movies, and figuring out how the vector approach would be a significant improvement over a scalar approach in them. So it’s a bit thin as it stands.
Here are some possible definitions you might consider using.
Class: A concentration of unusually high probability density in Thingspace.
Type: A subclass. An even denser area of thingspace or conceptspace within a cluster of things.
Metric: a scale you use to measure a single trait of something. In humans, that could be height, weight, hair color, etc. In order to be useful, a metric must give you further information about that thing as opposed to other things in its class/type (there must be significantly more variability along that dimension than others, in terms of thingspace).
In regards to the article itself, it highlights the difficulty of projecting a multidimensional space (with the number of dimensions equal to the number of metrics you’re using) and a complex distribution of “goodness” within that space to a single dimension of goodness with minimal complexity and minimal loss of information.
Why did you think that? Have you paid attention to your own experience reading things with bold? I recommend reading Razib Khan and paying attention. I find that his use of bold makes it more difficult to read the whole article, but easy to read just the bold passages, which is usually the right choice.
“How good a movie is” is not a question for which there is a fact of the matter, because expanding out the definition of the word “good” brings in all the complexity of human preference. That’s not unambiguous until you specify a particular human and a particular priming state (or particular weighted combination thereof). Things like costumes and suspense correlate with movies being good, for most humans in most states, but that’s a mere empirical fact, not part of the definition of goodness.
Being ambiguous or involving very complicated considerations doesn’t make a question meaningless. There is still fact of the matter, even if it’s not easy to find and there are multiple reasonable disambiguations that motivate focusing on multiple particular facts.
I tried to acknowledge that the rankings in this case are completely subjective. Maybe it would help to think about it like this. Let’s say instead we have a data set. We’ll simplify to 4 metrics: Plot, Acting, Humor, and Suspense. We’re given data for 3 movies, for each movie a ranking for these 4 metrics, respectively:
Groundhog Day: 9 9 10 5
Terminator: 8 8 6 9
Anchorman: 6 9 10 2
Based on this, what are some ways to evaluate this data? We’re not satisfied that just summing the rankings for each metric comes up with an accurate ranking for the film overall. So how else can we do it?
Empirically determine what formula most closely matches overall impressions in the real world, avoiding over-fitting by penalizing formulas for complexity. “Sum the scores” would simply be P+A+H+S. A weighted sum would be k1P+k2A+k3H+k4S. Perhaps humor and suspense are found to correlate positively with rating when considered individually, but interfere negatively with each other; then we might go with k1P+k2A+k3H+k4S-k5(H*S). Each additional bit of complexity must double the formula’s predictive power (halve your error).
We would start with the data and possible formulas (probably weighted by complexity). We would then plug in the data for each formula, seeing how well each one predicts it. The formula which most efficiently predicts movie ratings based on these dimensions is the one we would use.
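This procedure can be sketched in a few lines. Everything beyond the four metric scores -- the overall-rating targets, the candidate weights, and the complexity penalty -- is invented here purely for illustration:

```python
# Sketch: score candidate formulas by squared prediction error plus a
# crude complexity penalty, then keep the best. Targets and weights are
# hypothetical; a real version would fit weights to much more data.

movies = {  # (Plot, Acting, Humor, Suspense) from the table above
    "Groundhog Day": (9, 9, 10, 5),
    "Terminator":    (8, 8, 6, 9),
    "Anchorman":     (6, 9, 10, 2),
}
targets = {"Groundhog Day": 9.0, "Terminator": 8.0, "Anchorman": 7.0}  # made up

candidates = {
    # name: (formula, number of free parameters as a rough complexity measure)
    "plain sum":    (lambda p, a, h, s: (p + a + h + s) / 4, 0),
    "weighted sum": (lambda p, a, h, s: 0.4*p + 0.3*a + 0.2*h + 0.1*s, 4),
    "interaction":  (lambda p, a, h, s: 0.4*p + 0.3*a + 0.2*h + 0.1*s - 0.01*h*s, 5),
}

def penalized_error(formula, complexity, penalty=0.1):
    squared = sum((formula(*movies[m]) - targets[m]) ** 2 for m in movies)
    return squared + penalty * complexity  # stand-in for a real complexity prior

best = min(candidates, key=lambda name: penalized_error(*candidates[name]))
print(best)
```

The penalty term is doing the "each bit of complexity must earn its keep" work from the comment above, just in a cruder additive form.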
Yes! That helps. My question, then, is what to plug into that formula if a metric SOMETIMES matters.
e.g., 9 9 9 9 isn’t necessarily better than 9 9 9 0.
There are probably some additional questions to think of, but I’m not sure what they are. And I’m not entirely sure this is possible...that’s why I brought it up.
It is entirely possible, and feel free to ask more questions.
I find that it’s helpful to visualize the shape of the space I am operating in, which in this case is a 5-dimensional space (the dimensions are Plot, Acting, Humor, Suspense, and Overall Rating). However, many people find it difficult to visualize more than 3 dimensions, so I will describe only the interaction of Humor and Suspense on Overall Rating.
In this case, let Humor (H) be the east/west direction, Suspense (S) be the north/south direction, and Overall Rating (R) be the altitude. We can now visualize a landscape that corresponds to these variables. Here are some possible landscapes and what we can infer from them:
*Flat, with no slope or features (The audience doesn’t care about either H or S)
*Sloped up as we go northeast (The audience likes humor and suspense together)
*Saddle shaped with the high points to the northwest and southeast (The audience likes H or S independently, but not together)
*Mountainous (The audience has complex tastes).
You would then want to find the equation that best fit this terrain you have. Usually, the best fit is linear (which you would see as a sloped terrain). However, you can find better equations when it isn’t. You do have to be careful not to over-fit: a good rule of thumb is that if it takes more information to approximate your data than is contained in the data itself, you’re doing something wrong.
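As a toy example, the saddle-shaped landscape above could come from a formula like R = -(H-5)(S-5): rating is high when exactly one of H and S is high, and low when both are. The formula and the midpoint of 5 are arbitrary choices for illustration, not a fitted model:

```python
# Toy saddle landscape: high to the "northwest" (low H, high S) and
# "southeast" (high H, low S), low when H and S are high together.
# The constant 5 is just the midpoint of the 1-10 scale.

def saddle_rating(h, s):
    return -(h - 5) * (s - 5)

print(saddle_rating(9, 1))  # humor without suspense: high (16)
print(saddle_rating(9, 9))  # both together: low (-16)
print(saddle_rating(5, 5))  # middling both: the flat center (0)
```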
I tried visualizing but I don’t know how that helps me construct a formula. I would imagine, in your example, the landscape would be mountainous. One movie may have both great suspense and great humor and be a great movie...another may have both great suspense and great humor and be just an okay movie. But then perhaps there is a movie with very low amounts of humor or suspense that is still a good movie for other reasons. So in that case neither of these metrics would be good predictors for that movie.
That’s kind of the core of the issue, as your exercise illustrates. Since in any given case any metric can be a complete non-predictor of the outcome, I don’t know of any way to construct the formula. It seems like you’d have to find some way to both include and exclude metrics based on (something).
So maybe the answer is the N/A thing I considered. Valuing movie metrics is not about quantifying how much of each metric is packed into a film. It is about gauging how well these metrics are used. So maybe you could give Schindler’s List “N/A” in the humor metric and some other largely humorless movie a 2/10 based on the fact that you felt the other movie needed humor and didn’t have much. In that way, it seems all metrics not stated as N/A would have value and you would just need to figure out how to weight them. For instance:
A 9 9 9 9 wouldn’t necessarily score a better total than a 9 9 9 N/A...but it might, if the last category was weighted higher than one/some of the others.
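One way to sketch this proposal, under my own assumptions about the convention: N/A metrics are dropped entirely and the remaining weights are renormalized, so N/A neither helps nor hurts. The weights and scores below are invented for illustration:

```python
# Sketch of the N/A-plus-weights proposal: skip N/A (None) metrics and
# take a weighted average of the rest. Weights here are hypothetical;
# suspense is weighted double to show the effect described above.

def weighted_na_score(ratings, weights):
    used = [m for m, v in ratings.items() if v is not None]
    total_weight = sum(weights[m] for m in used)
    return sum(weights[m] * ratings[m] for m in used) / total_weight  # 0-10 scale

weights = {"plot": 1.0, "acting": 1.0, "humor": 1.0, "suspense": 2.0}

all_nines    = {"plot": 9, "acting": 9, "humor": 9, "suspense": 9}
na_suspense  = {"plot": 9, "acting": 9, "humor": 9, "suspense": None}
low_suspense = {"plot": 9, "acting": 9, "humor": 9, "suspense": 5}

print(weighted_na_score(all_nines, weights))    # 9.0 -- ties with N/A
print(weighted_na_score(na_suspense, weights))  # 9.0 -- N/A simply drops out
print(weighted_na_score(low_suspense, weights)) # 7.4 -- a weak score on the
                                                # heavily weighted metric hurts
```

So under this convention 9 9 9 9 ties 9 9 9 N/A, while 9 9 9 5 falls below both: having the metric and using it badly is penalized, not having it at all is not.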