Applied. Looks good. Might decide it’s not worth it, but you make a good case.
One thing. 0 to 10 ratings are utterly useless. The median is almost always around 7, for almost anything. Please give us calibrated statistics, not subjective pseudo-quantities where most of the contribution is from noise and offset.
Reminds me of business planning types ranking alternatives 1..n and then treating the indexes as utilities. ick. TYPE ERROR.
We’ve actually noticed in our weekly sessions that our nice official-looking yes-we’re-gathering-data rate-from-1-to-5 feedback forms don’t seem to correlate with how much people seem to visibly enjoy the session—mostly the ratings seem pretty constant. (We’re still collecting useful data off the verbal comments.) If anyone knows a standard fix for this then PLEASE LET US KNOW.
I’d suggest measuring the Net Promoter Score (NPS) (link). It’s used in business as a better measure of customer satisfaction than more traditional measures. See here for evidence, sorry for the not-free link.
“On a scale of 0-10, how likely would you be to recommend the minicamp to a friend or colleague?”
“What is the most important reason for your recommendation?”
To interpret, split the responses into 3 groups:
9-10: Promoter—people who will be active advocates.
7-8: Passive—people who are generally positive, but aren’t going to do anything about it.
0-6: Detractor—people who are lukewarm (which will turn others off) or will actively advocate against you.
NPS = [% who are Promoters] - [% who are Detractors]. Good vs. bad NPS varies by context, but +20-30% is generally very good. The follow-up question is a good way to identify key strengths and high-priority areas to improve.
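For concreteness, a minimal sketch of that calculation (the function and the sample scores below are invented for illustration):

```python
def nps(scores):
    """Net Promoter Score, as a percentage, from raw 0-10 ratings."""
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100.0 * (promoters - detractors) / len(scores)

print(nps([10, 9, 9, 8, 7, 7, 6, 5, 10, 3]))  # 4 promoters, 3 detractors -> +10.0
```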
NPS is a really valuable concept. Means and medians are pretty worthless compared to identifying the percentage in each class, and it’s sobering to realize that a 6 is a detractor score.
(Personal anecdote: I went to a movie theater, watched a movie, and near the end, during an intense confrontation between the hero and villain, the film broke. I was patient, but when they sent me an email later asking me the NPS question, I gave it a 6. I mean, it wasn’t that bad. Then two free movie tickets came in the mail, with a plea to try them out again.
I hadn’t realized it, but I had already put that theater in my “never go again” file, since why give them another chance? I then read The Ultimate Question for unrelated reasons, and had that experience in my mind the whole time.)
Good anecdote. It made me realize that I had just 20 minutes ago made a damning non-recommendation to a friend based off of a single bad experience after a handful of good ones.
Here is the evidence paper.
Right, I’d forgotten about that. I concur that it is used; I work in market research, sort of.
Another thing you could do is measure in a more granular way—ask for NPS about particular sessions. You could do this after each session or at the end of each day. This would help you narrow down what sessions are and are not working, and why.
You do have to be careful not to overburden people by asking them for too much detailed feedback too frequently, otherwise they’ll get survey fatigue and the quality of responses will markedly decline. Hence, I would resist the temptation to ask more than 1-2 questions about any particular session. If there are any that are markedly well/poorly received, you can follow up on those later.
One idea (which you might be doing already) is making the people collecting the data DIFFERENT from the people organizing/running the sessions.
For example, if Bob organizes and runs a session, and everyone likes Bob, but thinks that the session was so-so, they may be less willing to write negative things down if they know Bob is the one collecting and analyzing data.
If Bob runs the sessions, then SALLY should come in at the end and say something like “Well, we want to make these better, so I’M gathering information on ways to improve, etc.”
Even if Bob eventually gets the negative information, I think people might be more likely to provide it to Sally (one step removed) than to Bob directly.
(Even better: Nameless Guy organizes the session and Bob teaches it, making sure everyone knows this is NAMELESS’ session and Bob is just the mouthpiece.)
Also, I would say that verbal comments are generally MUCH more useful than Likert scale information anyway. It’s better to be getting good comments and bad Likert scores than vice versa.
Back when I did training for a living, my experience was that those forms were primarily useful for keeping my boss happy. The one question that was sometimes useful was asking people what they enjoyed most and least about the class, and what they would change about it. Even more useful was asking that question of people to their faces. Most useful was testing to determine what they had actually learned, if anything.
I’ve seen “rate from 1 to 5, with 3 excluded”, which should be equivalent to “rate from 1 to 4” but feels substantially different. But there are probably better ones.
In this category of tricks, somebody (I forget who) used a rating scale where you assigned a score of 1, 3, or 9. Which should be equivalent to “rate from 1 to 3”, but...
We weren’t getting a lot of threes, but maybe that works anyway.
Then maybe “1 to 4, excluding 3” or “1 to 5, excluding 4”, to rule out the lazy answer “everything’s basically fine”. That might force people to find an explanation whenever they feel the thing is good but not perfect.
If you start getting 5s too frequently, then it’s probably not a good trick.
Why not go all the way and just use a plus-minus-zero system like LW ratings (and much of the rest of the internet)? Before switching from 5-star ratings to the like/dislike system, YouTube had an interesting chart showing how useless the star ratings were. But that’s non-mandatory, so it’s very different.
You could have a rubric without any numbers, just 10 sentences or so where participants could circle those that apply. E.g. “I learned techniques in this session that I will apply at least once a week in my everyday life”, “Some aspects of this session were kind of boring”, “This session was better presented than a typical college lecture”, etc.
You could try a variant of this (give someone a d10 and a d6, hide the roll from the surveyor; if the d6 comes up 1 they give you a 1-10 rating based on the d10, and otherwise they answer honestly), but this may not be useful in cases where people aren’t deliberately lying to you, and it is probably only worth it if you have enough sample size to wipe out random anomalies and can afford to throw out a sixth of your data.
Or weight the die.
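If someone did run the dice scheme above, the random sixth doesn’t literally have to be thrown out; a rough sketch of backing out the honest distribution, assuming responses work as described (all names and numbers below are invented):

```python
from collections import Counter

def estimate_honest_shares(responses, p_random=1/6, levels=range(1, 11)):
    """Back out the honest rating distribution when a known fraction of
    responses are uniform random noise (the d6-came-up-1 cases)."""
    n = len(responses)
    counts = Counter(responses)
    shares = {}
    for level in levels:
        observed = counts.get(level, 0) / n
        # observed = p_random * uniform + (1 - p_random) * honest
        honest = (observed - p_random / len(levels)) / (1 - p_random)
        shares[level] = max(honest, 0.0)  # clip small negatives from sampling noise
    return shares
```

(With only a handful of respondents the clipping does a lot of work, which is the commenter’s point about needing a large sample.)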
I’m not a pro, but you probably want to turn the data into a z-score (e.g., this class is rated 3 standard deviations above the ratings for other self-help classes). If you can’t turn it into a z-score, the data is probably meaningless.
Also, maybe use some other ranking system. I imagine that people have a mindless cached procedure for doing these rankings that you might want to interrupt, to force actually evaluating it (the rank is a random variable with mean = 7 and stddev = 1).
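A minimal sketch of the z-score suggestion (the reference numbers below are invented; you would need real ratings from comparable classes):

```python
import statistics

other_class_means = [6.8, 7.1, 7.0, 6.9, 7.3, 7.2, 6.7]  # hypothetical comparison classes
this_class_mean = 7.65

mu = statistics.mean(other_class_means)
sigma = statistics.stdev(other_class_means)
z = (this_class_mean - mu) / sigma
print(f"{z:.1f} standard deviations above the comparison classes")  # ~3.0
```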
An anecdote on a related note...
There was once a long-term online survey about patterns of usage of a particular sort of product (specifics intentionally obscured to protect the guilty). One screen asked something like “Which of these have you used in the past year?”, showing 4 products of different brands in random order plus “None of the above”; respondents could select multiple brands. Different respondents answered every week, but the results were pretty consistent from one week to the next. Most respondents selected one brand.
One week, they took away one of the brands. If it were tracking real usage, you’d expect all of the responses for that brand to have shifted over to “None of the above”. Instead, all of a sudden people had used the other 3 brands about 4⁄3 as often as the previous week. It was exactly the result one would expect if practically everyone were answering randomly: a random single pick gives each of 4 brands about a quarter of the responses, and each of 3 brands about a third, i.e. 4⁄3 as many. That pattern kept up for a few weeks. Then the question was changed back, and the usage of all 4 brands went back to ‘normal’.
Some of the effect could be accounted for by a substitution principle; instead of asking oneself for each option whether one’s used it in the last year, it’s easier to ask which of them one recalls using most recently (or just which of them seems most salient to memory), check that, and move on. If people do actually switch between products often enough, this would create that dynamic.
I tried to take that into account when reading.
treating the indexes as utilities
Please explain.
I know, I did too, but that is really the sort of calculation that should be done by a large-scale study that documents a control distribution for 0-10 ratings that such ratings can be calibrated against.
In my engineering school, we had some project planning classes where we would attempt to calculate which design was best based on the strength of our preference for performance on a variety of criteria (aesthetics, weight, strength, cost, etc.). Looking back, I recognize what we were doing as coming up with a utility function to compute the utilities of the different designs.
Unfortunately, none of us (including the people who had designed the procedure) knew anything about utility functions or decision theory, so they would do things like rank the different criteria, and the strength of each design in each criterion, and then use those ranks directly as utility weights and partial utilities.
(So, for example, strength might be most important (10), then cost (9), then weight (8), and so on; and then maybe design A would be best (10) in weight, worst (1) in strength, etc.)
I didn’t know any decision theory or anything, but I have a strong sense for noticing errors in mathematical models, and this thing set off alarm bells like crazy. We should have been giving a lot of thought to calibrating our weights and utilities to make sure the arbitrariness of the rankings couldn’t sneak through and change the answer, but no one gave a shit. I raised a fuss and tried to rederive the whole thing from first principles. I don’t think I got anywhere, though; it was only one assignment, so I might have given up because of low value (it’s a hard problem). Don’t remember.
Moral:
With this sort of thing, or anything really, you either use bulletproof mathematical models derived from first principles (or empirically) with calibrated real quantities, or you wing it intuitively using your built-in hardware. You do not use “math” on uncalibrated pseudo-quantities; that just tricks you into overriding your intuition for something with no correct basis.
This is why you never use explicit probabilities that aren’t either empirically determined or calculated theoretically.
Despite anti-arbitrariness intuitions, there is empirical evidence that this is wrong.
The Robust Beauty of Improper Linear Models
Proper linear models are those in which predictor variables are given weights in such a way that the resulting linear composite optimally predicts some criterion of interest; examples of proper linear models are standard regression analysis, discriminant function analysis, and ridge regression analysis. Research summarized in Paul Meehl’s book on clinical versus statistical prediction—and a plethora of research stimulated in part by that book—all indicates that when a numerical criterion variable (e.g., graduate grade point average) is to be predicted from numerical predictor variables, proper linear models outperform clinical intuition. Improper linear models are those in which the weights of the predictor variables are obtained by some nonoptimal method; for example, they may be obtained on the basis of intuition, derived from simulating a clinical judge’s predictions, or set to be equal. This article presents evidence that even such improper linear models are superior to clinical intuition when predicting a numerical criterion from numerical predictors. In fact, unit (i.e., equal) weighting is quite robust for making such predictions. The article discusses, in some detail, the application of unit weights to decide what bullet the Denver Police Department should use. Finally, the article considers commonly raised technical, psychological, and ethical resistances to using linear models to make important social decisions and presents arguments that could weaken these resistances.
(this is about something somewhat less arbitrary than using ranks as scores, but it seems like evidence in favor of that approach as well)
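To make the unit-weighting idea concrete, a toy sketch (the data and names are invented, not from Dawes): standardize each predictor, add them with equal weights, and rank by the composite.

```python
import statistics

def standardize(xs):
    mu, sd = statistics.mean(xs), statistics.stdev(xs)
    return [(x - mu) / sd for x in xs]

gre = [310, 325, 300, 330, 315]      # hypothetical predictor 1
gpa = [3.2, 3.8, 3.0, 3.9, 3.5]      # hypothetical predictor 2
composite = [g + p for g, p in zip(standardize(gre), standardize(gpa))]
ranking = sorted(range(len(composite)), key=lambda i: -composite[i])
print(ranking)  # candidates ordered best-first by the unit-weighted composite
```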
Dawes is not a reliable researcher; I have very little confidence in his studies. Check it.
(ETA: I also have other reasons to mistrust Dawes, but shouldn’t go into those here. In general you just shouldn’t trust heuristics and biases results any more than you should trust parapsychology results. (Actually, parapsychology results tend to be significantly better supported.) Almost all psychology is diseased science; the hypotheses are often interesting, the statistical evidence given for them is often anti-informative.)
Multicriteria objective functions are really hard to get right. Weighting features from 10 to 1 is actually a decent first approach (it should separate good solutions from bad solutions), but if you’re down to narrow differences of the weighted objective function, it’s typically time to hand off to a human decision-maker, or to spend a lot of time considering tradeoffs to elicit the weights. (Thankfully, a first pass should show you which features you need to value carefully and which features you can ignore.)
If you have relatively few choices and the properties are correlated (as of course they are), I’m not sure how much it matters. I did a simulation of this for embryo selection with n=10, and partially randomizing the utility weights made little difference.
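A rough sketch of that kind of simulation (not the original; everything below is invented for illustration: 10 candidates, 4 positively correlated properties, weights jittered by up to ±50%):

```python
import random

def best(candidates, weights):
    return max(range(len(candidates)),
               key=lambda i: sum(w * x for w, x in zip(weights, candidates[i])))

random.seed(0)
trials, agree = 2000, 0
for _ in range(trials):
    # correlated properties: shared factor plus independent noise
    candidates = []
    for _ in range(10):
        shared = random.gauss(0, 1)
        candidates.append([shared + random.gauss(0, 1) for _ in range(4)])
    base = [1.0, 0.8, 0.6, 0.4]
    jittered = [w * random.uniform(0.5, 1.5) for w in base]
    agree += best(candidates, base) == best(candidates, jittered)
print(agree / trials)  # usually far above the 0.1 you'd get from unrelated rankings
```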
I’m not sure I understand what you mean by pseudo-quantities.
So the problem is that these attributes were given rankings from 10 down to 1, rather than their weights that corresponded to their actual importance?
Right, that can cause this problem. (Not quite the same dynamic, but you get the idea.)
“pseudo-quantity” is a term I just made up for things that look like quantities (they may even have units), but are fake in some way. Unlike real quantities, for which correct math is always valid, you cannot use math on pseudo-quantities without calibration (which is not always possible).
Example: uncalibrated probability ratings (“I’m 95% sure”) are not probabilities, and you cannot use them in probability calculations, even though they seem to be numbers with the right units. You can turn them into real probabilities by doing calibration (assuming they correlate well enough).
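A minimal sketch of what doing that calibration could look like (the records below are invented): bucket past claims by stated confidence and see how often they actually turned out right.

```python
from collections import defaultdict

past_claims = [(0.95, True), (0.95, True), (0.95, False), (0.95, True),
               (0.70, True), (0.70, False), (0.70, True), (0.70, False)]

buckets = defaultdict(list)
for stated, was_right in past_claims:
    buckets[stated].append(was_right)

calibration = {stated: sum(hits) / len(hits) for stated, hits in buckets.items()}
print(calibration)  # e.g. {0.95: 0.75, 0.7: 0.5} -- the numbers to actually use
```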
More or less. Other ranking systems could be calibrated to get actual utility coefficients, but rank indexes lose information and cannot even be calibrated.
Probabilities can be empirically wrong, sure, but I find it weird to say that they’re “not probabilities” until they’re calibrated. If you imagine 20 scenarios in this class, and your brain says “I expect to be wrong in one of those”, that just is a probability straight up.
(This may come down to frequency vs belief interpretations of probability, but I think saying that beliefs aren’t probabilistic at all needs defending separately.)
So the pseudo-quantities in your example are strength ratings on a 1-10 scale?
I actually think that’s acceptable, assuming the ratings on the scale are equally spaced, and the weights correspond to the spacing. For instance, space strengths out from 1 to 10 evenly, space weights out from 1 to 10 evenly (where 10 is the best, i.e., lightest), where each interval corresponds to roughly the same level of improvement in the prototype. Then assign weights to go along with how important an improvement is along one axis compared to the other. For instance, if improving strength one point on the scale is twice as valuable as improving weight, we can give strength a weight of 2, and computations like:
Option A, strength 3, weight 6, total score 2(3) + 6 = 12
Option B, strength 5, weight 3, total score 2(5) + 3 = 13
make sense.
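The same toy calculation written out as a weighted sum, using exactly the numbers above (nothing new is being claimed):

```python
importance = {"strength": 2, "weight": 1}   # improving strength is twice as valuable
options = {"A": {"strength": 3, "weight": 6},
           "B": {"strength": 5, "weight": 3}}

totals = {name: sum(importance[k] * v for k, v in attrs.items())
          for name, attrs in options.items()}
print(totals)  # {'A': 12, 'B': 13}
```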
You still have one degree of freedom. What if you ranked from 10-20? Or −5 to 5? As a limiting case, consider rankings of 100-110: the thing with the highest preference (strength) would totally swamp the calculation, becoming the only concern.
Once you have scale and offset correctly calibrated, you still need to worry about nonlinearity. In this case (using rank indexes), the problem is even worse. Like I said, rank indexes lose information. What if the options are all about the same weight except for one that is drastically lighter? The rankings are identical no matter how much difference there is. That’s not right. Using something approximating a real-valued rating (rate from 1-10) instead of rank indices reduces the problem to mere nonlinearity.
This is not as hard as FAI, but it’s harder than pulling random numbers out of your butt, multiplying them, and calling it a decision procedure.
I agree that ranking the weights from 1 to N is idiotic because it doesn’t respect the relative importance of each characteristic. However, changing the ratings to 101-110 on every scale will just add a constant to each option’s value:
Option A, strength 103, mass 106, total score 2(103) + 106 = 312
Option B, strength 105, mass 103, total score 2(105) + 103 = 313
(I changed ‘weight’ to ‘mass’ to avoid confusion with the other meaning of ‘weight’.)
I assume you mean using values for the weights that correspond to importance, which isn’t necessarily 1-10. For instance, if strength is 100 times more important than mass, we’d need to have weights of 100 and 1.
You’re right that this assumes that the final quality is a linear function of the component attributes: we could have a situation where strength becomes less important when mass passes a certain threshold, for instance. But using a linear approximation is often a good first step at the very least.
Remember that whenever you want a * for multiplying numbers together, you need to write \*.
Oops, I might have to look at that more closely. I think you are right. The shared offset cancels out.
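Spelling out the cancellation with the same toy numbers: adding a constant offset to every rating shifts every option’s total by the same amount (the offset times the sum of the importance weights), so the ordering is unchanged. A quick sketch:

```python
importance = {"strength": 2, "mass": 1}
options = {"A": {"strength": 3, "mass": 6},
           "B": {"strength": 5, "mass": 3}}

def total(attrs, offset=0):
    return sum(importance[k] * (v + offset) for k, v in attrs.items())

print({n: total(a) for n, a in options.items()})        # {'A': 12, 'B': 13}
print({n: total(a, 100) for n, a in options.items()})   # {'A': 312, 'B': 313}
```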
Using 100 and 1 for something that is 100 times more important is correct (assuming you are able to estimate the weights; 100x is awfully suspicious). The idiotic procedures were using rank indices, not real-valued weights.
Agreed. Linearity is a valid assumption.
The error is using uncalibrated ratings from 0-10, or worse, rank indices. A linear-valued rating from 0-10 has the potential to carry the information properly, but that does not mean people can produce calibrated estimates there.
This is a very good general point, one that I natively seem to grasp, but even so I’d appreciate it if you wrote a top-level post about it.