treating the indexes as utilities
Please explain.
In my engineering school, we had some project planning classes where we would attempt to calculate which design was best based on the strength of our preference for performance on a variety of criteria (aesthetics, weight, strength, cost, etc.). Looking back, I recognize what we were doing as coming up with a utility function to compute the utilities of the different designs.
Unfortunately, none of us (including the people who had designed the procedure) knew anything about utility functions or decision theory, so they would do things like rank the different criteria, and the strength of each design on each criterion, and then use those ranks directly as utility weights and partial utilities.
(So, for example, strength might be most important (10), then cost (9), then weight (8), and so on; and then maybe design A would be best (10) in weight, worst (1) in strength, etc.)
I didn't know any decision theory or anything, but I have a strong sense for noticing errors in mathematical models, and this thing set off alarm bells like crazy. We should have been giving a lot of thought to calibrating our weights and utilities to make sure the arbitrariness of the rankings couldn't sneak through and change the answer, but no one gave a shit. I raised a fuss and tried to rederive the whole thing from first principles. I don't think I got anywhere, though; it was only one assignment, so I might have given up because of the low value (it's a hard problem). I don't remember.
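To make the procedure concrete, here is a minimal sketch of that scoring rule in Python. The criteria, ranks, and designs are hypothetical illustrations, not the actual assignment:

```python
# Sketch of the rank-index scoring procedure described above.
# All criteria, ranks, and designs are hypothetical.

# Criteria ranked by importance; the rank index is used directly as the weight.
criterion_weight = {"strength": 10, "cost": 9, "weight": 8, "aesthetics": 7}

# Each design ranked on each criterion; the rank index is used directly as the partial utility.
design_ratings = {
    "A": {"strength": 1, "cost": 7, "weight": 10, "aesthetics": 5},
    "B": {"strength": 10, "cost": 3, "weight": 2, "aesthetics": 8},
}

def total_score(ratings):
    """Weighted sum of rank indices -- the 'utility' this procedure produces."""
    return sum(criterion_weight[c] * ratings[c] for c in criterion_weight)

for name, ratings in design_ratings.items():
    print(name, total_score(ratings))
```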
Moral:
With this sort of thing, or anything really, you either use bulletproof mathematical models derived from first principles (or empirically) with calibrated real quantities, or you wing it intuitively using your built-in hardware. You do not use “math” on uncalibrated pseudo-quantities; that just tricks you into overriding your intuition for something with no correct basis.
This is why you never use explicit probabilities that aren’t either empirically determined or calculated theoretically.
With this sort of thing, or anything really, you either use bulletproof mathematical models derived from first principles (or empirically) with calibrated real quantities, or you wing it intuitively using your built-in hardware. You do not use “math” on uncalibrated pseudo-quantities; that just tricks you into overriding your intuition for something with no correct basis.
Despite anti-arbitrariness intuitions, there is empirical evidence that this is wrong.
The Robust Beauty of Improper Linear Models:
Proper linear models are those in which predictor variables are given weights in such a way that the resulting linear composite optimally predicts some criterion of interest; examples of proper linear models are standard regression analysis, discriminant function analysis, and ridge regression analysis. Research summarized in Paul Meehl’s book on clinical versus statistical prediction—and a plethora of research stimulated in part by that book—all indicates that when a numerical criterion variable (e.g., graduate grade point average) is to be predicted from numerical predictor variables, proper linear models outperform clinical intuition. Improper linear models are those in which the weights of the predictor variables are obtained by some nonoptimal method; for example, they may be obtained on the basis of intuition, derived from simulating a clinical judge’s predictions, or set to be equal. This article presents evidence that even such improper linear models are superior to clinical intuition when predicting a numerical criterion from numerical predictors. In fact, unit (i.e., equal) weighting is quite robust for making such predictions. The article discusses, in some detail, the application of unit weights to decide what bullet the Denver Police Department should use. Finally, the article considers commonly raised technical, psychological, and ethical resistances to using linear models to make important social decisions and presents arguments that could weaken these resistances.
(this is about something somewhat less arbitrary than using ranks as scores, but it seems like evidence in favor of that approach as well)
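A minimal sketch of the comparison the abstract is about, on synthetic data (the variables and numbers are invented for illustration, and how close the two models come depends entirely on the setup): an "improper" unit-weight model is scored against a fitted regression on held-out data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 3 standardized predictors, criterion is a noisy linear combination of them.
n, k = 200, 3
X = rng.standard_normal((n, k))
y = X @ np.array([0.6, 0.3, 0.1]) + rng.standard_normal(n)

train, test = slice(0, 100), slice(100, 200)

# Proper linear model: weights fit by least squares on the training half.
w_fit, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)

# Improper linear model: unit (equal) weights on the standardized predictors.
w_unit = np.ones(k)

def test_corr(w):
    """Correlation between the linear composite and the criterion on held-out data."""
    return np.corrcoef(X[test] @ w, y[test])[0, 1]

print("fitted weights:", test_corr(w_fit))
print("unit weights:  ", test_corr(w_unit))
```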
Dawes is not a reliable researcher; I have very little confidence in his studies. Check it.
(ETA: I also have other reasons to mistrust Dawes, but shouldn’t go into those here. In general you just shouldn’t trust heuristics and biases results any more than you should trust parapsychology results. (Actually, parapsychology results tend to be significantly better supported.) Almost all psychology is diseased science; the hypotheses are often interesting, the statistical evidence given for them is often anti-informative.)
Multicriteria objective functions are really hard to get right. Weighting features from 10 to 1 is actually a decent first approach (it should separate good solutions from bad ones), but if you're down to narrow differences in the weighted objective function, it's typically time to hand off to a human decision-maker, or to spend a lot of time considering tradeoffs to elicit the weights. (Thankfully, a first pass should show you which features you need to value carefully and which you can ignore.)
If you have relatively few choices and the properties are correlated (as of course they are), I'm not sure how much it matters. I did a simulation of this for embryo selection with n=10, and partially randomizing the utility weights made little difference.
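Not that simulation, but a minimal sketch of the general claim (all distributions and numbers here are made up): with a handful of options whose properties are positively correlated, perturbing the weights only rarely changes which option comes out on top.

```python
import numpy as np

rng = np.random.default_rng(1)

def top_pick_stability(n_options=10, n_props=5, corr=0.5, trials=1000):
    """Fraction of random weight perturbations that leave the top-ranked option unchanged."""
    # Draw positively correlated properties for each option.
    cov = np.full((n_props, n_props), corr) + (1 - corr) * np.eye(n_props)
    props = rng.multivariate_normal(np.zeros(n_props), cov, size=n_options)

    base_weights = np.ones(n_props)
    base_top = np.argmax(props @ base_weights)

    unchanged = 0
    for _ in range(trials):
        # Partially randomized weights: jitter each weight by up to +/- 50%.
        weights = base_weights + rng.uniform(-0.5, 0.5, size=n_props)
        if np.argmax(props @ weights) == base_top:
            unchanged += 1
    return unchanged / trials

print(top_pick_stability())
```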
I'm not sure I understand what you mean by pseudo-quantities.
"Pseudo-quantity" is a term I just made up for things that look like quantities (they may even have units), but are fake in some way. Unlike real quantities, for which correct math is always valid, you cannot use math on pseudo-quantities without calibration (which is not always possible).
Example: uncalibrated probability ratings ("I'm 95% sure") are not probabilities, and you cannot use them in probability calculations, even though they seem to be numbers with the right units. You can turn them into real probabilities by doing calibration (assuming they correlate well enough).
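A minimal sketch of the calibration step that implies (the track record here is invented): bucket your past statements by stated confidence, and replace the raw rating with the observed frequency of being right.

```python
from collections import defaultdict

# Invented track record of (stated confidence, whether the claim turned out true).
history = [(0.95, True), (0.95, False), (0.95, True), (0.95, True),
           (0.80, True), (0.80, False), (0.80, True), (0.80, False)]

# Group past statements by the confidence that was stated.
outcomes = defaultdict(list)
for stated, was_right in history:
    outcomes[stated].append(was_right)

# Calibration map: stated confidence -> observed frequency of being right.
calibration = {p: sum(results) / len(results) for p, results in outcomes.items()}

print(calibration[0.95])  # 0.75: the "95% sure" statements were right 3 times out of 4
```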
So the problem is that these attributes were given rankings from 10 down to 1, rather than weights that corresponded to their actual importance?
Right, that can cause this problem. (Not quite the same dynamic, but you get the idea.)
More or less. Other ranking systems could be calibrated to get actual utility coefficients, but rank indexes lose information and cannot even be calibrated.
Probabilities can be empirically wrong, sure, but I find it weird to say that they’re “not probabilities” until they’re calibrated. If you imagine 20 scenarios in this class, and your brain says “I expect to be wrong in one of those”, that just is a probability straight up.
(This may come down to frequency vs belief interpretations of probability, but I think saying that beliefs aren’t probabilistic at all needs defending separately.)
So the pseudo-quantities in your example are strength ratings on a 1-10 scale?
I actually think that’s acceptable, assuming the ratings on the scale are equally spaced, and the weights correspond to the spacing. For instance, space strengths out from 1 to 10 evenly, space weights out from 1 to 10 evenly (where 10 is the best, i.e., lightest), where each interval corresponds to roughly the same level of improvement in the prototype. Then assign weights to go along with how important an improvement is along one axis compared to the other. For instance, if improving strength one point on the scale is twice as valuable as improving weight, we can give strength a weight of 2, and computations like:
Option A, strength 3, weight 6, total score 2(3) + 6 = 12
Option B, strength 5, weight 3, total score 2(5) + 3 = 13
make sense.
You still have one degree of freedom. What if you ranked from 10 to 20, or from −5 to 5? As a limiting case, consider rankings of 100-110: the thing with the highest preference (strength) would totally swamp the calculation, becoming the only concern.
Once you have scale and offset correctly calibrated, you still need to worry about nonlinearity. In this case (using rank indexes), the problem is even worse. Like I said, rank indexes lose information. What if the designs are all about the same weight except for one that is drastically lighter? The rankings are identical no matter how large the difference is. That's not right. Using something approximating a real-valued ranking (rank from 1-10) instead of rank indices reduces the problem to mere nonlinearity.
This is not as hard as FAI, but it’s harder than pulling random numbers out of your butt, multiplying them, and calling it a decision procedure.
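A minimal illustration of the information-loss point above (the masses are invented): rank indices can't distinguish "slightly lighter" from "drastically lighter".

```python
# Hypothetical masses (kg) of three designs: A and B nearly identical, C drastically lighter.
masses = {"A": 10.0, "B": 9.9, "C": 2.0}

# Rank indices: 1 for the heaviest design up to 3 for the lightest.
heaviest_first = sorted(masses, key=masses.get, reverse=True)
rank_index = {name: i + 1 for i, name in enumerate(heaviest_first)}
print(rank_index)  # {'A': 1, 'B': 2, 'C': 3}

# The ranks would be exactly the same if C were only marginally lighter than B,
# so any score built on rank indices cannot reflect the size of the gap.
```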
I agree that ranking the weights from 1 to N is idiotic because it doesn't respect the relative importance of each characteristic. However, changing the ratings to run from 101 to 110 on every scale will just add a constant to each option's value:
Option A, strength 103, mass 106, total score 2(103) + 106 = 312
Option B, strength 105, mass 103, total score 2(105) + 103 = 313
(I changed 'weight' to 'mass' to avoid confusion with the other meaning of 'weight'.)
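A minimal check of that point, using the same hypothetical numbers: adding a constant offset to every rating shifts every option's total by the same amount, so which option wins is unchanged.

```python
# Criterion weights (strength counts double) and ratings for two options.
weights = {"strength": 2, "mass": 1}
options = {"A": {"strength": 3, "mass": 6}, "B": {"strength": 5, "mass": 3}}

def score(ratings, offset=0):
    """Weighted sum of ratings, with an optional constant added to every rating."""
    return sum(w * (ratings[c] + offset) for c, w in weights.items())

for offset in (0, 100):
    totals = {name: score(r, offset) for name, r in options.items()}
    print(offset, totals, "B minus A:", totals["B"] - totals["A"])
# Offsets 0 and 100 give totals (12, 13) and (312, 313): the gap is +1 either way.
```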
Using something approximating a real-valued ranking (rank from 1-10) instead of rank indices reduces the problem to mere nonlinearity.
I assume you mean using values for the weights that correspond to importance, which isn’t necessarily 1-10. For instance, if strength is 100 times more important than mass, we’d need to have weights of 100 and 1.
You’re right that this assumes that the final quality is a linear function of the component attributes: we could have a situation where strength becomes less important when mass passes a certain threshold, for instance. But using a linear approximation is often a good first step at the very least.
Option A, strength 103, mass 106, total score 2(103) + 106 = 312
Option B, strength 105, mass 103, total score 2(105) + 103 = 313
Oops, I might have to look at that more closely. I think you are right. The shared offset cancels out.
I assume you mean using values for the weights that correspond to importance, which isn’t necessarily 1-10. For instance, if strength is 100 times more important than mass, we’d need to have weights of 100 and 1.
Using 100 and 1 for something that is 100 times more important is correct (assuming you are able to estimate the weights; 100x is awfully suspicious). The idiot procedure was using rank indices, not real-valued weights.
But using a linear approximation is often a good first step at the very least.
Agreed. Linearity is a valid assumption.
The error is using uncalibrated ratings from 0 to 10, or worse, rank indices. A linear-valued rating from 0 to 10 has the potential to carry the information properly, but that does not mean people can produce calibrated estimates on it.
With this sort of thing, or anything really, you either use bulletproof mathematical models derived from first principles (or empirically) with calibrated real quantities, or you wing it intuitively using your built-in hardware. You do not use “math” on uncalibrated pseudo-quantities; that just tricks you into overriding your intuition for something with no correct basis.
This is why you never use explicit probabilities that aren’t either empirically determined or calculated theoretically.
This is a very good general point, one that I natively seem to grasp, but even so I’d appreciate it if you wrote a top-level post about it.
I tried to take that into account when reading.
I know, I did too, but that is really the sort of calculation that should be done by a large-scale study that documents a control distribution for 0-10 ratings, against which such ratings can be calibrated.