Right. I think I agree with everything you wrote here, but here it is again in my own words:
In communicating with people, the goal isn’t to ask a hypothetically “best” question and wonder why people don’t understand or don’t respond in the “correct” way. The goal is to be understood and to share information and acquire consensus or agree on some negotiation or otherwise accomplish some task.
This means that in real communication with real people, you often need to ask different questions to different people to arrive at the same information, or phrase some statement differently for it to be understood. There shouldn’t be any surprise or paradox here. When I am discussing an engineering problem with engineers, I phrase it in the terminology that engineers will understand. When I need to communicate that same problem to upper management, I do not use the same terminology that I use with my engineers.
Likewise, there’s a difference when I’m communicating with some engineering intern or new grad right out of college, vs a senior engineer with a decade of experience. I tailor my speech for my audience.
In particular, if I asked Kenoubi this question (“what’s the worst case for how long this thesis could take you?”), and Kenoubi replied “It never finishes”, then I would immediately follow up with the question, “Ok, considering cases when it does finish, what does the worst case look like?” And if that got the reply “the day before it is required to be due”, I would then start poking at “What would cause that to occur?”.
The reason why I start with the first question is because it works for, I don’t know, 95% of people I’ve ever interacted with in my life? In my mind, it’s rational to start with a question that almost always elicits the information I care about, even if there’s some small subset of the population that will force me to choose my words as if they’re being interpreted by a Monkey’s paw.
First, consider the question of, “are these predictions totally useless?” This is an important question because I stand by my claim that the answer of “never” is actually totally useless due to how trivial it is.
Despite the optimistic bias, respondents’ best estimates were by no means devoid of information: The predicted completion times were highly correlated with actual completion times (r = .77, p < .001). Compared with others in the sample, respondents who predicted that they would take more time to finish actually did take more time. Predictions can be informative even in the presence of a marked prediction bias.
...
Respondents’ optimistic and pessimistic predictions were both strongly correlated with their actual completion times (rs = .73 and .72, respectively; ps < .01).
Yep. Matches my experience.
We know that only 11% of students met their optimistic targets, and only 30% of students met their “best guess” targets. What about the pessimistic target? It turns out, 50% of the students did finish by that target. That’s not just a quirk, because it’s actually related to the distribution itself.
However, the distribution of difference scores from the best-guess predictions were markedly skewed, with a long tail on the optimistic side of zero, a cluster of scores within 5 or 10 days of zero, and virtually no scores on the pessimistic side of zero. In contrast, the differences from the worst-case predictions were noticeably more symmetric around zero, with the number of markedly pessimistic predictions balancing the number of extremely optimistic predictions.
In other words, asking people for a best guess or an optimistic prediction results in a biased prediction that is almost always earlier than the real delivery date. On the other hand, while the answer to the pessimistic question is not more accurate (it has the same absolute error margins), it is unbiased. The study says that people asked the pessimistic question were as likely to over-estimate their completion time as they were to under-estimate it. If you don’t think a question that gives you a distribution centered on the right answer is useful, I’m not sure what to tell you.
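To make the "unbiased but not more accurate" distinction concrete, here is a minimal sketch. The numbers are invented purely for illustration and are not taken from the paper; the function names are my own. Bias is the mean signed error, and accuracy is the mean absolute error.

```python
from statistics import mean

# Hypothetical data, made up for illustration only (not from the study).
actual      = [30, 45, 60, 80]   # days each project really took
best_guess  = [20, 35, 50, 70]   # every guess is 10 days too early
pessimistic = [40, 35, 70, 70]   # also off by 10 days, but in both directions

def signed_errors(predicted, actual):
    """Predicted minus actual; negative means the prediction was too early."""
    return [p - a for p, a in zip(predicted, actual)]

def bias(predicted, actual):
    """Mean signed error: zero means over- and under-estimates cancel out."""
    return mean(signed_errors(predicted, actual))

def mean_abs_error(predicted, actual):
    """Mean absolute error: how far off a typical prediction is, ignoring direction."""
    return mean(abs(e) for e in signed_errors(predicted, actual))

for label, predicted in [("best guess", best_guess), ("pessimistic", pessimistic)]:
    print(label, "bias:", bias(predicted, actual),
          "abs error:", mean_abs_error(predicted, actual))
```

In this toy data both sets of predictions miss by 10 days on average, but only the best guesses miss in a consistent direction. That is the pattern the paper reports for the pessimistic question, just with cleaner numbers.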
The paper actually did a number of experiments. That was just the first.
In the third experiment, the study tried to understand what people are thinking about when estimating.
Proportionally more responses concerned future scenarios (M = .74) than relevant past experiences (M = .07), t(66) = 13.80, p < .001. Furthermore, a much higher proportion of subjects’ thoughts involved planning for a project and imagining its likely progress (M = .71) rather than considering potential impediments (M = .03), t(66) = 18.03, p < .001.
This seems relevant considering that the idea of premortems or “worst case” questioning is to elicit impediments, and the project managers / engineering leads doing that questioning are intending to hear about impediments and will continue their questioning until they’ve been satisfied that the group is actually discussing that.
In the fourth experiment, the study tries to understand why people don’t think about their past experiences. They discovered that just prompting people to consider past experiences was insufficient; people needed additional prompting to make their past experience “relevant” to their current task.
Subsequent comparisons revealed that subjects in the recall-relevant condition predicted they would finish the assignment later than subjects in either the recall condition, t(79) = 1.99, p < .05, or the control condition, t(80) = 2.14, p < .04, which did not differ significantly from each other, t(81) < 1.
...
Further analyses were performed on the difference between subjects’ predicted and actual completion times. Subjects underestimated their completion times significantly in the control (M = −1.3 days), t(40) = 3.03, p < .01, and recall conditions (M = −1.0 day), t(41) = 2.10, p < .05, but not in the recall-relevant condition (M = −0.1 days), t(39) < 1. Moreover, a higher percentage of subjects finished the assignments in the predicted time in the recall-relevant condition (60.0%) than in the recall and control conditions (38.1% and 29.3%, respectively), χ²(2, N = 123) = 7.63, p < .01. The latter two conditions did not differ significantly from each other.
...
The absence of an effect in the recall condition is rather remarkable. In this condition, subjects first described their past performance with projects similar to the computer assignment and acknowledged that they typically finish only 1 day before deadlines. Following a suggestion to “keep in mind previous experiences with assignments,” they then predicted when they would finish the computer assignment. Despite this seemingly powerful manipulation, subjects continued to make overly optimistic forecasts. Apparently, subjects were able to acknowledge their past experiences but disassociate those episodes from their present predictions. In contrast, the impact of the recall-relevant procedure was sufficiently robust to eliminate the optimistic bias in both deadline conditions
How does this compare to the first experiment?
Interestingly, although the completion estimates were less biased in the recall-relevant condition than in the other conditions, they were not more strongly correlated with actual completion times, nor was the absolute prediction error any smaller. The optimistic bias was eliminated in the recall-relevant condition because subjects’ predictions were as likely to be too long as they were to be too short. The effects of this manipulation mirror those obtained with the instruction to provide pessimistic predictions in the first study: When students predicted the completion date for their honor’s thesis on the assumption that “everything went as poorly as it possibly could” they produced unbiased but no more accurate predictions than when they made their “best guesses.”
It’s common in engineering to perform group estimates. Does the study look at that? Yep, the fifth and last experiment asks individuals to estimate the performance of others.
As hypothesized, observers seemed more attuned to the actors’ base rates than did the actors themselves. Observers spontaneously used the past as a basis for predicting actors’ task completion times and produced estimates that were later than both the actors’ estimates and their completion times.
So observers are more pessimistic. Actually, observers are so pessimistic that you have to average their estimates with the actors’ optimistic estimates to get an unbiased estimate.
One of the most consistent findings throughout our investigation was that manipulations that reduced the directional (optimistic) bias in completion estimates were ineffective in increasing absolute accuracy. This implies that our manipulations did not give subjects any greater insight into the particular predictions they were making, nor did they cause all subjects to become more pessimistic (see Footnote 2), but instead caused enough subjects to become overly pessimistic to counterbalance the subjects who remained overly optimistic. It remains for future research to identify those factors that lead people to make more accurate, as well as unbiased, predictions. In the real world, absolute accuracy is sometimes not as important as (a) the proportion of times that the task is completed by the “best-guess” date and (b) the proportion of dramatically optimistic, and therefore memorable, prediction failures. By both of these criteria, factors that decrease the optimistic bias “improve” the quality of intuitive prediction.
At the end of the day, there are certain things that are known about scheduling / prediction.
In general, individuals are as wrong as they are right for any given estimate.
In general, people are overly optimistic.
But estimates generally correlate well with actual duration: if an individual estimates that one task will take longer than another, it most likely will! This is why software estimation is sometimes not done in units of time at all, but in a relative unit called “points”.
The larger and more nebulously scoped the task, the worse any estimates will be in absolute error.
The length of time a task can take follows a distribution with a very long right tail: a task that takes way longer than expected can take an arbitrary amount of time, but the fastest time to complete a task is limited.
The best way to actually schedule or predict a project is to break it down into as many small component tasks as possible, identify dependencies between those tasks, produce most likely, optimistic, and pessimistic estimates for each task, and then run a simulation over the chain of dependencies to see what the expected project completion looks like. Use a Gantt chart. This is a boring answer because it’s the “learn project management” answer, and people will hate on it because [gestures vaguely at all of the projects that overrun their schedule]. There are many interesting reasons why that happens and why I don’t think it’s a massive failure of rationality, but I’m not sure this comment is a good place to go into detail on that. The quick answer is that comical overrun of a schedule has less to do with an inability to create correct schedules from an engineering / evidence-based perspective, and much more to do with a bureaucratic or organizational refusal to accept an evidence-based schedule when a totally false but politically palatable “optimistic” schedule is preferred.
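For anyone who wants to see what "run a simulation over the chain of dependencies" can look like, here is a rough sketch. Everything in it is an assumption on my part: the task names and day counts are hypothetical, and I picked a triangular distribution over the (optimistic, most likely, pessimistic) estimates because it is simple, not because it is the canonical choice (PERT-style tools often use a beta distribution instead).

```python
import random
from statistics import median

# Hypothetical tasks: (optimistic, most likely, pessimistic) in days, plus dependencies.
# Tasks are listed so that each one appears after everything it depends on.
TASKS = {
    "design":    ((2, 4, 10), []),
    "backend":   ((5, 8, 20), ["design"]),
    "frontend":  ((4, 6, 15), ["design"]),
    "integrate": ((1, 3, 9),  ["backend", "frontend"]),
}

def simulate_once(tasks, rng):
    """Sample one duration per task and return the project finish time in days."""
    finish = {}
    for name, ((low, mode, high), deps) in tasks.items():
        # A task starts once all of its dependencies have finished.
        start = max((finish[d] for d in deps), default=0.0)
        finish[name] = start + rng.triangular(low, high, mode)
    return max(finish.values())

def simulate(tasks, runs=10_000, seed=0):
    """Return the sorted finish times from many independent simulated projects."""
    rng = random.Random(seed)
    return sorted(simulate_once(tasks, rng) for _ in range(runs))

results = simulate(TASKS)
p50 = median(results)
p90 = results[int(0.9 * len(results))]
print(f"median finish: {p50:.1f} days, 90th percentile: {p90:.1f} days")
```

Adding up the "most likely" numbers along the longest path gives 15 days, and the simulated median comes out later than that, because every task has a long right tail and the integration step has to wait for the slower of its two dependencies.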
I definitely agree that this is the way to get the most accurate prediction practically possible, and that organizational dysfunction often means this isn’t used, even when the organization would be better able to achieve its goals with an accurate prediction. But I also think that depending on the type of project, producing an accurate Gantt chart may take a substantial fraction of the effort (or even a substantial fraction of the wall-clock time) of finishing the entire project, or may not even be possible without already having some of the outputs of the processes earlier in the chart. These aren’t necessarily possible to eradicate, so the take-away, I think, is not to be overly optimistic about the possibility of getting accurate schedules, even when there are no ill intentions and all known techniques to make more accurate schedules are used.
It’s interesting that the median of the pessimistic expectations is about equal to the median of the actual results. The mean clearly wasn’t, as that discrepancy was literally the point of citing this statistic in the OP:
in a classic experiment, 37 psychology students were asked to estimate how long it would take them to finish their senior theses “if everything went as poorly as it possibly could,” and they still underestimated the time it would take, as a group (the average prediction was 48.6 days, and the average actual completion time was 55.5 days).
So the estimates were biased, but not median-biased (at least that’s what Wikipedia appears to say the terminology is). Less biased than other estimates, though. Of course this assumes we’re taking the answer to “how long would it take if everything went as poorly as it possibly could” and interpreting it as the answer to “how long will it actually take”, and if students were actually asked after the fact if everything went as poorly as it possibly could, I predict they would mostly say no. And treating the text “if everything went as poorly as it possibly could” as if it wasn’t even there is clearly wrong too, because they gave a different (more biased towards optimism) answer if it was omitted.
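A small worked example of how estimates can be median-unbiased and still biased in the mean. The error values below are hypothetical, chosen only to show the effect of one long-tail overrun; they are not the study’s data.

```python
from statistics import mean, median

# Hypothetical signed errors in days (predicted minus actual), invented for illustration.
# Half the predictions are late-side, half are early-side, and one overrun is large.
errors = [5, 3, 2, -2, -4, -40]

share_too_early = sum(e < 0 for e in errors) / len(errors)

print("median error:", median(errors))       # centered on zero
print("mean error:", mean(errors))           # dragged negative by the -40 overrun
print("share of predictions that were too early:", share_too_early)
```

Half of these predictions are too early and half are too late, so the median error is zero, but the single 40-day overrun pulls the mean error to minus six days.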
This specific question seems kind of hard to make use of from a first-person perspective. But I guess maybe as a third party one could ask for worst-possible estimates and then treat them as median-unbiased estimators of what will actually happen? Though I also don’t know if the median-unbiasedness is a happy accident. (It’s not just a happy accident, there’s something there, but I don’t know whether it would generalize to non-academic projects, projects executed by 3rd parties rather than oneself, money rather than time estimates, etc.)
I do still also think there’s a question of how motivated the students were to give accurate answers, although I’m not claiming that if properly motivated they would re-invent Murphyjitsu / the pre-mortem / etc. from whole cloth; they’d probably still need to already know about some technique like that and believe it could help get more accurate answers. But even if a technique like that is an available action, it sounds like a lot of work, only worth doing if the output has a lot of value (e.g. if one suspects a substantial chance of not finishing the thesis before it’s due, one might wish to figure out why so one could actively address some of the reasons).
It didn’t work for the students in the study in the OP. That’s literally why the OP mentioned it!
It depends on what you mean by “didn’t work”. The study described is published in a paper only 16 pages long. We can just read it: http://web.mit.edu/curhan/www/docs/Articles/biases/67_J_Personality_and_Social_Psychology_366,_1994.pdf