RFC on an open problem: how to determine probabilities in the face of social distortion
So, I have a problem that I want help with, but I don’t want to focus on the object-level thing except as a particularly sharp example of the abstract problem. Please only throw suggestions here if you have a reasonable understanding of Bayes’ rule, *and* grok the problem I’m pointing at.
First, the problem I’m ACTUALLY trying to solve: If I believe I solved a practical problem to the satisfaction of its stakeholders, what is the probability that my belief is correct?
Now, the problem *around* the problem I’m actually trying to solve:
I can look at a solution I implemented to a practical problem, and judge whether I’m satisfied with it or not. BUT, often there are more actual stakeholders than just me. These stakeholders ALSO have opinions about whether I implemented the solution to their satisfaction.
Note I said “have opinions about”, *not* “have information about”. This is an important distinction, as you’re about to see.
So. Let’s say I have a confidence probability of p=0.65 that I performed a particular task well.
There are five other identified stakeholders, and their averaged confidence is something like p=0.4 that I performed that task well.
I have some weighting process that says whose opinion I take more seriously, so I adjust my confidence down to p=0.55.
I notice that this adjusts their averaged confidence downward, to something like p=0.35.
I perform more tasks, and poll more. I notice that people’s opinion of my performance tracks mine, but is invariably adjusted downward by about 33%. So I adjust my opinion of my performance downward by about 30%.
Magically, this causes their opinion of my performance to drop by about 25%.
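To make the coupling concrete, here's a minimal Python sketch. The coupling factor, noise level, and update rule are all invented assumptions; the point is just that if their reported confidence is mostly a scaled copy of mine plus noise, then treating it as independent evidence and updating toward it walks both numbers downward:

```python
import random

def stakeholder_report(my_confidence, coupling=0.67, noise=0.03):
    """Assumed (not measured) model: stakeholders report roughly
    two-thirds of my stated confidence, plus a little noise."""
    return max(0.01, min(0.99, coupling * my_confidence + random.gauss(0, noise)))

my_p = 0.65
for step in range(5):
    their_p = stakeholder_report(my_p)
    # Naive "update toward the stakeholders" rule: a weighted average.
    # This is exactly the trap -- their number isn't independent evidence here.
    my_p = 0.7 * my_p + 0.3 * their_p
    print(f"step {step}: me={my_p:.2f}, them={their_p:.2f}")
```

Under that assumed model, whatever real signal their reports carry lives in the residual after you subtract out the coupling, which is exactly the part I can't isolate.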
After taking enough samples, I start deliberately adjusting my opinion of my performance *upwards* massively. If I feel a 0.7 probability that I satisfied the requirement, I convince myself that it's actually a 0.9 probability. Consistently, other people start treating my performance as if it were higher quality, but they ALSO treat me as if I'm an arrogant ass who can't judge the quality of my performance. Because they're still setting their own opinions to something like 33% below mine, but when they translate the probability into something like 'confidence weight' (which I assume is some kind of log-odds transform), being 33% less confident than me FEELS like a much bigger gap when I'm 90% sure and they're 60% sure than when I'm 65% sure and they're 45% sure.
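Taking the log-odds guess at face value, the "feels like a much bigger gap" part does check out numerically. A quick sanity check, assuming a plain logit transform (which is only a guess at what "confidence weight" means):

```python
from math import log

def logit(p):
    """Log-odds of a probability p."""
    return log(p / (1 - p))

# Roughly the same relative discount in probability terms,
# but more than double the gap in log-odds.
print(logit(0.65) - logit(0.45))  # ~0.82
print(logit(0.90) - logit(0.60))  # ~1.79
```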
I’m convinced that there’s SOME signal in the noise of their confidence in my ability, but so much of it seems coupled to my OWN confidence in my ability, that I can’t tell.
So, how do I calibrate?
I have three points of frustration here that I wish to caution responders about:
1. A meta-level problem I have with these situations is that I often notice that people are spending more time trying to convince me that they're using something like Bayes' rule than they are actually generating observable evidence that they're using something like Bayes' rule. So assume that at least one of us is miscalibrated about being miscalibrated, *AND* that we're collectively miscalibrated about being miscalibrated about being miscalibrated.
2. Many people, when I talk about this, tell me "oh, well *I* don't take your confidence and adjust downward, so please exclude me from the list of people who do." Most of the people who tell me this are in fact doing so. This means that even if you AREN'T doing so, saying those words is not evidence that you aren't doing so, and in fact is weak evidence that you are. If you decide to emit those words anyway, I will assume you didn't actually grok the Sequences or the correct bits of HPMOR, and I will discount your advice accordingly.
3. Most people, when they try to describe the process they think they use to arrive at a confidence level that I solved a problem, craft a narrative story about why some evidence is relevant and other evidence isn’t. These narratives suspiciously change from situation to situation, such that different bits of evidence are relevant in one case and not in another, in a way that *looks* highly motivated to arrive at numbers that *appear* to actually just be tracking my own confidence level. Most people react with offense or frustration rather than curiosity when I ask them what they’re using to determine which evidence is relevant and which isn’t. Don’t be that guy.
I don’t have immediate and useful insights or revelations, but the one thing that really jumped out for me, reading this, was the sense that two things are being conflated: people’s confidence that the quality of the work met or exceeded Bar X, and people’s different senses of where Bar X is.
I don’t know for sure that this will end up being a useful avenue for finding solutions, but my claim is something like “the real problem isn’t just people disagreeing about whether you met the bar, it’s people disagreeing about the actual grade out of 100, with ‘what makes for a passing grade here?’ being an additional, second, obfuscating question.”
Regarding grades and bars:
Let’s say I’m building a giant sphere in the desert.
Lots of people can grade my design, but ultimately I want to know the probability that it’s going to buckle, especially while under load.
Or, say I’m helping run a youth camp on X-risk.
Lots of people can grade my performance, but ultimately I want to know whether the event successfully taught the kids anything useful, especially related to saving the world.
In the case of the sphere, people can say “yeah it didn’t buckle, but that’s just because you were lucky”. In the case of the youth camp, people can say “yeah it worked, but that was despite you, not because of you”.
But what I want to know is, if we ran ten thousand youth camps, and we put me in five thousand of them and kept me out of five thousand of them, what would they look like?
Or if we built sixteen thousand spheres, and eight thousand of them used grade 8 bolts and eight thousand used grade 2 bolts, how many on each side buckle, and how many injuries are inflicted in each bucket?
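Spelled out as arithmetic, with invented counts purely to show the shape of the number I'm after:

```python
# Hypothetical outcome counts for the sphere thought experiment.
grade8 = {"built": 8000, "buckled": 40, "injuries": 3}
grade2 = {"built": 8000, "buckled": 520, "injuries": 61}

for name, arm in [("grade 8 bolts", grade8), ("grade 2 bolts", grade2)]:
    rate = arm["buckled"] / arm["built"]
    print(f"{name}: buckle rate {rate:.1%}, injuries {arm['injuries']}")

# The number I actually care about is the difference between the arms,
# not anyone's grade of my performance.
risk_reduction = grade2["buckled"] / grade2["built"] - grade8["buckled"] / grade8["built"]
print(f"absolute risk reduction from grade 8 bolts: {risk_reduction:.1%}")
```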
Because “grade my performance” is, and always has been, utter fucking bullshit. I care about you (you in particular, Duncan) grading my performance because I trust you to know enough to calibrate my grade to the numbers I ACTUALLY care about, most of the time. And when I don’t trust you to calibrate, I stop caring about your grade, because I don’t need anyone to suck my dick about how well I did.
But I DO care about whether what I’m doing is actually doing good for people, and most of the time I have less evidence of whether that’s true than they do—but most of the time they seem to not be attending to that evidence themselves; instead, they’re checking to see if I’m behaving like the sort of monkey that “should” be seen as doing good things. And that is UTTERLY INFURIATING.
I think the core of the problem is about developing good metrics for evaluating your performance that are less vague than “satisfaction of all stakeholders”.
When you are a computer programmer, you can make predictions about things like the following (a scoring sketch follows the list):
Whether your code will pass your tests
Whether there will be a request to change something about your code when it’s peer-reviewed
Whether your code will break the build
Whether, after your code goes live, some business-relevant metric will change.
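A minimal sketch of what tracking and scoring such predictions could look like; the predictions and probabilities below are invented, and the Brier score is just one reasonable scoring choice (a log score would also work):

```python
# Each entry: (what I predicted, my stated probability, what actually happened).
predictions = [
    ("code passes CI on the first push",       0.80, True),
    ("reviewer requests changes",              0.60, True),
    ("deploy breaks the build",                0.10, False),
    ("business metric moves >1% after launch", 0.30, False),
]

# Brier score: mean squared error between stated probability and outcome.
# 0.0 is perfect; always saying 50% scores 0.25.
brier = sum((p - outcome) ** 2 for _, p, outcome in predictions) / len(predictions)
print(f"Brier score over {len(predictions)} predictions: {brier:.3f}")
```

The useful property is that every row resolves against something observable, not against anyone's opinion of your performance.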
There’s the old Drucker maxim of “What’s measured gets improved.” If you do a bigger project, then it’s useful to agree with the other stakeholders on metrics that measure whether the product is successful.
When it comes to social interaction, I sometimes make predictions about whether a person will hang out with me when I ask to meet. I could do the same for other requests where I want another person to do something.