Money should be able to guarantee that, over several periods of play, you perform not much worse than an actual expert. Here: https://www.cs.cmu.edu/afs/cs.cmu.edu/academic/class/15859-f11/www/notes/lecture16.pdf is a set of lecture notes about an idealized CS version of this problem.
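For concreteness, here is a minimal sketch of that idealized setup, assuming the standard multiplicative-weights / “prediction with expert advice” formulation; the update rule, eta, and toy data below are illustrative and not taken from the linked notes.

```python
# Minimal sketch of "prediction with expert advice" via the multiplicative
# weights / Hedge update. Losses are assumed to lie in [0, 1]; eta and the
# toy data are illustrative assumptions, not taken from the linked notes.
import math
import random

def multiplicative_weights(expert_losses, eta=0.1):
    """expert_losses: one list of per-expert losses (in [0, 1]) per round.
    Returns the algorithm's expected total loss."""
    n = len(expert_losses[0])
    weights = [1.0] * n
    total_loss = 0.0
    for losses in expert_losses:
        z = sum(weights)
        probs = [w / z for w in weights]
        # Expected loss when following a random expert drawn from the weights.
        total_loss += sum(p * l for p, l in zip(probs, losses))
        # Downweight each expert exponentially in its loss this round.
        weights = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
    return total_loss

# Toy run: 3 experts over 1000 rounds; expert 0 is genuinely better on average.
random.seed(0)
rounds = [[0.3 * random.random(), random.random(), random.random()]
          for _ in range(1000)]
alg = multiplicative_weights(rounds)
best = min(sum(r[i] for r in rounds) for i in range(3))
print(f"algorithm: {alg:.1f}, best expert: {best:.1f}")
```

With eta tuned on the order of sqrt(ln(n)/T), the algorithm’s total loss exceeds the best single expert’s loss by only O(sqrt(T·ln n)), which is the formal version of “not much worse over several periods of play.”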
Cool piece!
I don’t think it’s particularly relevant to the problems this post is talking about, since things like “how do we evaluate success?” or “what questions should we even be asking?” are core to the problem; we usually don’t have lots of feedback cycles with clear, easy-to-evaluate outcomes. (The cases where we do have lots of feedback cycles with clear, easy-to-evaluate outcomes tend to be the “easy cases” for expert evaluation, and those methods you linked are great examples of how to handle the problem in those cases.)
Drawing from some of the examples:
Evaluating software engineers is hard because, unless you’re already an expert, you can’t just look at the code or the product. The things which separate the good from the bad mostly involve long-term costs of maintenance and extensibility.
Evaluating product designers is hard because, unless you’re already an expert, you won’t consciously notice the things which matter most in a design. You’d need to e.g. A/B test designs on a fairly large user base (a rough sense of how large is sketched after these examples), and even then you need to be careful about asking the right questions to avoid Goodharting.
In the smallpox case, the invention of clinical trials was exactly what gave us lots of clear, easy-to-evaluate feedback on whether things work. Louis XV only got one shot, and he didn’t have data on hand from prior tests.
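On the “fairly large user base” point above: a back-of-the-envelope sample-size calculation for a two-proportion A/B test. The baseline conversion rate, lift, significance level, and power are illustrative assumptions, not numbers from the post.

```python
# Rough sample size per arm for detecting a lift from a 5.0% to a 5.5%
# conversion rate at alpha = 0.05 (two-sided) with 80% power.
# All numbers are illustrative assumptions.
from math import sqrt
from statistics import NormalDist

def samples_per_arm(p1, p2, alpha=0.05, power=0.80):
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96
    z_beta = NormalDist().inv_cdf(power)            # ~0.84
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return numerator / (p1 - p2) ** 2

print(round(samples_per_arm(0.050, 0.055)))  # ~31,000 users per arm (~62,000 total)
```

Even a fairly small design difference takes tens of thousands of users to detect reliably, which is part of why a non-expert can’t just eyeball two designs, and why picking the metric to test (the Goodhart worry) matters so much.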