I’ll echo the other commenters in saying this was interesting and valuable, but also (perhaps necessarily) left me with some significant inferential gaps to cross. The biggest for me were in going from game descriptions to equilibria. Maybe this is just a thing that can’t be made intuitive to people who haven’t worked it out themselves? But I think that, e.g., graphs of the kinds of distributions you get in the different cases would have helped me, at least.
I also had to think for a bit about what assumptions you were making here:
> A more rigorous or multi-step process could have only done so much. To get better information, they would have had to add a different kind of test. That would risk introducing bad noise.
A very naive model says additional tests → uncorrelated noise → less noise in the average (averaging n tests with independent noise cuts the noise variance by a factor of n).
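To spell out that naive model (a toy simulation with numbers I made up, nothing from the post): if each test is true skill plus independent noise, the average score tracks skill better as you add tests.

```python
import numpy as np

rng = np.random.default_rng(0)
n_candidates = 10_000

true_skill = rng.normal(0, 1, size=n_candidates)   # latent quality
noise_sd = 2.0                                      # per-test noise, larger than the signal

for k in (1, 2, 4, 8):
    # each of the k tests = true skill + independent noise
    scores = true_skill[:, None] + rng.normal(0, noise_sd, size=(n_candidates, k))
    avg = scores.mean(axis=1)
    # correlation with true skill rises as the uncorrelated noise averages out
    print(f"{k} tests: corr(avg, skill) = {np.corrcoef(true_skill, avg)[0, 1]:.2f}")
```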
More realistically, we can assume that some dimensions of quality are easier to Goodhart than others, and you don’t know which are which beforehand. But then, how do you know your initial choice of test isn’t Goodhart-y? And even if the Goodhart noise is much larger than the true variation in skill, it seems like you can aggregate scores in a way that extracts the information from the different tests without being bamboozled. (Depending on your use-case, you could take the average of a concave function of the scores, or use quantiles, or take the min score, etc.)
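For instance (a toy sketch with invented numbers, covering only the quantile/min options from that list): if one test can be gamed by some candidates, order-statistic aggregates get fooled much less than the plain average.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
skill = rng.normal(0, 1, size=n)

# three tests measure skill with modest independent noise...
scores = skill[:, None] + rng.normal(0, 0.5, size=(n, 3))
# ...but half the candidates have figured out how to game test 0
gamers = rng.random(n) < 0.5
scores[gamers, 0] += 3.0

aggregates = {
    "mean":   scores.mean(axis=1),        # dragged around by the gamed test
    "median": np.median(scores, axis=1),  # a quantile; shrugs off the one inflated score
    "min":    scores.min(axis=1),         # worst-case score; also ignores it
}
for name, agg in aggregates.items():
    print(f"{name:>6}: corr with skill = {np.corrcoef(skill, agg)[0, 1]:.2f}")
```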
In reality, though, you usually have some idea which dimensions are important for the job. Maybe it’s something like PCA, with the signal-to-noise ratio of the dimensions decreasing as you go down the list of components. Then that decrease, plus the marginal cost of more tests, means there is some natural stopping point. I guess that makes sense, but it took me a bit to get there. Is that what you were thinking?
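Here is the stopping-point logic as I understand it, with entirely made-up numbers: if each successive component carries less job-relevant signal while the cost of one more test stays flat, you only add tests until the next one stops paying for itself.

```python
import numpy as np

# Made-up numbers: test k targets component k, whose job-relevant signal
# decays geometrically, while the cost of administering a test is constant.
marginal_signal = 0.5 ** np.arange(1, 9)   # value of information from tests 1..8
marginal_cost = 0.1                        # fixed cost per additional test

# because marginal_signal is decreasing, the tests worth running are a prefix
worth_running = marginal_signal > marginal_cost
stopping_point = int(worth_running.sum())

print("marginal value of each test:", np.round(marginal_signal, 3))
print("natural stopping point: run the first", stopping_point, "tests")   # -> 3 here
```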