If legibility of expertise is a bottleneck to progress and adequacy of civilization, it seems like creating better benchmarks for knowledge and expertise for humans might be a valuable public good. While that seems difficult for aesthetics, it seems easier for engineering? I’d rather listen to a physics PhD who, years into their professional career, still gets Thinking Physics questions right (with good calibration) than to one who doesn’t.
One way to do that is to force experts to make forecasts, but this takes a lot of time to hash out and even more time to resolve.
One idea I just had related to this: the same way we use datasets like MMLU, MMMU, etc. to evaluate language models, we could use a small dataset of this kind for humans. Experts are allowed to take the test, performance on it is always public, and you make a new test every month or year.
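To make the “right answers with good calibration” part concrete, here is a minimal sketch of how such a test could be scored: each expert picks an answer and states a confidence, and you report plain accuracy alongside a Brier score for calibration. The data structures and function names here are hypothetical illustrations, not part of any existing benchmark.

```python
from dataclasses import dataclass


@dataclass
class Answer:
    chosen: str        # option the expert picked
    confidence: float  # expert's stated probability of being right (0..1)
    correct: str       # ground-truth option


def score_expert(answers: list[Answer]) -> dict[str, float]:
    """Return accuracy plus a Brier score as a rough calibration measure."""
    n = len(answers)
    accuracy = sum(a.chosen == a.correct for a in answers) / n
    # Brier score: mean squared gap between stated confidence and the 0/1 outcome.
    # Lower is better; a well-calibrated expert's confidence tracks their hit rate.
    brier = sum((a.confidence - (a.chosen == a.correct)) ** 2 for a in answers) / n
    return {"accuracy": accuracy, "brier": brier}


if __name__ == "__main__":
    # Hypothetical example data for three multiple-choice questions.
    answers = [
        Answer(chosen="B", confidence=0.9, correct="B"),
        Answer(chosen="A", confidence=0.6, correct="C"),
        Answer(chosen="D", confidence=0.8, correct="D"),
    ]
    print(score_expert(answers))
```

Publishing both numbers per expert would let readers distinguish “knows a lot” from “knows what they don’t know”, which is the property that matters most when deciding whom to listen to.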
Maybe you also get some participants to do these questions in a quiz-show format and put it on YouTube, so the test becomes more popular? I would watch that.
The disadvantage of this method compared to tests people prepare for in academia would be that the data would be quite noisy. On the other hand, this measure could be more robust to Goodharting and fraud (although of course this would become a harder problem once someone actually cared about the test). This process would inevitably miss genius hedgehogs, of course, but maybe not their ideas, if the generalists can properly evaluate them.
There are also some obvious issues in choosing what kinds of questions one uses as representative.
If you have a test that actually measures expertise in engineering well, it’s going to be valuable for those who make hiring decisions.
Triplebyte essentially seems to have found a working business model built around testing for expertise in programming. If you can do something similar to Triplebyte for other areas of expertise, that might be a good business model.
As far as genius hedgehogs in academia go, they currently find it very hard to get funding for their ideas. If you replaced the current process of writing a grant proposal with taking a test that measures expertise, I would expect the diversity of ideas that get researched to increase.
Triplebyte? You mean, the software job interviewing company?
They had some scandal a while back where they made old profiles public without permission, and some other problems that I read about but can’t remember now.
They didn’t have a better way of measuring engineering expertise, they just did the same leetcode interviews that Google/etc did. They tried to be as similar as possible to existing hiring at multiple companies; the idea wasn’t better evaluation but reducing redundant testing. But companies kind of like doing their own testing.
They’re gone now, acquired by Karat, which seems to be selling companies a way to make their own leetcode interviews using Triplebyte’s system, thus defeating the original point.