This is interesting! Although I think it’s pretty hard to use that in a benchmark (because you need a set of problems assigned to clearly defined types and I’m not aware of any such dataset).
you need a set of problems assigned to clearly defined types and I’m not aware of any such dataset
Hm, I was thinking something as easy to categorize as “multiplying numbers of n digits”, or “the different levels of MMLU” (although again, they already know about MMLU), or “independently do X online (for example create an account somewhere)”, or even some of the tasks from your paper.
I guess I was thinking less about “what facts they know”, which is pure memorization (although this is also interesting), and more about “cognitively hard tasks”, that require some computational steps.
You want to make it clear to the LLM what the task is (multiplying n digit numbers is clear but “doing hard math questions” is vague) and also have some variety of difficulty levels (within LLMs and between LLMs) and a high ceiling. I think this would take some iteration at least.
This is interesting! Although I think it’s pretty hard to use that in a benchmark (because you need a set of problems assigned to clearly defined types and I’m not aware of any such dataset).
There are some papers on “do models know what they know”, e.g. https://arxiv.org/abs/2401.13275 or https://arxiv.org/pdf/2401.17882.
Hm, I was thinking something as easy to categorize as “multiplying numbers of n digits”, or “the different levels of MMLU” (although again, they already know about MMLU), or “independently do X online (for example create an account somewhere)”, or even some of the tasks from your paper.
I guess I was thinking less about “what facts they know”, which is pure memorization (although this is also interesting), and more about “cognitively hard tasks”, that require some computational steps.
You want to make it clear to the LLM what the task is (multiplying n digit numbers is clear but “doing hard math questions” is vague) and also have some variety of difficulty levels (within LLMs and between LLMs) and a high ceiling. I think this would take some iteration at least.