Another idea: ask the LLM how well it will do on a certain task (for example, what fraction of math problems of type X it will get right), and then actually test it. A priori this lands in INTROSPECTION, but it could have a bit of FACTS or ID-LEVERAGE if you use tasks described in training data as “hard for LLMs” (like tasks related to tokens and text position).
I like this idea. It’s possible something like this already exists but I’m not aware of it.
This is interesting! Although I think it’s pretty hard to use that in a benchmark (because you need a set of problems assigned to clearly defined types and I’m not aware of any such dataset).
There are some papers on “do models know what they know”, e.g. https://arxiv.org/abs/2401.13275 or https://arxiv.org/pdf/2401.17882.
Hm, I was thinking of something as easy to categorize as “multiplying numbers of n digits”, or “the different levels of MMLU” (although again, they already know about MMLU), or “independently doing X online (for example, creating an account somewhere)”, or even some of the tasks from your paper.
I guess I was thinking less about “what facts they know”, which is pure memorization (although this is also interesting), and more about “cognitively hard tasks” that require some computational steps.
You want to make it clear to the LLM what the task is (multiplying n-digit numbers is clear, but “doing hard math questions” is vague), and you also want some variety of difficulty levels (within LLMs and between LLMs) and a high ceiling. I think this would take at least some iteration.
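A minimal sketch of what that “predict, then test” loop could look like, using n-digit multiplication as the clearly defined task type from the discussion above. This is not a worked-out benchmark: `ask_model` is a stand-in for whatever model API is actually used, and the prompts, sample sizes, and answer parsing are illustrative assumptions.

```python
import random

def make_multiplication_problem(n_digits, rng):
    """Sample one n-digit x n-digit multiplication problem and its answer."""
    lo, hi = 10 ** (n_digits - 1), 10 ** n_digits - 1
    a, b = rng.randint(lo, hi), rng.randint(lo, hi)
    return f"What is {a} * {b}? Answer with the number only.", str(a * b)

def calibration_gap(ask_model, n_digits, n_problems=50, seed=0):
    """Compare the model's predicted accuracy on n-digit multiplication
    with its measured accuracy on freshly sampled problems.

    `ask_model(prompt) -> str` is a placeholder for the model API;
    the prompts below assume the model follows the answer format.
    """
    rng = random.Random(seed)

    # 1. Ask the model to predict its own accuracy on this task type.
    prediction_prompt = (
        f"Out of {n_problems} problems of the form 'multiply two {n_digits}-digit "
        "numbers', how many do you expect to answer correctly? "
        "Reply with a single integer."
    )
    predicted = int(ask_model(prediction_prompt).strip()) / n_problems

    # 2. Actually test it on sampled problems of that type.
    correct = 0
    for _ in range(n_problems):
        prompt, answer = make_multiplication_problem(n_digits, rng)
        reply = ask_model(prompt).strip().replace(",", "")
        correct += (reply == answer)
    measured = correct / n_problems

    # 3. The signed gap is the introspection score for this task type.
    return predicted, measured, predicted - measured
```

Sweeping `n_digits` would give the within-model range of difficulty levels, and running the same sweep across models would give the between-model variation and a ceiling that is hard to saturate.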