I quite like the Function Deduction eval we built, which is a problem-solving game that tests one’s ability to efficiently deduce a hidden function by testing its value on chosen inputs.
It’s runnable from the command line (after repo install) with the command:
oaieval human_cli function_deduction
(I believe that’s the right second term, but it might just be humancli)
The standard mode might be slightly easier than you want, because it gives some partial answer feedback along the way. There is also a hard mode, which does not give this partial feedback and so makes it harder to find any brute-forcing method.
For context, I could solve ~95% of the standard problems with a time limit of 5 minutes and a 20-turn game limit. GPT-4 could solve roughly half within that turn limit IIRC, and o1-preview could do ~99%. (o1-preview also scored ~99% on the hard mode variant; I never tested myself on this but I estimate maybe I’d have gotten 70%? The problems could also be scaled up in difficulty pretty easily if one wanted.)
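(To give a rough feel for the mechanic, here’s a toy sketch of the game loop in Python. This is purely illustrative and not the eval’s actual code: the candidate pool, the “guess the expression” win condition, and the absence of partial feedback are all assumptions on my part.)

import random

# Hypothetical candidate pool -- the real eval's hidden functions are different.
CANDIDATES = {
    "x + 3": lambda x: x + 3,
    "2 * x": lambda x: 2 * x,
    "x ** 2": lambda x: x ** 2,
    "x ** 2 - 1": lambda x: x ** 2 - 1,
    "3 * x + 1": lambda x: 3 * x + 1,
}

def play(max_turns=20):
    # Draw a hidden function; the player queries values or guesses the expression.
    name, hidden = random.choice(list(CANDIDATES.items()))
    for turn in range(1, max_turns + 1):
        move = input(f"Turn {turn}: enter an integer to query, or 'guess <expr>': ").strip()
        if move.startswith("guess "):
            answer = move[len("guess "):].strip()
            print("Correct!" if answer == name else f"Incorrect; it was {name}")
            return
        try:
            x = int(move)
        except ValueError:
            print("Please enter an integer or 'guess <expr>'")
            continue
        print(f"f({x}) = {hidden(x)}")
    print(f"Out of turns; the hidden function was {name}")

if __name__ == "__main__":
    play()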
Tried running, but I got
[eval.py:233] Running in threaded mode with 10 threads!
which makes it unplayable for me (because it is trying to make me do 10 tests, alternating between them).
Oh interesting, I’m out at the moment and don’t recall having this issue, but if you override the default number of threads for the repo to 1, does that fix it for you?
https://github.com/openai/evals/blob/main/evals/eval.py#L211
(There are two places in this file where threads = appears; I would change 10 to 1 in each.)
Don’t really want to touch the packages, but just setting the EVALS_THREADS environment variable worked.
Great! Appreciate you letting me know & helping debug for others
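For anyone else who hits the threading issue, the environment-variable route would presumably be a one-liner (assuming a POSIX-style shell and the same eval name as above):
EVALS_THREADS=1 oaieval human_cli function_deduction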