I quite like the Function Deduction eval we built, which is a problem-solving game that tests one’s ability to efficiently deduce a hidden function by testing its value on chosen inputs.
It’s runnable from the command line (after repo install) with the command:
oaieval human_cli function_deduction
(I believe that’s the right second term, but it might just be humancli)
The standard mode might be slightly easier than you want, because it gives some partial answer feedback along the way. There is also a hard mode, which does not give this partial feedback and so makes it harder to find any brute-forcing method.
For context, I could solve ~95% of the standard problems with a time limit of 5 minutes and a 20-turn game limit. GPT-4 could solve roughly half within that turn limit IIRC, and o1-preview could do ~99%. (o1-preview also scored ~99% on the hard mode variant; I never tested myself on this but I estimate maybe I’d have gotten 70%? The problems could also be scaled up in difficulty pretty easily if one wanted.)
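(To give a rough feel for the mechanic, here’s a toy sketch of the game loop in Python. This is purely illustrative and not the eval’s actual code: the candidate pool, the “guess the expression” win condition, and the absence of partial feedback are all assumptions on my part.)

import random

# Hypothetical candidate pool -- the real eval's hidden functions are different.
CANDIDATES = {
    "x + 3": lambda x: x + 3,
    "2 * x": lambda x: 2 * x,
    "x ** 2": lambda x: x ** 2,
    "x ** 2 - 1": lambda x: x ** 2 - 1,
    "3 * x + 1": lambda x: 3 * x + 1,
}

def play(max_turns=20):
    # Draw a hidden function; the player queries values or guesses the expression.
    name, hidden = random.choice(list(CANDIDATES.items()))
    for turn in range(1, max_turns + 1):
        move = input(f"Turn {turn}: enter an integer to query, or 'guess <expr>': ").strip()
        if move.startswith("guess "):
            answer = move[len("guess "):].strip()
            print("Correct!" if answer == name else f"Incorrect; it was {name}")
            return
        try:
            x = int(move)
        except ValueError:
            print("Please enter an integer or 'guess <expr>'")
            continue
        print(f"f({x}) = {hidden(x)}")
    print(f"Out of turns; the hidden function was {name}")

if __name__ == "__main__":
    play()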
Tried running, but I got
[eval.py:233] Running in threaded mode with 10 threads!
which makes it unplayable for me (because it is trying to make me do 10 tests, alternating between them).
Oh interesting, I’m out at the moment and don’t recall having this issue, but if you override the default number of threads for the repo to 1, does that fix it for you?
https://github.com/openai/evals/blob/main/evals/eval.py#L211
(There are two places in this file where threads = appears; I would change 10 to 1 in each.)
Don’t really want to touch the packages, but just setting the EVALS_THREADS environment variable worked.
Great! Appreciate you letting me know & helping debug for others
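For anyone else who hits the threading issue, the environment-variable route would presumably be a one-liner (assuming a POSIX-style shell and the same eval name as above):
EVALS_THREADS=1 oaieval human_cli function_deduction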