[Question] What are the most interesting / challenging evals (for humans) available?

I want to build a nice testing ground for human rationality (that is to say, solving arbitrary complex problems in different domains using limited information and time)

This would be a lot of annoying work to assemble, but (un)fortunately there’s an AI industry that’s designing a whole bunch of evals to test the ability of their AIs to solve arbitrary complex problems in arbitrary domains, sooo… anyone have good recommendations for public eval questions?

I started my project using Thinking Physics exercises. I think there is something particularly nice about the quality of Thinking Physics exercises (in that they require conceptual reasoning), but they cover only one domain, and I also had trouble getting the author to sell me the rights to them.

I’ve used GPQA. The questions didn’t turn out to be as interesting as I expected (they’re not bad, but the skill ceiling didn’t turn out to be as high as I’d thought based on the description).

Evals/benchmarks are generally kept off the public internet. My plan is to use these on a website that requires a login and can include a canary string, but I’d be interested in questions that have been publicly released, or whose benchmark has been saturated, or whatever seems appropriate for cooperating with the authors’ intentions.

Do people have particular recommendations and/or any knowledge about what they expect to work well here?

ADDENDA:

I’m happy for now with “more interesting and harder and more varied-in-required-skillset than GPQA”, but my ideal has problems that would take a particularly smart x-risk researcher (like a 95th-percentile alignment researcher, i.e. there are somewhere between 10–30 people who might count) 30 minutes or so, and the median x-risk researcher more like 2–8 hours, to reliably get right on their first try (with maybe a 50% chance of figuring it out in 1–3 hours).

The ideal is that people have to:

  1. go through a period of planning, and replanning

  2. spend at least some time feeling like the problem is totally opaque and they don’t have traction.

  3. reach for tools that they don’t normally reach for.

It may be that we just don’t have evals at this level yet, and I might take what I can get, but it’s what I’m aiming for.

I’m not trying to make an IQ test – my sense from the literature is that you basically can’t raise IQ through training; so many people have tried. This is very weird to me – subjectively it is just really obvious to me that I’m flexibly smarter in many ways than I was in 2011 when I started the rationality project, and that this is due to a lot of habits I didn’t use to have. The hypotheses I currently have are:

  • You just have to be really motivated to do transfer learning, and have a genuinely inspiring / good teacher, and it’s just really hard to replicate this sort of training scientifically.

  • IQ mostly measures “fast intelligence”, because that’s what’s cost-effective to measure in large enough quantities to get a robust sample. I.e., it measures whether you can solve questions in a few minutes, which mostly depends on being able to intuitively get it. It doesn’t measure your ability to figure out how to figure something out when that requires long-term planning, which would allow a lot of planning skills to actually come into play.

Both seem probably at least somewhat true, but the latter feels like a clearer story for why there would be potential (at least theoretically) in the space I’m exploring – IQ tests take a few hours to administer. It would be extremely expensive to do the theoretical, statistically valid version of the thing I’m aiming at.

My explicit goal here is to train researchers who are capable of doing the kind of work necessary in worlds where Yudkowsky is right about the depth/breadth of alignment difficulty.