It would be really great to have human baselines, but it's very hard to do in practice. For a human, one of these tasks would take several hours.
I don't really have any funding for this project, but I might find someone who wants to do one task for fun, or make my best effort myself on a fresh task when I create one.
What we would really want is to have several top researchers/ML engineers do it, and I know that METR is working on that, so that is probably the best source we have for a realistic comparison at the moment.
> It would be really great to have human baselines, but it's very hard to do in practice. For a human, one of these tasks would take several hours.
My guess is it's <1 hour per task assuming just Copilot access, and much less if you're allowed to use e.g. o1 + Cursor in agent mode. That being said, I think you'd want to limit humans to comparable amounts of compute to get a comparable number, which seems a bit trickier to make happen.
> I don't really have any funding for this project, but I might find someone who wants to do one task for fun, or make my best effort myself on a fresh task when I create one.
Is there a reason you can't do one of the existing tasks, just to get a sense of the difficulty?
> My guess is it's <1 hour per task assuming just Copilot access, and much less if you're allowed to use e.g. o1 + Cursor in agent mode. That being said, I think you'd want to limit humans to comparable amounts of compute to get a comparable number, which seems a bit trickier to make happen.
I guess I was thinking that the human baseline should be without LLMs, because otherwise I could just forward the prompt to the best LLM, see what it did, and perhaps improve on it, which would always put the human level at or above the best LLM.

Then again, this is not how humans typically work now, so it's unclear what a "fair" comparison is. I guess it depends on what the human baseline is supposed to represent, and you have probably thought a lot about that question at METR.
> Is there a reason you can't do one of the existing tasks, just to get a sense of the difficulty?
I could, but it would not really be a fair comparison, since I have seen many of the LLMs' solutions and have seen what works.

Doing a fresh task I made myself would not be totally fair either, since I would know more about the data than the models do, but it would definitely be closer to fair.
Thank you!