METR releases a report, Evaluating frontier AI R&D capabilities of language model agents against human experts: https://metr.org/blog/2024-11-22-evaluating-r-d-capabilities-of-llms/
Daniel Kokotajlo and Eli Lifland both feel that, based on this report, one should update towards shorter timelines until the start of rapid acceleration driven by AIs doing AI research:
https://x.com/DKokotajlo67142/status/1860079440497377641
https://x.com/eli_lifland/status/1860087262849171797
Somewhat pedantic correction: they don’t say “one should update”. They say they update (plus some caveats).
Indeed
I’d like to see the x-axis on this plot scaled by a couple OOMs on a task that doesn’t saturate:
https://metr.org/assets/images/nov-2024-evaluating-llm-r-and-d/score_at_time_budget.png
My hunch (and a timeline crux for me) is that human performance actually scales in a qualitatively different way with time: it doesn’t just asymptote the way LLM performance does. And even the LLM scaling with time that we do see is an artifact of careful scaffolding.
I am a little surprised to see good performance up to the 2 hour mark though. That’s longer than I expected.
Edit: I guess only another doubling or two would be reasonable to expect.
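To make the asymptote-vs-keeps-scaling question concrete, here is a minimal curve-fitting sketch with entirely made-up numbers (nothing below comes from the METR data): over a 2-hour budget a saturating curve and a non-saturating one can fit roughly equally well, but a couple of OOMs further out they predict very different scores, which is what the extra x-axis range would help distinguish.

```python
# Illustrative only: synthetic scores, hypothetical functional forms.
import numpy as np
from scipy.optimize import curve_fit

def saturating(t, a, tau):
    # Score that approaches a ceiling `a` as the time budget grows.
    return a * (1 - np.exp(-t / tau))

def power_law(t, c, k):
    # Score that keeps improving with more time (no ceiling).
    return c * t**k

# Made-up "observed" scores at time budgets up to 2 hours (minutes).
t_obs = np.array([2.0, 8.0, 30.0, 60.0, 120.0])
score_obs = np.array([0.10, 0.22, 0.38, 0.47, 0.55])

p_sat, _ = curve_fit(saturating, t_obs, score_obs, p0=[0.6, 30.0])
p_pow, _ = curve_fit(power_law, t_obs, score_obs, p0=[0.08, 0.4])

# Both fit the observed range similarly; the extrapolations diverge.
# (Absolute values are meaningless here; the divergence is the point.)
for budget in [120, 1_200, 12_000]:  # 2 h, then +1 OOM, then +2 OOMs
    print(f"{budget:>6} min  saturating={saturating(budget, *p_sat):.2f}  "
          f"non-saturating={power_law(budget, *p_pow):.2f}")
```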
Yeah I think that’s a valid viewpoint.
Another viewpoint that points in a different direction: a few years ago, LLMs could only do tasks that take humans ~minutes. Now they’re at the ~hours point. So if this trend continues, eventually they’ll do tasks requiring humans days, weeks, months, … (a toy doubling calculation below makes this concrete).
I don’t have good intuitions that would help me to decide which of those viewpoints is better for predicting the future.
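As a purely illustrative sketch of that extrapolation (the doubling time is an arbitrary assumption for illustration, not a figure from the report):

```python
# Toy extrapolation of the "minutes -> hours -> days" viewpoint.
DOUBLING_TIME_MONTHS = 6      # hypothetical; chosen only for illustration
current_horizon_minutes = 60  # suppose models currently handle ~hour tasks

for months_ahead in range(0, 61, 12):
    doublings = months_ahead / DOUBLING_TIME_MONTHS
    minutes = current_horizon_minutes * 2**doublings
    print(f"+{months_ahead:2d} months: ~{minutes / 60:7.0f} hours "
          f"(~{minutes / (60 * 8):5.1f} eight-hour workdays)")
```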
One reason to prefer my position is that LLMs still seem to be bad at the kind of tasks that rely on using serial time effectively. For these ML-research-style tasks, scaling up to human performance over a couple of hours relied on taking the best of multiple calls, which seems like parallel time. That’s not the same as leaving an agent running for a couple of hours and seeing it work out something it previously would have been incapable of guessing (or that really couldn’t be guessed, but only discovered through interaction). I do struggle to think of tests like this that I’m confident an LLM would fail, though. Probably it would have trouble winning a text-based RPG? Or more practically speaking, could an LLM file my taxes without committing fraud? How well can LLMs play board games these days?
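A minimal sketch of the serial-vs-parallel distinction being drawn here; `call_model`, `score`, and `environment_step` are hypothetical placeholders, not the actual METR scaffolding:

```python
import random

def call_model(prompt: str) -> str:
    return f"attempt-{random.random():.3f}"   # placeholder model call

def score(solution: str) -> float:
    return random.random()                    # placeholder judge

def environment_step(action: str) -> str:
    return f"result of {action}"              # placeholder environment

def best_of_k(prompt: str, k: int) -> str:
    # "Parallel time": k independent attempts, keep the best. A bigger time
    # budget just buys more samples; no attempt builds on another.
    candidates = [call_model(prompt) for _ in range(k)]
    return max(candidates, key=score)

def serial_agent(prompt: str, steps: int) -> str:
    # "Serial time": each step sees the outcome of the previous action, so
    # later steps can use information that could only be obtained by
    # interacting with the task, not guessed up front.
    observation, action = prompt, ""
    for _ in range(steps):
        action = call_model(observation)
        observation = environment_step(action)
    return action
```

The contrast this is meant to illustrate: a score reached via best_of_k with a large k says less about an agent's ability to use two hours of serial interaction than the same score reached by a single long-running serial loop would.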