maxnadeau comments on AI Timelines

maxnadeau Jan 8, 2025, 1:31 AM
LW: 7 AF: 5
0
AF
You mentioned Cybench here. I think Cybench provides evidence against the claim “agents are already able to perform self-contained programming tasks that would take human experts multiple hours”. AFAIK, the most up-to-date Cybench run is in the joint AISI o1 evals. In this study (see Table 4.1, and note the caption), all existing models (other than o3, which was not evaluated here) succeed on ⁰⁄₁₀ attempts at almost all the Cybench tasks that take >40 minutes for humans to complete.
- elifland Jan 8, 2025, 1:40 AM
  LW: 5 AF: 4
  −1
  AF Parent
  I believe Cybench first-solve times are based on the fastest top professional teams, rather than typical individual CTF competitors or cyber employees, for whom the time to complete would probably be much higher (especially for the latter).
  - maxnadeau Jan 8, 2025, 1:50 AM
    LW: 5 AF: 4
    0
    AF Parent
    Do you think that cyber professionals would take multiple hours to do the tasks with 20-40 min first-solve times? I’m intuitively skeptical.
    One (edit: minor) component of my skepticism is that someone told me that the participants in these competitions are less capable than actual cyber professionals, because the actual professionals have better things to do than enter competitions. I have no idea how big that selection effect is, but it at least provides some countervailing force against the selection effect you’re describing.
    - elifland Jan 8, 2025, 1:55 AM
      LW: 8 AF: 5
      2
      AF Parent
      “Do you think that cyber professionals would take multiple hours to do the tasks with 20-40 min first-solve times? I’m intuitively skeptical.”
      Yes, that would be my guess, medium confidence.
      “One component of my skepticism is that someone told me that the participants in these competitions are less capable than actual cyber professionals, because the actual professionals have better things to do than enter competitions. I have no idea how big that selection effect is, but it at least provides some countervailing force against the selection effect you’re describing.”
      I’m skeptical of your skepticism. Not knowing basically anything about the CTF scene but using the competitive programming scene as an example, I think the median competitor is much more capable than the median software engineering professional, not less. People like competing at things they’re good at.
    - Neel Nanda Jan 9, 2025, 3:54 PM
      LW: 7 AF: 5
      2
      AF Parent
      I don’t know much about CTF specifically, but based on my maths exam/olympiad experience I predict that there are a lot of tricks for going fast (common question archetypes, saved code snippets, etc.) that will be top of mind for people actively practicing, but not for someone with a lot of domain expertise who doesn’t explicitly practice CTF. I also don’t know how important speed is for being a successful cyber professional. They might be able to get some of this speedup with a bit of practice, but I predict that by default there’s a lot of room for improvement.