On (a) and (b), we describe these at a high level here.
We don't really have anything written on (c) or (d). (c) depends a lot on effort, so I'd probably prefer to talk mostly about (d), including what evaluations would be needed at various points, etc.
For (a), I think we potentially care about all of:
1. Systems which are perhaps qualitatively similarly smart to OK software engineers and which are capable of speeding up R&D work by 10x (speedups aren't higher due to human bottlenecks). (On a nearcast, we'd expect such systems to be very broadly knowledgeable, pretty fast, and very well tuned for many of their usages.)
2. Systems which nearly strictly dominate top human scientists on capability and which are perhaps similar in qualitative intelligence (I'd guess notably, but not wildly, weaker, and compensating in various ways). Such systems likely have some domains/properties in which they are much better than any human or nearly any human.
3. Systems which are qualitatively smarter than any human by a small amount.
It's likely that control will be breaking down by (3), unless control ends up being quite easy or the implementation/evaluation is very good.
On (b) we plan on talking more about this soon. (Buck’s recent EAGx talk is pretty up to date with our current thinking, though this talk is obviously not that detailed. IDK if you can find a recording anywhere.)