@Buck do you or Ryan have a writeup that includes: (a) a description of the capabilities of a system that you think would be able to do something useful for the sorts of objectives that Habryka talks about and (b) what that something useful is.
Bonus points if it has (c) the likelihood that you think such a system will be controllable by 20XX and (d) what kind of control setup you think would be required to control it.
On (a) and (b), we describe this at a high level here.
We don’t really have anything written on (c) or (d). (c) really depends a lot on effort, so I’d probably prefer to talk mostly about (d), including what evaluations would be needed at various points, etc.
For (a), I think we potentially care about all of:
1. Systems which are perhaps qualitatively about as smart as OK software engineers and which are capable of speeding up R&D work by 10x (speedups aren’t higher due to human bottlenecks; see the rough arithmetic sketch below). (On a nearcast, we’d expect such systems to be very broadly knowledgeable, pretty fast, and very well tuned for many of their usages.)
2. Systems which nearly strictly dominate top human scientists on capability and which are perhaps similar in qualitative intelligence (I’d guess notably, but not wildly, weaker, compensating in various ways). Such systems likely have some domains/properties in which they are much better than any human or nearly any human.
3. Systems which are qualitatively smarter than any human by a small amount.
It’s likely control is breaking down by (3) unless control ends up being quite easy or the implementation/evaluation is very good.
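As a rough illustration of the human-bottleneck point in (1) (a back-of-the-envelope sketch, not a precise claim): if a fraction $b$ of the R&D pipeline stays bottlenecked on humans working at normal speed while AIs accelerate the remaining $1-b$ by a factor $s$, the overall speedup is

$$\text{overall speedup} = \frac{1}{b + \frac{1-b}{s}},$$

which approaches $1/b$ as $s \to \infty$. So if roughly 10% of the work stays human-gated ($b \approx 0.1$), even arbitrarily fast AI labor caps out around $10\times$.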
On (b), we plan to talk more about this soon. (Buck’s recent EAGx talk is pretty up to date with our current thinking, though the talk is obviously not that detailed. IDK if you can find a recording anywhere.)
Does that mean that you believe that after a certain point we would lose control over AI? I am new to this field, but doesn’t this fact spell doom for humanity?
By “control”, I mean AI Control: approaches aiming to ensure safety and benefit from AI systems, even if they are goal-directed and are actively trying to subvert your control measures.
AI control stops working once AIs are sufficiently capable (and likely doesn’t work for all possible deployments that might eventually be otherwise desirable), but there could be other approaches that work at that point, in particular aligning the systems.
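(To make “control measures” slightly more concrete: one simple measure from the control agenda is trusted monitoring, where a weaker model you trust scores the untrusted model’s proposed actions and the most suspicious ones go to a limited human-audit budget. Here’s a minimal toy sketch; the names, scoring stub, and interfaces are illustrative assumptions, not any real system:)

```python
# Toy sketch of a "trusted monitoring" control measure. Everything here
# (names, the scoring stub, the audit budget) is illustrative.
import random
from dataclasses import dataclass


@dataclass
class ScoredAction:
    description: str
    suspicion: float  # how suspicious the trusted monitor rates this action


def trusted_monitor_score(action_description: str) -> float:
    """Stand-in for asking a weaker, trusted model to rate (0-1) how
    suspicious the untrusted model's proposed action looks."""
    return random.random()


def run_control_protocol(proposed_actions: list[str], audit_budget: int) -> list[str]:
    """Score every proposed action with the trusted monitor, send the most
    suspicious ones (up to the human-audit budget) for review, and return
    the rest for execution."""
    scored = [ScoredAction(a, trusted_monitor_score(a)) for a in proposed_actions]
    scored.sort(key=lambda s: s.suspicion, reverse=True)
    flagged, allowed = scored[:audit_budget], scored[audit_budget:]
    for action in flagged:
        print(f"AUDIT (human review): {action.description!r} suspicion={action.suspicion:.2f}")
    return [a.description for a in allowed]


if __name__ == "__main__":
    actions = [f"proposed code change #{i}" for i in range(8)]
    print("Executing without audit:", run_control_protocol(actions, audit_budget=2))
```

The point is just that safety comes from the surrounding measures (monitoring, auditing, restricted affordances) rather than from trusting the model’s goals.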
The main hope I think about is something like:
1. Use control until AIs are capable enough that, if we trusted them, they could obsolete top human scientists and experts.
2. Use our controlled AI labor to do the work needed to make systems which are capable enough, trustworthy enough (via alignment), and philosophically competent enough that we can safely hand things off to them. (There might be some intermediate states to get to here.)
3. Have these systems which totally obsolete us figure out what to do, including figuring out how to align more powerful systems as needed.
We discuss our hopes more in this post.
Re (a): there’s nothing more specific on this than what we wrote in “the case for ensuring”, but I do think our answer there is pretty good.
Re (b): no, we need to write some version of that up. I think our answer here is OK but not amazing; writing it up is on the list.