I really think if you want to tell a story of AI Control work being good (especially compared to working on preventing AGI from being built in the first place), the important and difficult part is figuring out how to actually use these more powerful AI systems to either achieve some kind of global moratorium, or make unprecedented progress on the hard parts of the AI Alignment problem.
When I see most people start thinking about control, I rarely see them interface with either of these two things, and honestly, I mostly see them come up with cool additional commercial applications of AI, which I think will mostly make the present situation worse. To be clear, I think thinking about how to use AI systems for global coordination or making progress on the hard parts of the AI Alignment problem is tractable, but I haven’t seen much work on it, and I am worried it is not the default thing that people will work on when they start thinking about control.
I think it is unlikely for control work to buy humanity much time until someone builds a very powerful unaligned AI system, at least at our present levels of coordination tools. I think controlled AI systems might help with coordination here, but I do think someone needs to do the actual hard work of thinking about how it might.
(I go into this and various related things in my dialogue with Ryan on control)
FWIW, I don’t think of myself as having a super strong opinion on whether working on control is better than working on preventing people from building catastrophically dangerous AI (and I’ve seriously considered doing the latter).
(I do feel more optimistic than you about being able to use controlled AIs for something useful.)
@Buck do you or Ryan have a writeup that includes: (a) a description of the capabilities of a system that you think would be able to do something useful for the sorts of objectives that Habryka talks about and (b) what that something useful is?
Bonus points if it has (c) the likelihood that you think such a system will be controllable by 20XX and (d) what kind of control setup you think would be required to control it.
On (a) and (b), we describe this at a high level here.
We don’t really have anything written on (c) or (d). (c) really depends a lot on effort, so I’d probably prefer to talk mostly about (d), including what evaluations would be needed at various points, etc.
For (a), I think we potentially care about all of:
1. Systems which are perhaps qualitatively similarly smart to OK software engineers and which are capable of speeding up R&D work by 10x (speedups aren’t higher due to human bottlenecks). (On a nearcast, we’d expect such systems to be very broadly knowledgeable, pretty fast, and very well tuned for many of their usages.)
2. Systems which nearly strictly dominate top human scientists on capability and which are perhaps similar in qualitative intelligence (I’d guess notably, but not wildly, weaker and compensating in various ways). Such systems likely have some domains/properties in which they are much better than any human or nearly any human.
3. Systems which are qualitatively smarter than any human by a small amount.
It’s likely control is breaking down by (3) unless control ends up being quite easy or the implementation/evaluation is very good.
On (b) we plan on talking more about this soon. (Buck’s recent EAGx talk is pretty up to date with our current thinking, though this talk is obviously not that detailed. IDK if you can find a recording anywhere.)
Does that mean that you believe that after a certain point we would lose control over AI? I am new to this field, but doesn’t this fact spell doom for humanity?
By “control”, I mean AI Control: approaches aiming to ensure safety and benefit from AI systems, even if they are goal-directed and are actively trying to subvert your control measures.
AI control stops working once AIs are sufficiently capable (and likely doesn’t work for all possible deployments that might eventually be otherwise desirable), but there could be other approaches that work at that point. In particular, aligning systems.
The main hope I think about is something like:
1. Use control until AIs are capable enough that if we trusted them, we could obsolete top human scientists and experts.
2. Use our controlled AI labor to do the work needed to make systems which are capable enough, trustworthy enough (via alignment), and philosophically competent enough that we can safely hand things off to them. (There might be some intermediate states to get to here.)
3. Have these systems which totally obsolete us figure out what to do, including figuring out how to align more powerful systems as needed.
We discuss our hopes more in this post.
Re a, there’s nothing more specific on this than what we wrote in “the case for ensuring”. But I do think that our answer there is pretty good.
Re b, no, we need to write some version of that up; I think our answer here is ok but not amazing, writing it up is on the list.
yep agreed, I have a bunch of vague plans in this direction. I most generally think that AI control is a pretty good tool in the toolbox, and is unlikely to make things much worse but plausibly makes things much better.
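I think AI control agendas are defined in such a way that this metric isn’t as relevant as you think it is:
“I think it is unlikely for control work to buy humanity much time until someone builds a very powerful unaligned AI system, at least at our present levels of coordination tools.”
That’s because the agenda isn’t trying to make AIs alignable, but to make them useful and keep them from breaking out of labs, so the question of the timeline to unaligned AI is less relevant than it is for most methods of making safe AI.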