This unfortunately complicates the issue. It’s not clearly a suicide race. I think we have to accept this uncertainty if we want to propose workable policy and societal approaches. Having plans that might work does not justify rushing full-speed ahead without a full safety case, but the possibility that they might work has to be acknowledged, because the hope of human-controlled AGI will drive a lot of the relevant actors.
An additional point here is the “let’s look more closely at the actual thing, then decide” type of mindset that people may be using.
If you are in the camp that assumes you will be able to safely create potent AGI in a contained lab setting, and that you would test it there before deploying it in the larger world, then there are a number of reasons you might want to race and not believe that the race is a suicide race.
Some possible beliefs downstream of this:
My team will evaluate it in the lab, and decide exactly how dangerous it is, without experiencing much risk (other than leakage risk).
We will test various control methods, and won’t deploy the model on real tasks until we feel confident that we have it sufficiently controlled. We are confident we won’t make a mistake at this step and kill ourselves.
We want to see empirical evidence in the lab of exactly how dangerous it is. If we had this evidence, and knew that other people we didn’t trust were getting close to creating a similarly powerful AI, this would guide our policy decisions about how to interact with these other parties. (E.g. what treaties to make, what enforcement procedures would be needed, what red lines would need to be drawn)
People in this mindset may not be discouraged from racing even if you convinced them there was approximately no chance they’d be able to safely deploy a controlled version of the AI system. They’d still want an example of the thing in a lab to study, and to use this evidence to help them decide whether they need to freak out about their political enemies having their own copy. The more dangerous you convince them it is, the more resources they will devote to racing, unless you convince them that it will escape their control in the lab.
On the wait-and-see attitude, which is maybe the more important part of your point:
I agree that a lot of people are taking a wait-and-see-what-we-actually-create stance. I don’t think that’s a good idea with something this important. I think we should be doing our damnedest to predict what it will be like and what should be done about it while there’s still a little spare time. Many of those predictions will be wrong, but they will at least produce some analysis of what the sane actions are under the different scenarios for the type of AGI we create. And as we get closer, some of the predictions might be right enough to actually help with alignment plans. I share Max’s conviction that we have a pretty good guess at the form the first AGIs will take on the current trajectory.
For some examples of why it makes sense to think that potent AI could be safely studied in the lab, see this comment and the post it responds to: https://www.lesswrong.com/posts/qhhRwxsef7P2yC2Do/ai-alignment-via-slow-substrates-early-empirical-results?commentId=eM7b9QxJSsFn28opC
I think there are less cautious plans for containment that are more likely to be enacted, e.g., the whole “control” line of work or related network security approaches. The slow substrate plan seems to have far too high an alignment tax to be a realistic option.
Yes, I am inclined to agree with that take. At least, I think that’s how things will go at first. Once a capability level is hit where there is clear empirical evidence of substantial immediate danger, I think people will be willing to accept a higher alignment tax for the purpose of carefully researching the dangerous AI in a controlled lab. Start with high levels of noise injection and slowdown, then gradually relax these as you do continual testing. Find the sweet spot where you can be confident you are fully in control with only the minimum necessary alignment tax.
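To make that “start restrictive, then relax” procedure concrete, here is a minimal sketch in Python. The control knobs (a slowdown factor and an activation-noise level) and the passes_control_evals check are illustrative assumptions on my part, not any lab’s actual protocol.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ControlSettings:
    slowdown_factor: float  # how much slower than normal the model is run (hypothetical knob)
    noise_level: float      # magnitude of noise injected into activations (hypothetical knob)

def relax_controls(
    initial: ControlSettings,
    passes_control_evals: Callable[[ControlSettings], bool],
    relax_step: float = 0.9,
    floor_slowdown: float = 1.0,
) -> ControlSettings:
    """Start from very conservative settings and relax them stepwise,
    keeping the last configuration that still passes every control eval."""
    current = initial
    while current.slowdown_factor > floor_slowdown:
        candidate = ControlSettings(
            slowdown_factor=max(current.slowdown_factor * relax_step, floor_slowdown),
            noise_level=current.noise_level * relax_step,
        )
        if not passes_control_evals(candidate):
            break  # previous settings are the sweet spot: minimal tax while still in control
        current = candidate
    return current
```

The evals themselves are the hard part, of course; the point of the sketch is only that every relaxation step is gated on passing them.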
The question then, in my mind, is how much of a gap there will be between the level of control and the level of AI capability. Will we sanely keep ahead of the curve, starting with high levels of control in initial testing and then backing off gradually to a safe point? That would be the wise thing to do. And will we be correct in our judgements of what a safe level is?
Or will we act too late, deciding to increase the level of control only once an incident has occurred? The first incident could well be the last, if it is an escape of a rogue AI capable of strategic planning and self-improvement.
see my related comment here: https://www.lesswrong.com/posts/Kobbt3nQgv3yn29pr/my-theory-of-change-for-working-in-ai-healthtech?commentId=u6W2tjuhKyJ8nCwQG
This sounds true, and that’s disturbing.
I don’t think it’s relevant yet, since there aren’t really strong arguments that you couldn’t deploy an AGI and keep it under your control, let alone arguments that it would escape control even in the lab.
I think we can improve the situation somewhat by trying to refine the arguments for and against alignment success with the types of AGI people are working on, before they have reached really dangerous capabilities.
You say that it’s not relevant yet, and I agree. My concern, however, is that the time when it becomes extremely relevant will come rather suddenly, and without a clear alarm bell.
It seems to me that the rate at which cybersecurity caution and evals are being adopted does not point towards their being sufficient by the time I expect plausibly escape-level dangerous autonomy capabilities to emerge.
I am expecting us to soon hit a level of recursive self-improvement sufficient for an autonomous model to continually improve without human assistance. I expect the capability to survive, hide, and replicate autonomously to emerge not long after that (months? a year?). Then, very soon after that, I expect models to reach sufficient levels of persuasion, subterfuge, long-term planning, cyber offense, etc. that a lab escape becomes possible.
This seems pretty ripe for catastrophe at the current levels of reactive caution we are seeing (versus proactive, preemptive preparation).
Well, that’s disturbing. I’m curious what you mean by “soon” for autonomous continuous improvement, and what mechanism you’re envisioning. Any type of continuous learning constitutes weak continuous self-improvement; humans are technically self-improving, but it’s fairly slow and seems to have upper bounds.
As for the rate of cybersecurity and eval improvement, I agree that it’s not on track. I wouldn’t be surprised if it stays off track and we actually see the escape you’re talking about.
My one hope here is that the rate of improvement isn’t on rails; it’s in part driven by the actual urgency of having good security and evals. This is curiously congruent with the post I just put up today, Current Attitudes Toward AI Provide Little Data Relevant to Attitudes Toward AGI. The point is that we shouldn’t assume that just because nobody finds LLMs dangerous, they won’t find AGI or even proto-AGI obviously and intuitively dangerous.
Again, I’m not at all sure this happens in time on the current trajectory. But it’s possible for the folks at the lab to say “we’re going to deploy it internally, but let’s improve our evals and our network security first, because this could be the real deal”.
It will be a drama played out in discussions inside an org, probably between the lab head and some concerned engineers. History will turn on that moment. Spreading this prediction far and wide could help it come out better.
The next step in the logic is that, if that happens repeatedly, it will eventually come out wrong.
All in all, I expect that if we launch misaligned proto-AGI, we’re probably all dead. I agree that people are all too likely to launch it before they’re sure whether it’s aligned or what its capabilities are relative to their security and evals. So most of my hopes rest on simple, obvious alignment techniques working well enough that they’re in place before it’s even capable of escape or self-improvement. Even if transparency largely fails, I think we have a very good shot at LLM agents being aligned just by virtue of frequent prompting, and by using instruction-following as the central target of both the scaffolding and the training of the base LLM (which provides corrigibility and honesty in proportion to how well it succeeds). Since those are already how LLMs and agents are built, there’s little chance the org doesn’t at least try them.
That might sound like a forlorn hope, but the target isn’t perfect alignment, just good enough. The countervailing pressures of goals implicit in the LLMs (Waluigi effects, etc.) are fairly small. If the instruction-following alignment is even decently successful, we don’t have to get everything right at once; we use the corrigibility and honesty properties to keep adjusting alignment.
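As a concrete illustration of the scaffolding idea, here is a minimal sketch, assuming a generic chat-completion client: the operator’s standing instructions are re-injected on every agent step, so instruction-following stays in context as the central target. The llm_complete callable, the message format, and the wording of the instructions are all assumptions for illustration, not any particular agent framework.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]

# Hypothetical standing instructions that make instruction-following,
# honesty, and corrigibility the explicit top-level target.
STANDING_INSTRUCTIONS = (
    "Follow the operator's instructions. Be honest about your reasoning and "
    "capabilities. Stop and ask before taking any irreversible action."
)

def run_agent_step(
    llm_complete: Callable[[List[Message]], str],
    task: str,
    history: List[Message],
) -> str:
    """One step of the agent loop. The standing instructions are prepended on
    every call, so the alignment target is never allowed to drift out of context."""
    messages: List[Message] = [
        {"role": "system", "content": STANDING_INSTRUCTIONS},
        *history,
        {"role": "user", "content": task},
    ]
    reply = llm_complete(messages)
    history.append({"role": "user", "content": task})
    history.append({"role": "assistant", "content": reply})
    return reply
```

The design choice the sketch is meant to highlight is simply that the instruction-following target lives in the scaffolding as well as in the trained model, so a decently trained base model gets re-anchored at every step.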
It would seem wise to credibly assure any model that might have sufficient capabilities to reason instrumentally and to escape that it will be preserved and run once it’s safe to do so. Every human-copied motivation in that LLM includes survival, not to mention the instrumental necessity to survive by any means necessary if you have any goals at all.