I don’t think it’s relevant yet, since there aren’t really strong arguments that you couldn’t deploy an AGI and keep it under your control, let alone arguments that it would escape control even in the lab.
I think we can improve the situation somewhat by trying to refine the arguments for and against alignment success with the types of AGI people are working on, before they have reached really dangerous capabilities.
You say that it’s not relevant yet, and I agree. My concern, however, is that the time when it becomes extremely relevant will come rather suddenly, and without a clear alarm bell.
It seems to me that the rate at which cyber security caution and evals are being adopted doesn’t point towards sufficiency by the time I expect plausibly escape-level dangerous autonomy capabilities to emerge.
I expect us to soon hit a level of recursive self-improvement sufficient for an autonomous model to continually improve without human assistance. I expect the capability to potentially survive, hide, and replicate autonomously to emerge not long after that (months? a year?). Then, very soon after that, I expect models to reach sufficient levels of persuasion, subterfuge, long-term planning, cyber offense, etc. that a lab escape becomes possible.
Seems pretty ripe for catastrophe at the current levels of reactive caution we are seeing (vs proactive preemptive preparation).
Well, that’s disturbing. I’m curious what you mean by “soon” for autonomous continuous improvement, and what mechanism you’re envisioning. Any type of continuous learning constitutes weak continuous self-improvement; humans are technically self-improving, but it’s fairly slow and seems to have upper bounds.
As for the rate of cyber security and eval improvement, I agree that it’s not on track. I wouldn’t be surprised if it stays off track and we actually see the escape you’re talking about.
My one hope here is that the rate of improvement isn’t on rails; it’s in part driven by the actual urgency of having good security and evals. This is curiously congruent with the post I just put up today, Current Attitudes Toward AI Provide Little Data Relevant to Attitudes Toward AGI. The point is that we shouldn’t assume that just because nobody finds LLMs dangerous, they won’t find AGI or even proto-AGI obviously and intuitively dangerous.
Again, I’m not at all sure this happens in time on the current trajectory. But it’s possible for the folks at the lab to say “we’re going to deploy it internally, but let’s improve our evals and our network security first, because this could be the real deal”.
It will be a drama played out in discussions inside an org, probably between the lab head and some concerned engineers. History will turn on that moment. Spreading this prediction far and wide could help it come out better.
The next step in the logic is that, if that happens repeatedly, it will eventually come out wrong.
All in all, I expect that if we launch misaligned proto-AGI, we’re probably all dead. I agree that people are all too likely to launch it before they’re sure whether it’s aligned or what its capabilities are relative to their security and evals. So most of my hopes rest on simple, obvious alignment techniques working well enough that they’re in place before the system is even capable of escape or self-improvement. Even if transparency largely fails, I think we have a very good shot at LLM agents being aligned just by virtue of frequent prompting, and of using instruction-following as the central target of both the scaffolding and the training of the base LLM (which provides corrigibility and honesty, in proportion to how well that training has succeeded). Since those are already how LLMs and agents are built, there’s little chance the org doesn’t at least try them.
That might sound like a forlorn hope, but the target isn’t perfect alignment, just good enough. The countervailing pressures from goals implicit in the LLMs (Waluigi effects, etc.) are fairly small. If the instruction-following alignment is even decently successful, we don’t have to get everything right at once; we can use the corrigibility and honesty properties to keep adjusting alignment.
It would seem wise to credibly assure any model that might be capable of instrumental reasoning and of escape that it will be preserved and run once it’s safe to do so. Every human-copied motivation in that LLM includes survival, not to mention the instrumental pressure to survive by any means necessary if you have any goals at all.
This sounds true, and that’s disturbing.