the FLI letter asked for “pause for at least 6 months the training of AI systems more powerful than GPT-4” and i’m very much willing to defend that!
my own worry with RSPs is that they bake in (and legitimise) the assumptions that a) near-term (eval-less) scaling poses trivial x-risk, and b) there is a substantial period during which models trigger evals but are existentially safe. you must have thought about them, so i’m curious what you think.
that said, thank you for the post, it’s a very valuable discussion to have! upvoted.
the FLI letter asked for “pause for at least 6 months the training of AI systems more powerful than GPT-4” and i’m very much willing to defend that!
Sure, but I guess I would say that we’re back to nebulous territory then—how much longer than six months? When, if ever, does the pause end?
a) near-term (eval-less) scaling poses trivial x-risk
I agree that this is mostly baked in, but I think I’m pretty happy to accept it. I’d be very surprised if there was substantial x-risk from the next model generation.
But also I would argue that, if the next generation of models do pose an x-risk, we’ve mostly already lost—we just don’t yet have anything close to the sort of regulatory regime we’d need to deal with that in place. So instead I would argue that we should be planning a bit further ahead than that, and trying to get something actually workable in place further out—which should also be easier to do because of the dynamic where organizations are more willing to sacrifice potential future value than current realized value.
b) there is a substantial period during which models trigger evals but are existentially safe
Yeah, I agree that this is tricky. Theoretically, since we can set the eval bar at any capability level, there should exist capability levels that you can eval for and that are safe but scaling beyond them is not. The problem, of course, is whether we can effectively identify the right capability levels to evaluate in advance. The fact that different capabilities are highly correlated with each other makes this easier in some ways (lots of different early warning signs will all be correlated) but harder in others (the dangerous capabilities will also be correlated, so they could all come at you at once).
Probably the most important intervention here is to keep applying your evals while you’re training your next model generation, so they trigger as soon as possible. As long as there’s some continuity in capabilities, that should get you pretty far. Another thing you can do is put strict limits on how much labs are allowed to scale their next model generation relative to the models that have been definitively evaluated to be safe. And furthermore, my sense is that at least in the current scaling paradigm, the capabilities of the next model generation tend to be relatively predictable given the current model generation.
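As a very rough sketch of what those two interventions could look like together (everything here is illustrative: the function names, thresholds, and checkpoint frequency are hypothetical, not any lab’s actual pipeline):

```python
# Toy sketch of an eval-gated training run (all names and numbers hypothetical):
# dangerous-capability evals run at regular checkpoints *during* training, and
# training also halts if effective compute would exceed a fixed multiple of the
# largest model generation already evaluated as safe.

from dataclasses import dataclass

@dataclass
class EvalResult:
    triggered: bool
    detail: str = ""

def run_dangerous_capability_evals(step: int) -> EvalResult:
    # Stand-in for a real eval suite (which would also cover fine-tuned variants).
    return EvalResult(triggered=False)

def train_with_eval_gates(
    total_steps: int,
    compute_per_step: float,
    last_safe_compute: float,
    max_scaleup: float = 4.0,      # e.g. at most 4x the last evaluated generation
    eval_every_steps: int = 1000,  # checkpoint frequency for in-training evals
) -> str:
    compute_used = 0.0
    for step in range(1, total_steps + 1):
        compute_used += compute_per_step
        if compute_used > max_scaleup * last_safe_compute:
            return f"halted at step {step}: exceeded {max_scaleup}x the evaluated baseline"
        if step % eval_every_steps == 0:
            result = run_dangerous_capability_evals(step)
            if result.triggered:
                return f"halted at step {step}: dangerous-capability eval triggered ({result.detail})"
    return "run finished without triggering either gate"

print(train_with_eval_gates(total_steps=10_000, compute_per_step=1.0, last_safe_compute=5_000.0))
```

The gating logic itself is cheap to state; the expensive part is whatever sits inside the eval suite.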
So overall, my sense is that takeoff only has to be marginally continuous for this to work—if it’s extremely abrupt, more of a classic FOOM scenario, then you might have problems, but I think that’s pretty unlikely.
that said, thank you for the post, it’s a very valuable discussion to have! upvoted.
Thanks! Happy to chat about this more also offline.
Sure, but I guess I would say that we’re back to nebulous territory then—how much longer than six months? When, if ever, does the pause end?
i agree that, if hashed out, the end criteria may very well resemble RSPs. still, i would strongly advocate for a scaling moratorium until widely (internationally) acceptable RSPs are put in place.
I’d be very surprised if there was substantial x-risk from the next model generation.
i share the intuition that the current and next LLM generations are unlikely to pose an x-risk. however, i don’t trust my (or anyone else’s) intuitions strongly enough to say that there’s less than a 1% x-risk per 10x scaling of compute. in expectation, that’s killing 80M existing people—people who are unaware that this is happening to them right now.
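for scale, the arithmetic behind that 80M figure, assuming a world population of roughly 8 billion:

$$1\% \times 8 \times 10^{9}\ \text{people} = 8 \times 10^{7}\ \text{people} = 80\text{M}$$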
if the next generation of models do pose an x-risk, we’ve mostly already lost—we just don’t yet have anything close to the sort of regulatory regime we’d need to deal with that in place
Do you think that if Anthropic (or another leading AGI lab) unilaterally went out of its way to prevent building agents on top of its API, this would reduce the overall x-risk/p(doom) or not? I’m asking because here you seem to assume a defeatist position that only governments are able to shape the actions of the leading AGI labs (which, by the way, are very, very few—in my understanding, only 3 or 4 labs have any chance of releasing a “next generation” model within the next two years; others won’t be able to reach this level of capability even if they tried), but in the post you advocate for the opposite—for voluntary actions taken by the labs, with regulation to follow.
Do you think that if Anthropic (or another leading AGI lab) unilaterally went out of its way to prevent building agents on top of its API, this would reduce the overall x-risk/p(doom) or not?

Probably, but Anthropic is actively working in the opposite direction:
This means that every AWS customer can now build with Claude, and will soon gain access to an exciting roadmap of new experiences—including Agents for Amazon Bedrock, which our team has been instrumental in developing.
Currently available in preview, Agents for Amazon Bedrock can orchestrate and perform API calls using the popular AWS Lambda functions. Through this feature, Claude can take on a more expanded role as an agent to understand user requests, break down complex tasks into multiple steps, carry on conversations to collect additional details, look up information, and take actions to fulfill requests. For example, an e-commerce app that offers a chat assistant built with Claude can go beyond just querying product inventory – it can actually help customers update their orders, make exchanges, and look up relevant user manuals.
Obviously, Claude 2 as a conversational e-commerce agent is not going to pose catastrophic risk, but it wouldn’t be surprising if building an ecosystem of more powerful AI agents increased the risk that autonomous AI agents cause catastrophic harm.
Is evaluation of capabilities, which as you note requires fine-tuning and other such techniques, realistic to do properly and continuously during model training, without being prohibitively slow or expensive? Would doing this be part of the intended RSP?
Anthropic’s RSP includes evals after every 4x increase in effective compute or every 3 months, whichever comes sooner, even if this happens during training, and the policy says that these evaluations include fine-tuning.
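As a minimal sketch of that schedule logic (my own illustration of the stated rule, not Anthropic’s actual tooling):

```python
from datetime import date, timedelta

def eval_due(
    current_effective_compute: float,
    compute_at_last_eval: float,
    last_eval_date: date,
    today: date,
    compute_factor: float = 4.0,                   # re-evaluate after every 4x effective compute...
    max_interval: timedelta = timedelta(days=90),  # ...or roughly every 3 months, whichever is sooner
) -> bool:
    compute_trigger = current_effective_compute >= compute_factor * compute_at_last_eval
    time_trigger = today - last_eval_date >= max_interval
    return compute_trigger or time_trigger

# Example: only 3.5x compute growth so far, but 4 months elapsed, so the time trigger fires.
print(eval_due(3.5e25, 1.0e25, date(2023, 6, 1), date(2023, 10, 1)))  # True
```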
Do you know why 4x was picked? I understand that doing evals properly is a pretty substantial effort, but once we get up to gigantic sizes and proto-AGIs, a 4x gap seems like it could hide a lot. If there were a model sitting in training with 3x the training compute of GPT-4, I’d be very keen to know what it could do!
When, if ever, does the pause end?

maybe “when alignment is solved”