I don’t really know what people mean when they try to compare “capabilities advancements” to “safety advancements”. In one sense, it’s pretty clear: the common units are “amount of time”, so we should compare the marginal (probabilistic) difference between time-to-alignment and time-to-doom. But in practice I think people just go by vibes.
For example, if someone releases a new open-source model, people say that’s a capabilities advance and should not have been done. Yet I think there’s a pretty good case that having more well-trained open-source models does more to shorten time-to-alignment than to shorten time-to-doom, since much alignment work ends up being done with them, and the marginal capabilities advance here is zero: such work builds on the public state of the art, not the private state of the art, which is probably far more advanced.
I also don’t often see people making estimates of the time-wise differential impacts here. Maybe people think such estimates would be exfo/info-hazardous, but nobody even claims to have them when the topic comes up (even in private, though people are glad to talk about their hunches for what AI will look like in 5 years, or the types of advancements necessary for AGI), despite all the work on timelines. It’s difficult to do this for a marginal advance, but not so much for larger research priorities, which are the sorts of things people should be focusing on anyway.
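To make the units concrete, here’s a minimal sketch of the kind of comparison I have in mind (Python, with every number made up purely for illustration): put distributions on time-to-alignment and time-to-doom with and without some intervention, and look at how the margin between them moves.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

# Purely illustrative distributions over years; not real forecasts.
t_align = rng.lognormal(mean=np.log(15), sigma=0.5, size=N)  # baseline time-to-alignment
t_doom = rng.lognormal(mean=np.log(20), sigma=0.6, size=N)   # baseline time-to-doom

# Hypothetical effect of the intervention (e.g. an open-weights release):
# it shaves some uncertain amount of time off each clock.
t_align_new = t_align - rng.normal(1.0, 0.5, size=N)  # assumed speedup to alignment work
t_doom_new = t_doom - rng.normal(0.2, 0.3, size=N)    # assumed speedup to dangerous capabilities

# Quantity of interest: the margin by which alignment beats doom.
margin_before = t_doom - t_align
margin_after = t_doom_new - t_align_new

print("P(alignment before doom), baseline:     ", (margin_before > 0).mean())
print("P(alignment before doom), intervention: ", (margin_after > 0).mean())
print("Mean change in margin (years):          ", (margin_after - margin_before).mean())
```

The point isn’t these particular numbers; it’s that writing the comparison down turns vibes into distributions and deltas people can actually argue about.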
Yeah, I agree that releasing open-weights non-frontier models doesn’t seem like a frontier capabilities advance.
It does seem potentially like an open-source capabilities advance.
That can be bad in different ways.
Let me pose a couple hypotheticals.
What if frontier models were already capable of causing grave harms to the world if used by bad actors, and it is only the fact that they are kept safety-fine-tuned and restricted behind APIs that is preventing this? In such a case, it’s a dangerous thing to have open-weight models catching up.
What if there is some threshold beyond which a model would be capable of recursive self-improvement, given sufficient scaffolding and unwise pressure from an incautious user? Again, the frontier labs might well abstain from this course, especially if they weren’t sure they could trust the new model design created by the current AI; at the least they would likely move slowly and cautiously. I would not expect this of the open-source community, which seems focused on pushing the boundaries of agent scaffolding and incautiously exploring whatever it can.
So, as we get closer to danger, open-weight models take on more safety significance.
Yeah, there are reasons for caution. I think it makes sense for the concerned and unconcerned alike to make numerical forecasts about the costs & benefits of such questions, rather than the current state of everyone just comparing their vibes against each other. This generalizes to other questions, like the benefits of interpretability, advances in safety fine-tuning, deep learning science, and agent foundations.
Obviously such numbers aren’t the end of the line, and, as in biorisk, sometimes they themselves should be kept secret. But having them still seems like a great advance over the status quo.
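As a toy illustration of what such numbers might look like for a single publish-or-withhold decision (every value below is a placeholder, not an estimate I’m defending):

```python
# Toy expected-value comparison for publishing one research result.
# All probabilities and year-counts are placeholders, not actual estimates.
p_helps_alignment = 0.6     # chance the result meaningfully speeds alignment work
align_speedup_years = 0.5   # expected reduction in time-to-alignment if it does

p_helps_capabilities = 0.2  # chance it leaks a real capabilities insight
doom_speedup_years = 0.3    # expected reduction in time-to-doom if it does

# Positive means publishing improves the alignment-before-doom margin on net.
expected_margin_change = (p_helps_alignment * align_speedup_years
                          - p_helps_capabilities * doom_speedup_years)
print(f"Expected change in margin from publishing: {expected_margin_change:+.2f} years")
```

Even a back-of-the-envelope like this makes disagreements legible: people can dispute the inputs instead of trading hunches.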
If anyone would like to collaborate on such a project, my DMs are open (not to say this topic is thereby covered; this isn’t exactly my main wheelhouse).
People who have the ability to clarify in any meaningful way will not do so. You are in a biased environment: the people most willing to publish are those most able to convince themselves their research is safe (e.g., because they don’t understand in detail how to reason about whether it is or not), so they are the ones who will publish. The ability to see far enough ahead should be expected to be rather rare, and most people who think they can tell the exact path ahead of time don’t have the evidence to back their hunches, even if those hunches are correct (which, absent a demonstrated track record, they probably aren’t). Therefore, whoever is making the most progress on real capabilities insights under the name of alignment will make their advances and publish them, since they don’t personally see how it’s an exfohazard. And it won’t be apparent until afterwards that it was capabilities, not alignment.
So just don’t publish anything, and do your work in private. Email it to Anthropic when you know how to create a yellow node. But for god’s sake stop accidentally helping people create green nodes because you can’t see five inches ahead. And don’t send it to a capabilities team before it’s able to guarantee moral alignment hard enough to make a red-proof yellow node!
This seems contrary to how much of science gets done. I expect that if people stopped talking publicly about what they’re working on in alignment, we’d make much less progress, while capabilities would basically run business as usual.
The sort of reasoning you use here, and the fact that my only response to it basically amounts to “well, no, I think you’re wrong; this proposal will slow down alignment too much”, is why I think we need numbers to ground us.