A few places where this argument breaks down...
First and most important: we can make a weaker model out of a stronger model if we know in advance that we want to do so, and actually try to, and make sure the stronger system does not have a chance to stop us (e.g. we don’t run it). If there’s an agentic superhuman AGI already undergoing takeoff, then “make it weaker” is not really an option. Even if there’s an only-humanish-level agentic AGI already running, if that AGI can easily spin up a new instance of itself without us noticing before we turn it off, or arrange for someone else to spin up a new instance, then “make it weaker” isn’t really an option. Plausibly even a less-than-human-level agent could pull that off; infosec does usually have an attacker’s advantage.
(Subproblem 1: on some-but-not-all threat models, a superhuman AGI is already a threat when it’s in training. So plausibly “don’t run the strong model” wouldn’t even be enough; we’d have to not even train the strong model.
Subproblem 2 (orthogonal to subproblem 1): looking at a strong model and figuring out how aligned/corrigible/etc it is, in a way robust enough to generalize well to even moderately strong capabilities, is itself one of the hardest open problems in alignment. So in order for a plan involving “build strong model and make it weaker” to help, the plan would have to weaken the strong model unconditionally, not check whether the strong model has problems and then weaken it. At which point… why use a stronger model in the first place? There are still some reasons, but a lot fewer.
Put subproblems 1 & 2 together, and we’re basically back to “don’t use a strong model in the first place”—i.e. unconditionally do not train a strong model.)
Second: one would need to know the relevant way in which to weaken the model. “Corrupting n% of its inputs/outputs” just doesn’t matter that much on most threat models I can think of—for instance, it doesn’t really matter at all for deception.
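To make that concrete, here is a toy sketch (my own illustration, not anything proposed in this thread) of what “corrupting n% of a model’s outputs” could look like; the function name and filler token are made up. Random corruption like this degrades raw capability a bit across the board, but a deceptive model’s outputs still come through ~95% intact, which is why I don’t think it buys much on that threat model.

```python
import random

def corrupt_fraction(output_tokens, n_percent=5.0, filler="[corrupted]"):
    """Toy sketch of 'weaken the model by corrupting n% of its outputs'.

    Randomly replaces roughly n% of the tokens in a sampled completion
    with a filler token. Purely illustrative; not anyone's actual proposal.
    """
    return [filler if random.random() < n_percent / 100.0 else tok
            for tok in output_tokens]

# Example: a hypothetical sampled completion with ~5% of tokens corrupted.
completion = "Sure , here is the plan you asked for ...".split()
print(corrupt_fraction(completion, n_percent=5.0))
```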
Third: in order for this argument to go through, one does need to actually use the mechanism from the argument, i.e. weaken the stronger model. Without necessarily accusing you specifically of anything, when I hear this argument, my gut expectation is that the arguer’s next step will be to say “great, so let’s assume that alignment gets easier as models get stronger” and then completely forget about the part where their plan is supposed to involve weakening the model somehow. For instance, I could imagine someone next arguing “well, today’s systems are already reasonably aligned, and it only gets easier as models get stronger, so we should be fine!” without realizing/considering that this argument only works insofar as they actually expect all AI labs to intentionally weaken their own models (or do something strictly better for alignment than that, despite subproblem 2 above). So if someone made this argument to me in the context of a broader plan, I’d be on the lookout for that.
(Meta-note: I’m not saying I endorse the premises of all these counterarguments. These are just some counterarguments I see, under some different models.)
I’m curious, do you actually endorse subproblem 1?
Under the current ML paradigm (transformers), the model becoming dangerous during training seems extremely implausible to me.
I could imagine an ML paradigm where subproblem 1 was real (for example, training an RL agent to hack computers, which then unsandboxes itself). But it seems like it would be really obvious that you were doing something dangerous beforehand.
I don’t personally expect that subproblem 1, in its purest form, is relevant to the exact LLM architectures used today—i.e. stacked transformers trained mainly on pure text prediction. On the other hand, I’m not extremely confident that subproblem 1 isn’t relevant; I wouldn’t particularly want to rely on subproblem 1’s irrelevance as a foundational assumption.
Also, I definitely do not expect that it will be really obvious in advance when someone changes the core architecture enough that subproblem 1 becomes relevant. Really obvious that we’re not just training stacked transformers on pure text prediction, yes. Really obvious that we’re doing something dangerous, no. The space of possibilities is large, and predicting how different setups behave in advance is not easy.
All that said, I do generally consider subproblem 2 the more relevant one.