Copying over a response I wrote on Twitter to Emmett Shear, who argued that “it’s just a bad way to solve the problem. An ever more powerful and sophisticated enemy? … If the process continues you just lose eventually”.
I think there are (at least) two strong reasons to like this approach:
1. It’s complementary with alignment.
2. It’s iterative and incremental. The frame where you need to just “solve” alignment is often counterproductive. When thinking about control you can focus on gradually ramping up from setups that would control human-level AGIs, to setups that would control slightly superhuman AGIs, to…
As one example of this: as you get increasingly powerful AGI you can use it to identify more and more vulnerabilities in your code. Eventually you’ll get a system that can write provably secure code. Ofc that’s still not a perfect guarantee, but if it happens before the level at which AGI gets really dangerous, that would be super helpful.
This is related to a more general criticism I have of the P(doom) framing: that it’s hard to optimize because it’s a nonlocal criterion. The effects of your actions will depend on how everyone responds to them, how they affect the deployment of the next generation of AIs, etc. An alternative framing I’ve been thinking about: the g(doom) framing. That is, as individuals we should each be trying to raise the general intelligence threshold at which bad things happen.
This is much more tractable to optimize! If I make my servers 10% more secure, then maybe an AGI needs to be 1% more intelligent in order to escape. If I make my alignment techniques 10% better, then maybe the AGI becomes misaligned 1% later in the training process.
You might say: “well, what happens after that?” But my point is that, as individuals, it’s counterproductive to each try to solve the whole problem ourselves. We need to make contributions that add up (across thousands of people) to decreasing P(doom), and I think approaches like AI control significantly increase g(doom) (the level of general intelligence at which you get doom), thereby buying more time for automated alignment, governance efforts, etc.
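To gesture at why g(doom) is more locally optimizable than P(doom), here is a toy formalization (the notation is introduced purely for illustration, not anything precise):

$$P(\text{doom}) \;\approx\; \Pr\big[\, g_{\text{AI}} > g_{\text{doom}} \,\big], \qquad g_{\text{doom}} \;=\; g_0 + \sum_i \Delta g_i$$

where $g_0$ is the baseline threshold, each $\Delta g_i$ is the increment contributed by one intervention (more secure servers, better alignment techniques, …), and $g_{\text{AI}}$ is the capability level that actually gets reached. Each $\Delta g_i$ is something an individual can estimate and improve locally, whereas $P(\text{doom})$ also depends on $g_{\text{AI}}$, i.e. on how everyone else acts.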
This is a great comment, especially the point about P(doom) being somewhat counterproductive because it is nonlocal, and I think it deserves some signal boosting.
The incremental approach bakes in a few assumptions, namely that there likely won’t be any sharp left turns, that the work you do on smaller systems will apply to the next ones you build, and so on. I think the problem is that we don’t know that this will hold, and that there’s reason to suspect it won’t. And if we do in fact live in a world where our systems undergo sudden and extreme shifts in intelligence, then the iterative approach is imo very risky.
I think “solving alignment” is not exactly the right phrase, but I suspect it’s closer to what we want, here, than “thousands of people incrementally contributing.” When fields are thoroughly confused, thousands of people contributing mostly just amounts to more confusion, especially when everyone is working within a broken paradigm. And often in this situation—when new fields are needed—substantial scientific progress happens because of one or a few individuals. I don’t expect a single person to solve the entire problem by themselves, of course, but I do think it’s plausible that one or a few people will develop a robust underlying theory that marks substantial progress towards “solving alignment.”
Control doesn’t assume “no sharp left turns” or “the work you do on smaller systems will apply to the next ones you build”, though things along these lines do help with control.
(For the most obvious interventions, the work seems very likely to be applicable at multiple model scales in practice.)
This might depend on exactly what you mean by “sharp left turn”; I think people often use the term to point at several different, possibly correlated issues, but I think most of them can in principle be handled with control.
The two assumptions we need that could be broken by something like a “sharp left turn”:
Some continuity in capabilities in a small local regime: We do need to assume that we can scale things up by a small amount that only makes the AI somewhat smarter, so that we can run evaluations safely. Or we need some other way of getting slightly continuous capabilities (e.g. by varying how long a given AI runs). However, the actual quantitative amount can be quite small, and we can adjust how we do this depending on our observations. See here for some discussion.
Sample-efficient training for capability evaluations that is robust to intentional sandbagging: we need to be able to train our AIs to do tasks in a reasonably sample-efficient way that is robust to AIs wanting to sandbag on the task. If a sharp left turn implies “very low sample efficiency because capabilities now come from a non-retargetable process”, then this would cause problems.
In principle, this is a relatively nice problem to work on because you just want to make a number go up, and you just want the AI to do as well as if it were trying (a rough sketch of what this could look like is below). We also have some flexibility here because capability evaluation is easier than elicitation (footnote 8).
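To illustrate the “make a number go up” point, here is a minimal sketch of what a sandbagging-robust capability evaluation loop might look like. Everything in it (the `evaluate` and `finetune_on_demos` callables, the gap threshold) is a hypothetical stand-in, not a reference to any particular codebase:

```python
# Illustrative sketch only: `evaluate`, `finetune_on_demos`, and the gap
# threshold are hypothetical stand-ins, not any particular library's API.
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class ElicitationResult:
    zero_shot_score: float       # performance when we simply ask the model
    elicited_score: float        # performance after fine-tuning on a few demos
    sandbagging_suspected: bool  # a large gap suggests the model was holding back

def elicit_capability(
    evaluate: Callable[[], float],                # runs the eval suite, returns a score
    finetune_on_demos: Callable[[List[Tuple[str, str]]], None],  # SFT on (prompt, answer) pairs
    demos: List[Tuple[str, str]],
    gap_threshold: float = 0.1,
) -> ElicitationResult:
    """Capability evaluation that is somewhat robust to intentional sandbagging:
    measure zero-shot performance, then fine-tune on a small set of trusted
    demonstrations and re-measure. The post-training score is the capability
    estimate we act on, since training directly against the task makes it harder
    for the model to keep underperforming."""
    zero_shot = evaluate()
    finetune_on_demos(demos)
    elicited = evaluate()
    return ElicitationResult(
        zero_shot_score=zero_shot,
        elicited_score=elicited,
        sandbagging_suspected=(elicited - zero_shot) > gap_threshold,
    )
```

The design choice worth flagging is that the score we act on is the post-fine-tuning one: even if the model underperforms when simply asked, the hope is that training on a handful of trusted demonstrations pushes it toward performing as well as if it were trying.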