The incremental approach bakes in a few assumptions, namely that there likely won’t be any sharp left turns, that the work you do on smaller systems will apply to the next ones you build, and so on. I think the problem is that we don’t know that this will hold, and that there’s reason to suspect it won’t. And if we do in fact live in a world where our systems undergo sudden and extreme shifts in intelligence, then the iterative approach is imo very risky.
I think “solving alignment” is not exactly the right phrase, but I suspect it’s closer to what we want, here, than “thousands of people incrementally contributing.” When fields are thoroughly confused, thousands of people contributing mostly just amounts to more confusion, especially when everyone is working within a broken paradigm. And often in this situation—when new fields are needed—substantial scientific progress happens because of one or a few individuals. I don’t expect a single person to solve the entire problem by themselves, of course, but I do think it’s plausible that one or a few people will develop a robust underlying theory that marks substantial progress towards “solving alignment.”
Control doesn’t assume “no sharp left turns” or “the work you do on smaller systems will apply to the next ones you build”, though things along these lines do help with control.
(For the most obvious interventions, the work seems very likely to be applicable at multiple model scales in practice.)
This might depend on exactly what you mean by “sharp left turn”. I think people often use the term to point at several different, possibly correlated issues, but I think most of them can in principle be handled with control.
The two assumptions we need that could be broken by something like “sharp left turn”:
Some continuity in capabilities in a small local regime: We do need to assume that we can scale things up by a small amount that makes the AI only somewhat smarter, so that we can run evaluations safely. Or we need some other way of getting slightly continuous capabilities (e.g. by varying how long a given AI runs). However, the actual quantitative amount can be quite small, and we can adjust how we do this depending on our observations. See here for some discussion.
Sample-efficient training for capabilities evaluations that is robust to intentional sandbagging: We need to be able to train our AIs to do tasks in a reasonably sample-efficient way that is robust to AIs wanting to sandbag on the task. If sharp left turn implies “very low sample efficiency because capabilities now come from a non-retargetable process”, then this would cause problems.
In principle, this is a relatively nice problem to work on because you just want to make a number go up and you just want the AI to do as well as if it was trying. We also have some flexibility here because capability evaluation is easier than elicitation (footnote 8).