The further imitation + light RL goes (competitively), the less important other, less safe training approaches are.
I’d probably add: + differentially-transparent scaffolding (externalized-reasoning-like; e.g. CoT, in-context learning [though I’m a bit more worried about the amount of parallel compute even in one forward pass with very long context windows], [text-y] RAG, many tools, explicit task decomposition, sampling-and-voting). I suspect this combo adds up to a lot, if e.g. labs were cautious enough and there were less / no race dynamics, etc. (I think I’m at > 90% that you’d get all the way to human obsolescence.)
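To make one of these scaffolding pieces concrete, here’s a minimal sketch of sampling-and-voting (majority vote over independently sampled CoT answers). The `generate` callable is a hypothetical stand-in for whatever model API would actually be used, not a reference to any particular implementation:

```python
from collections import Counter

def sample_and_vote(generate, prompt, n_samples=16):
    """Majority vote over independently sampled chain-of-thought answers.

    `generate(prompt) -> (reasoning, answer)` is a hypothetical stand-in for a
    model call that returns its full CoT plus a short final answer.
    """
    answers = [generate(prompt)[1].strip() for _ in range(n_samples)]
    # Keep the most common final answer; the sampled CoTs remain inspectable text.
    (winner, count), = Counter(answers).most_common(1)
    return winner, count / n_samples
```

The relevant point is that the aggregation logic lives in plain code and text, so the extra capability comes from a transparent, inspectable layer rather than from deeper opaque computation inside the model.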
What do you think about the fact that AlphaStar needed a massive amount of RL to reach performance somewhat worse than that of the best humans? It’s not a huge amount of evidence, and I think intuitions from SOTA LLMs are more informative overall, but it’s still something interesting. (There is a case that AlphaStar is more analogous, as it involves doing a long-range task and reaching performance comparable to top-tier human professionals, which LLMs arguably don’t do in any domain.)
I’m not sure how good of an analogy AlphaStar is, given e.g. its specialization, the relatively easy availability of a reward signal, and the comparatively much scarcer availability and use of imitation data (including for transfer from close domains) vs. the LLM case. And even so, AlphaStar was still bootstrapped with imitation learning.
One specific way to operationalize this is how much effective compute improvement you get from RL on code. For current SOTA models (e.g. Claude 3), I would guess a central estimate of a 2-3x effective compute multiplier from RL, though I’m extremely unsure. (I have no special knowledge here, just a wild guess based on eyeballing a few public things.) (Perhaps the DeepSeek code paper would allow for finding better numbers?)
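Just to spell out what an “effective compute multiplier” means here, a toy sketch under an assumed power-law scaling curve (the functional form and constants are illustrative assumptions, not estimates from the discussion): a k-fold multiplier from RL means the post-RL model at compute C performs about as well as the base model trained with k·C.

```python
def toy_loss(compute_flops, a=1.0, b=0.05):
    # Toy power-law scaling curve L(C) = a * C**(-b); constants are made up.
    return a * compute_flops ** (-b)

base_compute = 1e24  # FLOPs, arbitrary
for multiplier in (2, 3):
    # A k-fold effective compute multiplier: the RL-tuned model at C matches
    # the base model trained with k * C on the assumed scaling curve.
    matched = toy_loss(multiplier * base_compute)
    print(f"{multiplier}x: base loss {toy_loss(base_compute):.4f} -> matched loss {matched:.4f}")
```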
This review of effective compute gains without retraining might come closest to something like an answer, but it’s been a while since I last looked at it.
FWIW, I think a high fraction of the danger from the exact setup I outlined isn’t imitation, but is instead deep serial (and recurrent) reasoning in non-interpretable media.
I think I (still) largely hold the intuition mentioned here, that deep serial (and recurrent) reasoning in non-interpretable media won’t be (that much more) competitive versus more chain-of-thought-y / tools-y-transparent reasoning, at least before human obsolescence. E.g. based on the tradeoff between [CoT] length complexity and computational complexity from Auto-Regressive Next-Token Predictors are Universal Learners, and on arguments like those in Before smart AI, there will be many mediocre or specialized AIs, I’d expect the first AIs which can massively speed up AI safety R&D to probably be somewhat subhuman-level in a forward pass (including in terms of serial depth / recurrence) and to compensate for that with CoT, explicit task decompositions, sampling-and-voting, etc. This seems borne out by other results too, e.g. More Agents Is All You Need (on sampling-and-voting) or Sub-Task Decomposition Enables Learning in Sequence to Sequence Tasks (‘We show that when concatenating intermediate supervision to the input and training a sequence-to-sequence model on this modified input, unlearnable composite problems can become learnable. We show that this is true for any family of tasks which on the one hand, are unlearnable, and on the other hand, can be decomposed into a polynomial number of simple sub-tasks, each of which depends only on O(1) previous sub-task results’).
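And a minimal sketch of the sub-task decomposition idea in scaffolding form (rather than as the training-time intermediate supervision from the quoted paper): each sub-task only sees the original problem plus the last O(1) sub-task results, and every intermediate result stays as plain text. `solve_subtask` and the sub-task names are hypothetical placeholders:

```python
def solve_decomposed(solve_subtask, problem, subtask_names, window=1):
    """Solve a composite problem as a chain of simple sub-tasks.

    `solve_subtask(name, problem, recent_results)` is a hypothetical model call.
    Each step conditions only on the original problem and the last `window`
    results (the O(1) dependence), and the whole trace stays human-readable.
    """
    trace = []
    for name in subtask_names:
        recent = trace[-window:]
        trace.append((name, solve_subtask(name, problem, recent)))
    return trace  # last entry holds the final answer; the rest is the transparent trace
```

The appeal, safety-wise, is that the serial depth lives in the chain of legible intermediate results rather than inside a single forward pass.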