“Training setups where we train generally powerful AIs with deep serial reasoning (similar to the internal reasoning in a human brain) for an extremely long time on rich outcomes based RL environment until these AIs learn how to become generically agentic and pursue specific outcomes in a wide variety of circumstances.”
My intuition goes something like: this doesn’t matter that much if e.g. it happens (sufficiently) after you’d get ~human-level automated AI safety R&D with safer setups, e.g. imitation learning and no/less RL fine-tuning. And I’d expect, e.g. based on current scaling laws, but also on theoretical arguments about the difficulty of imitation learning vs. of RL, that the most efficient way to gain new capabilities will still be imitation learning at least all the way up to very close to human-level. Then, the closer you get to ~human-level automated AI safety R&D with just imitation learning, the less of a ‘gap’ you’d need to ‘cover for’ with e.g. RL. And the less RL fine-tuning you might need, the less likely it might be that the weights / representations change much (e.g. they don’t seem to change much with current DPO). This might all be conceptually operationalizable in terms of effective compute.
this doesn’t matter that much if e.g. it happens (sufficiently) after you’d get ~human-level automated AI safety R&D with safer setups, e.g. imitation learning and no/less RL fine-tuning.
Yep. The way I would put this:
It barely matters if you transition to this sort of architecture well after human obsolescence.
The further imitation + light RL (competitively) goes, the less important other, less safe training approaches are.
I’d expect [...] that the most efficient way to gain new capabilities will still be imitation learning at least all the way up to very close to human-level
What do you think about the fact that to reach somewhat worse than best human performance, AlphaStar needed a massive amount of RL? It’s not a huge amount of evidence and I think intuitions from SOTA LLMs are more informative overall, but it’s still something interesting. (There is a case that AlphaStar is more analogous, as it involves doing a long-range task and reaching performance comparable to top-tier human professionals, which LLMs arguably don’t do in any domain.)
Also, note that even if there is a massive amount of RL, it could still be the case that most of the learning is from imitation (or that most of the learning is from self-supervised (e.g. prediction) objectives which are part of RL).
This might all be conceptually operationalizable in terms of effective compute.
One specific way to operationalize this is how much effective compute improvement you get from RL on code. For current SOTA models (e.g. Claude 3), I would guess a central estimate of a 2-3x effective compute multiplier from RL, though I’m extremely unsure. (I have no special knowledge here, just a wild guess based on eyeballing a few public things.) (Perhaps the DeepSeek code paper would allow for finding better numbers?)
safer setups, e.g. imitation learning and no/less RL fine-tuning
FWIW, I think a high fraction of the danger from the exact setup I outlined isn’t imitation, but is instead deep serial (and recurrent) reasoning in non-interpretable media.
The further imitation + light RL (competitively) goes, the less important other, less safe training approaches are.
+ differentially-transparent scaffolding (externalized reasoning-like; e.g. CoT, in-context learning [though I’m a bit more worried about the amount of parallel compute even in one forward pass with very long context windows], [text-y] RAG, many tools, explicit task decomposition, sampling-and-voting), I’d probably add; I suspect this combo adds up to a lot, if e.g. labs were cautious enough and there were less / no race dynamics, etc. (I think I’m at > 90% that you’d get all the way to human obsolescence).
What do you think about the fact that to reach somewhat worse than best human performance, AlphaStar needed a massive amount of RL? It’s not a huge amount of evidence and I think intuitions from SOTA LLMs are more informative overall, but it’s still something interesting. (There is a case that AlphaStar is more analogous, as it involves doing a long-range task and reaching performance comparable to top-tier human professionals, which LLMs arguably don’t do in any domain.)
I’m not sure how good of an analogy AlphaStar is, given e.g. its specialization, the relatively easy availability of a reward signal, and the comparatively much lower availability and use of imitation data (including for transfer from close domains) vs. the LLM case. And also, even AlphaStar was bootstrapped with imitation learning.
One specific way to operationalize this is how much effective compute improvement you get from RL on code. For current SOTA models (e.g. Claude 3), I would guess a central estimate of a 2-3x effective compute multiplier from RL, though I’m extremely unsure. (I have no special knowledge here, just a wild guess based on eyeballing a few public things.) (Perhaps the DeepSeek code paper would allow for finding better numbers?)
This review of effective compute gains without retraining might come closest to something like an answer, but it’s been a while since I last looked at it.
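For concreteness, here is one possible reading of the ‘effective compute multiplier’ being discussed, as a minimal sketch. It assumes (my assumption, not something claimed in the thread) that performance follows a simple power law in training compute, loss = a * C^(-b), and asks how much more compute the imitation-only recipe would need to match the post-RL model. The constants and function names are made up for illustration and are not fitted to any real model.

```python
def loss_from_compute(compute, a, b):
    """Assumed power-law scaling curve: loss = a * compute ** (-b)."""
    return a * compute ** (-b)


def effective_compute_multiplier(base_compute, observed_loss, a, b):
    """Factor by which base_compute must grow for the base recipe to reach observed_loss."""
    # Invert the assumed power law to get the compute that would reach observed_loss.
    equivalent_compute = (observed_loss / a) ** (-1.0 / b)
    return equivalent_compute / base_compute


# Made-up constants, purely illustrative (not fitted to any real model).
A, B = 10.0, 0.05
BASE_COMPUTE = 1e24  # hypothetical FLOP spent on the imitation-only model

# Pretend RL on code moved the model to the loss that an imitation-only run
# would reach with 2.5x the compute (the middle of the 2-3x guess).
loss_after_rl = loss_from_compute(2.5 * BASE_COMPUTE, A, B)
multiplier = effective_compute_multiplier(BASE_COMPUTE, loss_after_rl, A, B)
print(f"effective compute multiplier: {multiplier:.2f}")
```

In practice the hard part is the measurement (which benchmark, which scaling fit), not the arithmetic; the sketch only pins down what quantity a ‘2-3x’ guess would refer to under this reading.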
FWIW, I think a high fraction of the danger from the exact setup I outlined isn’t imitation, but is instead deep serial (and recurrent) reasoning in non-interpretable media.
I think I (still) largely hold the intuition mentioned here, that deep serial (and recurrent) reasoning in non-interpretable media won’t be (that much more) competitive versus more chain-of-thought-y / tools-y-transparent reasoning, at least before human obsolescence. E.g. based on the [CoT] length complexity vs. computational complexity tradeoff from ‘Auto-Regressive Next-Token Predictors are Universal Learners’ and on arguments like those in ‘Before smart AI, there will be many mediocre or specialized AIs’, I’d expect the first AIs which can massively speed up AI safety R&D to be probably somewhat subhuman-level in a forward pass (including in terms of serial depth / recurrence) and to compensate for that with CoT, explicit task decompositions, sampling-and-voting, etc. This seems borne out by other results too, e.g. ‘More Agents Is All You Need’ (on sampling-and-voting) or ‘Sub-Task Decomposition Enables Learning in Sequence to Sequence Tasks’ (‘We show that when concatenating intermediate supervision to the input and training a sequence-to-sequence model on this modified input, unlearnable composite problems can become learnable. We show that this is true for any family of tasks which on the one hand, are unlearnable, and on the other hand, can be decomposed into a polynomial number of simple sub-tasks, each of which depends only on O(1) previous sub-task results’).
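To make ‘sampling-and-voting’ concrete, here is a minimal sketch of the general idea: draw several independent answers and return the most common one. It is not the exact procedure from any of the papers mentioned; `noisy_solver` is a hypothetical stand-in for a single model call, and its 60% per-sample accuracy is an invented number.

```python
import random
from collections import Counter


def sample_and_vote(sample_answer, n_samples, rng):
    """Draw n_samples answers and return (majority answer, its vote share)."""
    votes = Counter(sample_answer(rng) for _ in range(n_samples))
    answer, count = votes.most_common(1)[0]
    return answer, count / n_samples


def noisy_solver(rng, p_correct=0.6):
    """Hypothetical stand-in for one model call: right 60% of the time,
    otherwise one of three wrong answers chosen uniformly at random."""
    if rng.random() < p_correct:
        return "42"
    return rng.choice(["41", "43", "44"])


rng = random.Random(0)
answer, share = sample_and_vote(noisy_solver, 501, rng)
print(answer, round(share, 2))
```

The point relevant to the discussion is that every vote here is an externally visible output, so the extra reliability comes from transparent aggregation rather than from deeper serial computation inside a single forward pass.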
Currently, most capabilities indeed seem to come from pre-training, and fine-tuning only seems to ‘steer’ them / ‘wrap them around’; to the degree that even in-context learning can be competitive at this steering; similarly, ‘on understanding how reasoning emerges from language model pre-training’.