Thanks for sharing this draft! I’m going to try to make lots of different comments as I go along, rather than one huge comment.
[edit: page 10 calls this the “most important thread of further research”; the downside of writing as I go! For posterity’s sake, I’ll leave the comment.]
Pages 8 and 9 of part 1 talk about “effective horizon length”, and make the claim:
Prima facie, I would expect that if we modify an ML problem so that effective horizon length is doubled (i.e., it takes twice as much data on average to reach a certain level of confidence about whether a perturbation to the model improved performance), the total training data required to train a model would also double. That is, I would expect training data requirements to scale linearly with effective horizon length as I have defined it.
I’m curious where ‘linearly’ came from; my sense is that “effective horizon length” is the equivalent of “optimal batch size”, which I would have expected to be a weirder function of training data size than ‘linear’. I don’t have a great handle on the ML theory here, though, and it might be substantially different between classification (where I can make batch-of-the-envelope estimates for this sort of thing) and RL (where it feels like it’s a component of a much trickier system with harder-to-predict connections).
Quite possibly you talked with some ML experts and their sense was “linearly”, and it makes sense to roll with that; it also seems quite possible that the thing to do here is have uncertainty over functional forms. That is, maybe the effective horizon scales linearly, or maybe it scales exponentially, or maybe it scales logarithmically, or inverse square root, or whatever. This would help double-check that the assumption of linearity isn’t doing significant work, and if it is, point to a potentially promising avenue of theoretical ML research.
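To make the "uncertainty over functional forms" idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the names `FORMS` and `data_multipliers` are hypothetical, and each candidate form is normalized (arbitrarily) so that it gives a multiplier of 1 at a horizon of 1. It is not the report's model, just a way to see how far the candidate forms diverge.

```python
import math

# Hypothetical candidate forms for how training data might scale with
# effective horizon length h, relative to a baseline of h = 1.
# The normalizations are illustrative assumptions, not taken from the report.
FORMS = {
    "linear": lambda h: h,
    "logarithmic": lambda h: 1.0 + math.log(h),
    "inverse_sqrt": lambda h: 1.0 / math.sqrt(h),
    "exponential": lambda h: 2.0 ** (h - 1.0),
}


def data_multipliers(h):
    """Return the training-data multiplier each candidate form implies at horizon h."""
    return {name: form(h) for name, form in FORMS.items()}


if __name__ == "__main__":
    for h in (1.0, 2.0, 4.0, 8.0):
        row = ", ".join(f"{name}={value:.2f}" for name, value in data_multipliers(h).items())
        print(f"h={h:g}: {row}")
```

If the spread between forms is small over the horizons that matter, the linearity assumption probably isn't doing much work; if it is large, that is where the uncertainty lives.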
[As a broader point, I think this ‘functional form uncertainty’ is a big deal for my timelines estimates. A lot of people (rightfully!) dismissed the standard RL algorithms of 5 years ago for making AGI because of exponential training data requirements, but my sense is that further algorithmic improvement is mostly not “it’s 10% faster” but “the base of the exponent is smaller” or “it’s no longer exponential”, which might change whether or not it makes sense to dismiss them.]
Thanks! Agree that functional form uncertainty is a big deal here; I think that implicitly this uncertainty is causing me to up-weight Short Horizon Neural Network more than I otherwise would, and also up-weight “Larger than all hypotheses” more than I otherwise would.
With that said, I do predict that in clean artificial cases (which may or may not be relevant), we could demonstrate linear scaling. E.g., consider the case of inserting a frame of static or a blank screen in between every normal frame of an Atari game or StarCraft game. I’d expect that modifying the games in this way would straightforwardly double training computation requirements.
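A rough sketch of the blank-frame modification, for concreteness. The helper `interleave_blank` is hypothetical and works on any iterable of frames; it only shows that the modified game emits twice as many frames for the same informative content. Whether training computation actually doubles as a result is the prediction above, not something this code demonstrates.

```python
def interleave_blank(frames, blank=None):
    """Yield each original frame followed by one uninformative blank frame."""
    for frame in frames:
        yield frame
        yield blank


if __name__ == "__main__":
    original = ["f0", "f1", "f2"]
    modified = list(interleave_blank(original, blank="BLANK"))
    print(modified)
    print(len(modified) / len(original))
```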