Advancing the configuration by 100,000 steps (10 minutes) is not qualitatively different from advancing it by roughly 14.4 billion steps (1,000 days); and the runtime is linear in the number of steps.
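At the stated rate (100,000 steps per 10 minutes), the step count for 1,000 days follows directly; a quick sanity check of the arithmetic:

```python
# Steps per minute implied by "100,000 steps in 10 minutes".
steps_per_minute = 100_000 / 10

# 1,000 days expressed in minutes.
minutes_per_day = 24 * 60
total_minutes = 1_000 * minutes_per_day

# Total steps for 1,000 days of thought at the same rate.
total_steps = steps_per_minute * total_minutes
print(f"{total_steps:.3e}")  # 1.440e+10, i.e. about 14.4 billion steps
```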
I think you get distributional shift in the mental configurations you have access to when you run for more steps. This means that, for the ML to line up with the ground truth, you either need training data from those regions of configuration-space, or you need well-characterized dynamics that you could correctly identify by training on 100,000 steps.
Arithmetic has these well-characterized dynamics, for example: if you have the right architecture and train on small multiplication problems, you can also perform well on big multiplication problems, because the underlying steps are the same, just repeated more times. This isn’t true of piecewise-linear approximations to complicated functions, whose approximations are only good in regions where you had lots of training data. (Imagine trying to fit x^3 with a random forest.) If there are different ‘modes of thought’ that humans can employ, you need either complete coverage of those modes of thought or ‘functional coverage’: the response to any strange new mental configuration you enter can be easily predicted from the normal mental configurations you saw in training.
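A random forest’s leaves are piecewise-constant, so outside the training range it can only return values it already saw. A minimal pure-Python stand-in, a nearest-neighbour regressor (a hypothetical helper sketched here because it shares exactly this failure mode, with no ML library needed), makes the point for x^3:

```python
def fit_nearest_neighbour(xs, ys):
    """Memorize training pairs; predict the y of the closest training x.
    Like a tree ensemble, outside the training range this can only
    return values already seen in training."""
    pairs = list(zip(xs, ys))

    def predict(x):
        _, y = min(pairs, key=lambda p: abs(p[0] - x))
        return y

    return predict

# Train on x^3 over [-2, 2].
xs = [i / 10 for i in range(-20, 21)]
ys = [x ** 3 for x in xs]
model = fit_nearest_neighbour(xs, ys)

print(model(1.5))  # 3.375, matching 1.5^3 (inside the training range)
print(model(5.0))  # 8.0 -- the largest training value, nowhere near 5^3 = 125
```

Interpolation works because the training data covers that region; extrapolation fails completely, which is the ‘functional coverage’ problem in miniature.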
Like, consider moving in the opposite direction: if I train a model on a single step from questions, then I probably just have a model that’s able to ‘read’ questions (or even just the starts of questions). Once I want to extend this to doing 100,000 steps, I need not just the ability to read inputs but also to do something interesting with them, which probably requires not just ‘more’ training data from the same distribution, but data from a different, more general distribution.
Hence, the underlying empirical uncertainty that this question sort of asks us to condition on: is there a meaningful difference between what happens in human brains / models trained this way in the first 10 minutes of thought versus the first 1,000 days of thought?
I agree. We can frame this empirical uncertainty more generally by asking: what is the smallest x such that there is no meaningful difference between all the things that can happen in a human brain while thinking about a question for x minutes versus for 1,000 days?
Or rather: what is the smallest x such that ‘learning to generate answers that humans may give after thinking for x minutes’ is not easier than ‘learning to generate answers that humans may give after thinking for 1,000 days’?
I should note that, conditioned on the above scenario, I expect labeled 10-minute-thinking training examples to be at most a tiny fraction of all the training data (counting all the learning that had a role in building the model, including the learning that produced pre-trained weights, etc.). I expect that most of the learning would be either ‘supervised with automatic labeling’ or unsupervised (e.g. ‘predict the next token’), and that a huge amount of human-written text (and code) will be used, some of which is the result of humans thinking for a very long time (e.g. a paper on arXiv that is the result of someone thinking about a problem for a year).