Suppose that at some point in the future, for the first time in history, someone trains an ML model that takes any question as input, and outputs an answer that an ~average human might have given after thinking about it for 10 minutes. Suppose that model is trained without any safety-motivated interventions.
Suppose also that the architecture of that model is such that ‘10 minutes’ is just a parameter, t, that the operator can choose per inference, and there’s no upper bound on it; and the inference runtime increases linearly with t. So, for example, the model could be used to get an answer that a human would have come up with after thinking for 1000 days.
In this scenario, would it make sense to use the model for factored cognition? Or should we consider running this model with t = 1000 days to be no more dangerous than running it many times with t = 10 minutes?
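To make the two usage patterns in the question concrete, here is a minimal sketch (purely illustrative; `model.answer`, the `t_minutes` parameter, and the recursion depth are hypothetical choices, not part of the scenario):

```python
# Hypothetical interface assumed by the scenario: model.answer(question, t_minutes)
# returns the answer a human might have given after thinking for t_minutes,
# and its runtime grows linearly with t_minutes.

def answer_with_one_long_run(model, question):
    # Option 1: a single inference with t = 1,000 days.
    return model.answer(question, t_minutes=1000 * 24 * 60)

def answer_with_factored_cognition(model, question, depth=3):
    # Option 2: many short runs. Each run only ever thinks for 10 minutes;
    # longer 'thinking' comes from recursively decomposing the question.
    if depth == 0:
        return model.answer(question, t_minutes=10)
    subquestions = model.answer(
        "List the subquestions whose answers would settle this question:\n" + question,
        t_minutes=10,
    ).splitlines()
    subanswers = [
        answer_with_factored_cognition(model, sq, depth - 1) for sq in subquestions
    ]
    return model.answer(
        "Answer this question:\n" + question
        + "\nusing these sub-answers:\n" + "\n".join(subanswers),
        t_minutes=10,
    )
```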
---
I think normally the time restriction is used as a cost-saving measure rather than a safety-enhancing one. That is, if we need a billion examples to train the system on, it’s easier to get a billion examples of people thinking for 10 minutes than a billion examples of them thinking for 1000 days. The window just needs to be long enough that meaningful cognitive work gets done (so that you aren’t spending all of your time loading context in and out).
The bit of this that is a safety-enhancing measure is that you have an honesty criterion, where the training procedure should produce a system that is only trying to answer the question posed, not trying to do anything else, not trying to pass coded messages in the answer, and so on. This is more important than a runtime limit, since a runtime limit could be circumvented by passing coded messages (or whatever else lets the computation use more than one cell).
I also think this counts as the ‘Mechanical Turk’ case from this post, which I don’t think people are optimistic about (from an alignment perspective).
---
I think this thought experiment asks us to condition on how thinking works in a way that makes it a little weirder than you’re expecting. If humans naturally use something like factored cognition, then it doesn’t make much difference whether I spend my 1000 days of runtime asking my big question to a ‘single run’ or asking it to a network of ‘small runs’ via a factored cognition scheme. But if humans normally do something very different from factored cognition, then it makes a pretty big difference (suppose the costs of serializing and deserializing state are high, but doing it regularly gives us useful transparency / not-spawning-subsystems guarantees): ‘humans thinking for 10 minutes chained together’ might have very different properties from ‘one human thinking for 1000 days’. And if those do have different properties, it might be hard to train the ‘one human thinking for 1000 days’ system relative to the ‘thinking for 10 minutes’ system, and the fact that one easily extends to the other is evidence that this isn’t how thinking works.
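For concreteness, the ‘chained together’ variant might look like this sketch (all names hypothetical; it assumes each run can be told to write out notes that the next run resumes from, and those serialized notes are also where transparency checks would be applied):

```python
# Hypothetical: approximate 1,000 days of thinking by chaining 10-minute runs,
# serializing the working state as plain text between runs. The serialized
# state is the natural place to look for coded messages / spawned subtasks.

def answer_with_chained_runs(model, question, n_runs=144_000):
    # 144,000 runs x 10 minutes = 1,000 days of cumulative 'thinking'.
    state = "Question: " + question + "\nNotes so far: (none)"
    for _ in range(n_runs):
        state = model.answer(
            "Work on the question below for 10 minutes, then write out notes "
            "that a fresh copy of you could resume from.\n\n" + state,
            t_minutes=10,
        )
    return model.answer(
        "Using the notes below, give your final answer.\n\n" + state,
        t_minutes=10,
    )
```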
---
Thank you! I think I now have a better model of how people think about factored cognition.

> ‘humans thinking for 10 minutes chained together’ might have very different properties from ‘one human thinking for 1000 days’. And if those do have different properties, it might be hard to train the ‘one human thinking for 1000 days’ system relative to the ‘thinking for 10 minutes’ system, and the fact that one easily extends to the other is evidence that this isn’t how thinking works.
In the above scenario I didn’t assume that humans (or the model) use factored cognition when the ‘thinking duration’ is long. Suppose instead that the model is running a simulation of a system that is similar (at some level of abstraction) to a human brain. For example, suppose some part of the model represents a configuration of a human brain, and during inference some iterative process repeatedly advances that configuration by a single “step”. Advancing the configuration by 100,000 steps (10 minutes) is not qualitatively different from advancing it by roughly 14 billion steps (1,000 days); and the runtime is linear in the number of steps.
Generally, one way to make predictions about the final state of complicated physical processes is to simulate them. Solutions that do not involve simulations (or equivalent) may not even exist, or may be less likely to be found by the training algorithm.
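A back-of-the-envelope sketch of the numbers behind that framing (the step rate is just whatever makes 10 minutes come out to 100,000 steps; `advance` is a hypothetical single-step update):

```python
# Step-count arithmetic implied by '100,000 steps = 10 minutes'.
steps_per_second = 100_000 / (10 * 60)            # ~166.7 steps per second
seconds_in_1000_days = 1000 * 24 * 60 * 60        # 86,400,000 seconds
steps_for_1000_days = steps_per_second * seconds_in_1000_days
print(f"{steps_for_1000_days:.3g}")               # ~1.44e+10, i.e. roughly 14 billion

def simulate(config, n_steps, advance):
    # Hypothetical inner loop: runtime is linear in n_steps, and nothing about
    # the loop itself changes between 100,000 and ~14,400,000,000 steps.
    for _ in range(n_steps):
        config = advance(config)
    return config
```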
---
> Advancing the configuration by 100,000 steps (10 minutes) is not qualitatively different from advancing it by roughly 14 billion steps (1,000 days); and the runtime is linear in the number of steps.
I think you get distributional shift in the mental configurations that you have access to when you run for more steps. This means that for the ML to line up with the ground truth, you either need training data from those regions of configuration-space, or you need well-characterized dynamics that you could correctly identify by training on 100,000 steps.
Arithmetic has these well-characterized dynamics, for example; if you have the right architecture and train on small multiplication problems, you can also perform well on big multiplication problems, because the underlying steps are the same, just repeated more times. This isn’t true of piecewise linear approximations to complicated functions, as your approximations will only be good in regions where you had lots of training data. (Imagine trying to fit x^3 with a random forest.) If there are different ‘modes of thought’ that humans can employ, you need either complete coverage of those modes of thought or ‘functional coverage’, in that the response to any strange new mental configurations you enter can be easily predicted from normal mental configurations you saw in training.
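A minimal sketch of that extrapolation point, using scikit-learn’s RandomForestRegressor purely as an illustration (nothing here is specific to the scenario under discussion):

```python
# A random forest fit to x**3 on a narrow training range is accurate inside
# that range but roughly flat outside it, because tree leaves can only
# predict target values that appeared in training.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
x_train = rng.uniform(-2, 2, size=(2000, 1))       # training region: [-2, 2]
forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(x_train, x_train[:, 0] ** 3)

for x in [1.0, 2.0, 5.0, 10.0]:                    # 5 and 10 lie outside the training region
    pred = forest.predict([[x]])[0]
    print(f"x = {x:4.1f}   true = {x**3:7.1f}   forest = {pred:6.1f}")
# Inside [-2, 2] the predictions track x**3 closely; outside, they saturate
# near 2**3 = 8, since no training data covered that region.
```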
Like, consider moving in the opposite direction: if I train a model on just a single step forward from questions, then I probably just have a model that’s able to ‘read’ questions (or even just the starts of questions). Once I want to extend this to doing 100,000 steps, I need the model to not just read inputs but also do something interesting with them, which probably requires not just ‘more’ training data from the same distribution, but data from a different, more general distribution.
Hence, the underlying empirical uncertainty that this question sort of asks us to condition on: is there a meaningful difference between what happens in human brains / models trained this way in the first 10 minutes of thought and the first 1,000 days of thought?
---
> Hence, the underlying empirical uncertainty that this question sort of asks us to condition on: is there a meaningful difference between what happens in human brains / models trained this way in the first 10 minutes of thought and the first 1,000 days of thought?
I agree. We can frame this empirical uncertainty more generally by asking: What is the smallest x such that there is no meaningful difference between all the things that can happen in a human brain while thinking about a question for x minutes vs. for 1,000 days?
Or rather: What is the smallest x such that ‘learning to generate answers that humans may give after thinking for x minutes’ is not easier than ‘learning to generate answers that humans may give after thinking for 1,000 days’?
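One way to write that second formulation compactly (a sketch; D(x) is just a hypothetical stand-in for how hard the corresponding learning problem is, assumed nondecreasing in x, with 1,000 days = 1,440,000 minutes):

```latex
% D(x): hypothetical difficulty of learning to generate answers that humans
% may give after thinking for x minutes; assumed nondecreasing in x.
% The question asks for the smallest x at which this difficulty has already
% reached the 1,000-day level (1,000 days = 1,440,000 minutes):
x^{*} = \min\{\, x : D(x) \approx D(1{,}440{,}000) \,\}
```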
I should note that, conditioned on the above scenario, I expect that labeled 10-minute-thinking training examples would be at most a tiny fraction of all the training data (when considering all the learning that had a role in building the model, including learning that produced pre-trained weights, etc.). I expect that most of the learning would be either ‘supervised with automatic labeling’ or unsupervised (e.g. ‘predict the next token’), and that a huge amount of text (and code) that humans wrote would be used, some of which would be the result of humans thinking for a very long time (e.g. a paper on arXiv that is the result of someone thinking about a problem for a year).