Here’s an example I’ve been thinking about today to investigate the phenomenon of re-using human models.
Suppose that the “right” way to answer questions is fθ1. And suppose that a human is a learned model Hθ2 trained by gradient descent to approximate fθ1 (subject to architectural and computational constraints). This model is very good on distribution, but we expect it to fail off distribution. We want to train a new neural network to approximate fθ1, without inheriting the human’s off-distribution failures (though the new network may have off-distribution failures of its own).
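To make the setup concrete, here’s a minimal toy version (my own illustration, with made-up functions and numbers): fθ1 is a fixed nonlinear “right” answerer, and Hθ2 is a smaller model fit to it by gradient descent, so it matches well on the training distribution and diverges off it.

```python
# A toy version of this setup, assuming f_theta1 is the fixed "right" answerer and the
# human H_theta2 is a smaller model fit to it by gradient descent. Everything here
# (the functions, the numbers, the distributions) is illustrative.
import numpy as np

theta1 = np.array([2.0, 0.3, -0.5])              # parameters of the "right" answerer

def f_theta1(x):
    # the "right" way to answer question x: some fixed nonlinear function
    return np.tanh(theta1[0] * x) + theta1[1] * x + theta1[2]

def H(theta2, x):
    # the human: a smaller (here, linear) model with parameters theta2
    return theta2[0] * x + theta2[1]

# Train H_theta2 by gradient descent to match f_theta1 on the training distribution
# (questions near 0, where the nonlinearity barely matters).
rng = np.random.default_rng(0)
theta2 = np.zeros(2)
for _ in range(2000):
    x = rng.uniform(-0.2, 0.2, size=64)          # on-distribution questions
    err = H(theta2, x) - f_theta1(x)
    theta2 -= 0.5 * np.array([(err * x).mean(), err.mean()])

# Very good on distribution, but fails off distribution, where the approximation breaks.
print("on-distribution error:", abs(H(theta2, 0.2) - f_theta1(0.2)))   # ~0.01
print("off-distribution error:", abs(H(theta2, 5.0) - f_theta1(5.0)))  # ~9
```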
The problem is that our model needs to learn the exact parameters θ2 of the human model in order to predict other aspects of human behavior. The simplest case is that we sometimes open up human brains and observe θ2 directly.
Once we’ve learned θ2 it is very easy to learn the question-answering policy Hθ2. So we’re worried that our model will do that rather than learning the additional parameters θ1 to implement fθ1.
Intuitively there is a strong connection between fθ1 and Hθ2. After all, θ2 is optimized to make them nearly equal on the training distribution. If you understood the dynamics of neural network training, it would likely be possible to essentially reconstruct fθ1 from θ2, i.e. the complexity of specifying both θ1 and θ2 is essentially the same as the complexity of specifying only θ2.
But it’s completely unclear how to jointly represent fθ1 and Hθ2 using some parameters θ3 of similar size to θ2. So prima facie there is a strong temptation to just reuse Hθ2.
How much hope is there for jointly representing fθ1 and Hθ2?
The most obvious representation in this case is to first specify fθ1, and then actually model the process of gradient descent that produces Hθ2. This runs into a few problems:
1. Actually running gradient descent to find θ2 is too expensive to do at every datapoint—instead we learn a hypothesis that does a lot of its work “up front” (shared across all the datapoints). I don’t know what that would look like. The naive ways of doing it (redoing the shared initial computation for every batch) only work for very large batch sizes, which may well be above the critical batch size. If this were the only problem I’d feel pretty optimistic about spending a bunch of time thinking about it.
2. Specifying “the process that produced Hθ2” requires specifying the initialization ˜θ2, which is as big as θ2. That said, the fact that the learned θ2 also contains a bunch of information about θ1 means that it can’t contain perfect information about the initialization ˜θ2, i.e. multiple initializations lead to exactly the same final state. So that suggests a possible out: we can start with an “initial” initialization ˜θ2⁰, and then learn ˜θ2 by gradient descent. The fact that many different values of ˜θ2 would work suggests that it should be easier to find one of them; intuitively, if we set up training just right, it seems like we may be able to get all the bits back.
3. Running the gradient descent to find θ2 even a single time may be much more expensive than the rest of training. That is, human learning (perhaps extended over biological or cultural evolution) may take much more time than machine learning. If this is the case, then any approach that relies on reproducing that learning is completely doomed.
4. Similar to problem 2, the mutual information between θ2 and the learning process that produced θ2 must also be kind of low—there are only |θ2| bits of mutual information to go around between the learning process, its initialization, and θ1. But exploiting this structure seems really hard if there aren’t actually any fast learning processes that lead to the same conclusion.
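To make the “most obvious representation” concrete in the toy setting, here’s a sketch in which the hypothesis stores θ1 plus a training recipe (a seed standing in for the initialization ˜θ2, a learning rate, a step count) and reconstructs θ2 by rerunning the human’s gradient descent. The rerun inside the hypothesis is exactly the cost that problems 1 and 3 point at; everything below is illustrative.

```python
# Sketch of the "most obvious representation" in the toy setting above: describe
# (f_theta1, H_theta2) by storing theta1 plus a training recipe, and recover theta2 by
# rerunning the human's gradient descent rather than storing theta2 itself.
import numpy as np

theta1 = np.array([2.0, 0.3, -0.5])

def f_theta1(x):
    return np.tanh(theta1[0] * x) + theta1[1] * x + theta1[2]

def reconstruct_theta2(f, seed=0, steps=2000, lr=0.5):
    """Rerun the human's training against f to recover theta2 from the recipe alone."""
    rng = np.random.default_rng(seed)
    theta2 = 0.01 * rng.normal(size=2)           # the initialization ~theta2
    for _ in range(steps):
        x = rng.uniform(-0.2, 0.2, size=64)
        err = theta2[0] * x + theta2[1] - f(x)
        theta2 -= lr * np.array([(err * x).mean(), err.mean()])
    return theta2

# Rerunning the same recipe gives the same theta2 every time, so (theta1, recipe) pins
# down H_theta2 without spending ~|theta2| extra bits on the parameters themselves.
print(reconstruct_theta2(f_theta1))
print(reconstruct_theta2(f_theta1))
```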
My current take is that even in the case where θ2 was actually produced by something like SGD, we can’t actually exploit that fact to produce a direct, causally-accurate representation of θ2.
That’s kind of similar to what happens in my current proposal though: instead we use the learning process embedded inside the broader world-model learning. (Or a new learning process that we create from scratch to estimate the specialness of θ2, as remarked in the sibling comment.)
So then the critical question is not “do we have enough time to reproduce the learning process that led to θ2?” but “can we directly learn Hθ2 as an approximation to fθ1?” If we are able to do this in any way, then we can use that to help compress θ2. In the other proposal, we can use it to help estimate the specialness of θ2 in order to determine how many bits we get back—it’s starting to feel like these things aren’t so different anyway.
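As a toy illustration of what “using that to help compress θ2” could mean (made-up priors and precisions, just for the comparison): if our own fit to fθ1 already pins down roughly where θ2 must lie, then writing down θ2 only costs the bits needed for the residual.

```python
# Illustrative only: compare the cost of encoding theta2 from scratch against encoding
# it as a small residual around our own fit theta2_hat to f_theta1. The prior widths
# and precision are arbitrary placeholders.
import numpy as np

def uniform_code_bits(width, precision=1e-3):
    # bits to write down one real parameter to `precision` under a uniform prior of `width`
    return np.log2(width / precision)

n_params = 2
bits_from_scratch = n_params * uniform_code_bits(width=10.0)  # knowing nothing about theta2
bits_via_residual = n_params * uniform_code_bits(width=0.1)   # theta2 is close to our own fit
print(bits_from_scratch, bits_via_residual)                   # ~26.6 vs ~13.3 bits
```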
Fully learning the whole human-model seems impossible—after all, humans may have learned things that are more sophisticated than what we can learn with SGD (even if SGD learned a policy with “enough bits” to represent θ2, so that it could memorize them one by one if it saw the brain scans or whatever).
So we could try to do something like “learning just the part of the human policy that is about answering questions.” But it’s not clear to me how you could disentangle this from all the rest of the complicated stuff going on for the human.
Overall this seems like a pretty tricky case. The high-level summary is something like: “The model is able to learn to imitate humans by making detailed observations about humans, but we are not able to learn a similarly-good human model from scratch given data about what the human is ‘trying’ to do or how they interpret language.” Under these conditions it seems particularly challenging to either jointly represent Hθ2 and fθ1, or to compute how many bits you should “get back” based on a consistency condition between them. I expect it’s going to be reasonably obvious what to do in this case (likely exploiting the presumed limitation of our learning process), which is what I’ll be thinking about now.
The difficulty of jointly representing fθ1 and Hθ2 motivates my recent proposal, which avoids any such explicit representation. Instead it separately specifies θ1 and θ2, and then “gets back” bits by imposing a consistency condition that would have been satisfied only for a very small fraction of possible θ2’s (roughly exp(−|θ1|) of them).
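With made-up numbers, the intended accounting looks something like this:

```python
# Placeholder bookkeeping for the "get bits back" accounting; the numbers are invented
# and only the way the terms combine matters.
theta1_cost = 1e6    # bits to specify theta1 separately
theta2_cost = 1e8    # bits to specify theta2 separately

# The consistency condition holds for only ~exp(-|theta1|) of candidate theta2's, so
# conditioning on it refunds roughly |theta1| bits (ignoring the nats-vs-bits constant).
bits_back = theta1_cost
total_cost = theta1_cost + theta2_cost - bits_back
print(total_cost)    # roughly theta2_cost: no more expensive than specifying theta2 alone
```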
But thinking about this neural network case also makes it easy to talk about why my recent proposal could run into severe computational problems:
In order to calculate this loss function we need to evaluate how “special” θ2 is, i.e. how small is the fraction of θ2’s that are consistent with θ1.
In order to evaluate how special θ2 is, we basically need to do the same process of SGD that produces θ2—then we can compare the actual iterates to all of the places that it could have gone in a different direction, and conclude that almost all of the different settings of the parameters would have been much less consistent with θ1.
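Here’s one crude way that comparison could be operationalized in the toy setting (my own sketch, not the actual proposal): rerun SGD on θ2 and, at each iterate, score randomly perturbed alternative parameter settings by their consistency with fθ1, accumulating how rarely an alternative does as well as the actual iterate.

```python
# A crude specialness estimate in the toy setting: rerun SGD on theta2 and, at each
# iterate, compare it against random alternative directions it could have gone, scoring
# each by its consistency with f_theta1. All scales and thresholds are arbitrary.
import numpy as np

theta1 = np.array([2.0, 0.3, -0.5])

def f_theta1(x):
    return np.tanh(theta1[0] * x) + theta1[1] * x + theta1[2]

def consistency(theta2, x):
    # how consistent a candidate human-model is with theta1 on questions x
    return -np.mean((theta2[0] * x + theta2[1] - f_theta1(x)) ** 2)

rng = np.random.default_rng(0)
theta2 = np.zeros(2)
specialness_bits = 0.0
for step in range(500):
    x = rng.uniform(-0.2, 0.2, size=64)
    err = theta2[0] * x + theta2[1] - f_theta1(x)
    theta2 -= 0.5 * np.array([(err * x).mean(), err.mean()])

    # places the iterate "could have gone in a different direction"
    alternatives = theta2 + rng.normal(scale=0.3, size=(256, 2))
    frac_as_good = np.mean([consistency(a, x) >= consistency(theta2, x) for a in alternatives])
    specialness_bits += -np.log2(max(frac_as_good, 1 / 256))

print("crude specialness estimate (bits):", specialness_bits)
```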
The implicit hope of my proposal is that the outer neural network is learning its human model using something like SGD, and so it can do this specialness-calculation for free—it will be considering lots of different human-models, and it can observe that almost all of them are much less consistent with θ1.
But the outer neural network could learn to model humans in a very different way, which may not involve representing a series of iterates of “plausible alternative human models.” For example, suppose that in each datapoint we observe a few of the bits of θ2 directly (e.g. by looking at a brain scan), and we fill in much of θ2 in this way before we ever start making good predictions about human behavior. Then we never need to consider any other plausible human-models.
So in order to salvage a proposal like this, it seems like (at a minimum) the “specialness evaluation” needs to take place separately from the main learning of the human model, using a very different process (where we consider lots of different human models and see that it’s actually quite hard to find one that is similarly-consistent with θ1). This would take place at the point where the outer model started actually using its human model Hθ2 in order to answer questions.
I don’t really know what that would look like or if it’s possible to make anything like that work.