How much hope is there for jointly representing fθ1 and Hθ2?
The most obvious representation in this case is to first specify fθ1, and then actually model the process of gradient descent that produces Hθ2. This runs into a few problems:
1. Actually running gradient descent to find θ2 is too expensive to do at every datapoint—instead we'd need to learn a hypothesis that does a lot of its work "up front" (shared across all the datapoints). I don't know what that would look like. The naive ways of doing it (redoing the shared initial computation for every batch) only work for very large batch sizes, which may well be above the critical batch size; see the toy sketch after this list. If this were the only problem I'd feel pretty optimistic about spending a bunch of time thinking about it.
2. Specifying "the process that produced Hθ2" requires specifying the initialization ˜θ2, which is as big as θ2 itself. That said, the fact that the learned θ2 also contains a bunch of information about θ1 means that it can't contain perfect information about the initialization ˜θ2, i.e. that multiple initializations lead to exactly the same final state. So that suggests a possible out: we can start with an "initial" initialization ˜θ2^0, and then learn ˜θ2 by gradient descent. The fact that many different values of ˜θ2 would work suggests that it should be easier to find one of them; intuitively, if we set up training just right, it seems like we may be able to get all the bits back.
3. Running the gradient descent to find θ2 even a single time may be much more expensive than the rest of training. That is, human learning (perhaps extended over biological or cultural evolution) may take much more time than machine learning. If this is the case, then any approach that relies on reproducing that learning is completely doomed.
4. Similar to problem 2, the mutual information between θ2 and the learning process that produced θ2 also has to be kind of low—there are only |θ2| bits of mutual information to go around between the learning process, its initialization, and θ1. But exploiting this structure seems really hard if there aren't actually any fast learning processes that lead to the same conclusion.
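To make problem 1 concrete, here is a minimal toy sketch (entirely my own construction: f_theta1 is a fixed linear map, H_theta2 is a linear model fit to it by an inner SGD loop, and none of these names or modeling choices come from the proposal itself). The structural point is just that the inner gradient descent is shared work, and the naive joint hypothesis redoes it for every batch it is asked about:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (hypothetical): f_theta1 is the "intended" model, and the
# human-model H_theta2 is a linear map trained by inner SGD to agree with it.
D = 16
theta1 = rng.normal(size=D)        # plays the role of f_theta1's parameters
f = lambda x: x @ theta1           # f_theta1

def inner_sgd(theta2_init, xs, steps=200, lr=0.1):
    """The 'process that produced theta2': fit H_theta2 to f_theta1 by SGD."""
    theta2 = theta2_init.copy()
    for _ in range(steps):
        batch = xs[rng.choice(len(xs), size=8)]
        grad = batch.T @ (batch @ theta2 - f(batch)) / len(batch)
        theta2 -= lr * grad
    return theta2

aux_xs = rng.normal(size=(256, D))   # data the "human" learned from
theta2_init = rng.normal(size=D)     # the initialization ~theta2

# Problem 1 in miniature: the shared work (the whole inner SGD run) is redone
# for every batch of datapoints we want answers on, which is only affordable
# if batches are enormous.
def joint_hypothesis_answer(batch_xs):
    theta2 = inner_sgd(theta2_init, aux_xs)   # re-derived per batch
    return batch_xs @ theta2                  # H_theta2's answers

print(joint_hypothesis_answer(rng.normal(size=(4, D))))
```

The obvious fix is to hoist inner_sgd out of joint_hypothesis_answer and cache θ2, but that is exactly the "do a lot of its work up front" hypothesis whose form is unclear; amortizing the inner loop over a batch only pays off when batches are very large.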
My current take is that even in the case where θ2 was actually produced by something like SGD, we can't actually exploit that fact to produce a direct, causally-accurate representation of θ2.
That's kind of similar to what happens in my current proposal, though: instead we use the learning process embedded inside the broader world-model learning. (Or a new learning process that we create from scratch to estimate the specialness of θ2, as remarked in the sibling comment.)
So then the critical question is not "do we have enough time to reproduce the learning process that led to θ2?" but rather "can we directly learn Hθ2 as an approximation to fθ1?" If we are able to do this in any way, then we can use that to help compress θ2. In the other proposal, we can use it to help estimate the specialness of θ2 in order to determine how many bits we get back—it's starting to feel like these things aren't so different anyway.
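To spell out the accounting this paragraph and problems 2 and 4 are gesturing at, here is a back-of-the-envelope sketch in my own notation (treating description lengths as idealized code lengths, and writing A for the learning process and ˜θ2 for its initialization; this is an illustration, not something pinned down by the proposal). A plain two-part code for the joint hypothesis costs about |θ1| + |θ2| bits; if we can learn a distribution q over θ2 by requiring Hθ2 to approximate fθ1 (without looking at θ2 directly), we can code θ2 under q instead:

$$\text{cost} \approx |\theta_1| + \big(-\log_2 q(\theta_2)\big), \qquad \text{bits back} \approx |\theta_2| - \big(-\log_2 q(\theta_2)\big).$$

And the budget from problems 2 and 4 is just the chain rule for mutual information: if θ2 = A(˜θ2, θ1, data), then

$$I(\theta_2;\tilde{\theta}_2) + I(\theta_2; A \mid \tilde{\theta}_2) + I(\theta_2; \theta_1 \mid \tilde{\theta}_2, A) = I(\theta_2; \tilde{\theta}_2, A, \theta_1) \le H(\theta_2) \le |\theta_2|,$$

so whatever bits we hope to get back from the initialization, the learning process, and θ1 all have to fit inside the same |θ2|-bit budget.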
Fully learning the whole human-model seems impossible—after all, humans may have learned things that are more sophisticated than what we can learn with SGD (even if SGD learned a policy with "enough bits" to represent θ2, so that it could memorize them one by one if it saw the brain scans or whatever).
So we could try to do something like "learning just the part of the human policy that is about answering questions." But it's not clear to me how you could disentangle this from all the rest of the complicated stuff going on for the human.
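If it helps to picture what "just the question-answering part" could even mean, here is a toy sketch (the trunk/head factorization is entirely a hypothetical assumption of mine; the worry in the paragraph above is precisely that the real θ2 does not come pre-factored like this). We freeze everything else the human-model does and fit only a small question-answering head from (question, answer) data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical decomposition: theta2 = (trunk, qa_head). The frozen trunk
# stands in for "all the rest of the complicated stuff going on for the
# human"; we learn only the question-answering head.
D, H = 32, 8
trunk = rng.normal(size=(D, H)) / np.sqrt(D)   # frozen shared representation
qa_head = np.zeros(H)                          # the part we try to learn

def features(question_vec):
    return np.tanh(question_vec @ trunk)       # what the frozen trunk exposes

def answer(question_vec, head):
    return features(question_vec) @ head

# Toy supervision: answers come from a "true" head we never observe directly.
true_head = rng.normal(size=H)
questions = rng.normal(size=(512, D))
answers = np.array([answer(q, true_head) for q in questions])

lr = 0.2
for _ in range(500):
    idx = rng.choice(len(questions), size=32)
    feats = np.tanh(questions[idx] @ trunk)
    grad = feats.T @ (feats @ qa_head - answers[idx]) / len(idx)
    qa_head -= lr * grad                       # only the QA head is updated

print(abs(answer(questions[0], qa_head) - answers[0]))   # small if it worked
```

The disentangling worry is that in the real case there is no clean trunk/head boundary to freeze, so "the part about answering questions" is not something we can isolate and train separately.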
Overall this seems like a pretty tricky case. The high-level summary is something like: “The model is able to learn to imitate humans by making detailed observations about humans, but we are not able to learn a similarly-good human model from scratch given data about what the human is ‘trying’ to do or how they interpret language.” Under these conditions it seems particularly challenging to either jointly represent Hθ2 and fθ1, or to compute how many bits you should “get back” based on a consistency condition between them. I expect it’s going to be reasonably obvious what to do in this case (likely exploiting the presumed limitation of our learning process), which is what I’ll be thinking about now.