How much hope is there for jointly representing fθ1 and Hθ2?
The most obvious representation in this case is to first specify fθ1, and then actually model the process of gradient descent that produces Hθ2. This runs into a few problems:
1. Actually running gradient descent to find θ2 is too expensive to do at every datapoint—instead we'd need to learn a hypothesis that does a lot of its work "up front" (shared across all the datapoints). I don't know what that would look like. The naive ways of doing it (redoing the shared initial computation for every batch) only work for very large batch sizes, which may well be above the critical batch size; see the toy sketch after this list. If this were the only problem I'd feel pretty optimistic about spending a bunch of time thinking about it.
2. Specifying "the process that produced Hθ2" requires specifying the initialization ˜θ2, which is as big as θ2 itself. That said, the fact that the learned θ2 also contains a bunch of information about θ1 means that it can't contain perfect information about the initialization ˜θ2, i.e. that multiple initializations lead to exactly the same final state. So that suggests a possible out: we can start with an "initial" initialization ˜θ2^0, and then learn ˜θ2 by gradient descent. The fact that many different values of ˜θ2 would work suggests that it should be easier to find one of them; intuitively, if we set up training just right, it seems like we may be able to get all the bits back.
3. Running the gradient descent to find θ2 even a single time may be much more expensive than the rest of training. That is, human learning (perhaps extended over biological or cultural evolution) may take much more time than machine learning. If this is the case, then any approach that relies on reproducing that learning is completely doomed.
4. Similar to problem 2, the mutual information between θ2 and the learning process that produced θ2 also has to be kind of low—there are only |θ2| bits of mutual information to go around between the learning process, its initialization, and θ1. But exploiting this structure seems really hard if there aren't actually any fast learning processes that lead to the same conclusion.
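To make problem 1 concrete, here is a minimal toy sketch (entirely my own construction: f_theta1 is a fixed linear map, H_theta2 is a linear model fit to it by an inner SGD loop, and none of these names or modeling choices come from the proposal itself). The structural point is just that the inner gradient descent is shared work, and the naive joint hypothesis redoes it for every batch it is asked about:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (hypothetical): f_theta1 is the "intended" model, and the
# human-model H_theta2 is a linear map trained by inner SGD to agree with it.
D = 16
theta1 = rng.normal(size=D)        # plays the role of f_theta1's parameters
f = lambda x: x @ theta1           # f_theta1

def inner_sgd(theta2_init, xs, steps=200, lr=0.1):
    """The 'process that produced theta2': fit H_theta2 to f_theta1 by SGD."""
    theta2 = theta2_init.copy()
    for _ in range(steps):
        batch = xs[rng.choice(len(xs), size=8)]
        grad = batch.T @ (batch @ theta2 - f(batch)) / len(batch)
        theta2 -= lr * grad
    return theta2

aux_xs = rng.normal(size=(256, D))   # data the "human" learned from
theta2_init = rng.normal(size=D)     # the initialization ~theta2

# Problem 1 in miniature: the shared work (the whole inner SGD run) is redone
# for every batch of datapoints we want answers on, which is only affordable
# if batches are enormous.
def joint_hypothesis_answer(batch_xs):
    theta2 = inner_sgd(theta2_init, aux_xs)   # re-derived per batch
    return batch_xs @ theta2                  # H_theta2's answers

print(joint_hypothesis_answer(rng.normal(size=(4, D))))
```

The obvious fix is to hoist inner_sgd out of joint_hypothesis_answer and cache θ2, but that is exactly the "do a lot of its work up front" hypothesis whose form is unclear; amortizing the inner loop over a batch only pays off when batches are very large.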
My current take is that even in the case where θ2 was actually produced by something like SGD, we can't actually exploit that fact to produce a direct, causally-accurate representation of θ2.
That's kind of similar to what happens in my current proposal, though: instead we use the learning process embedded inside the broader world-model learning. (Or a new learning process that we create from scratch to estimate the specialness of θ2, as remarked in the sibling comment.)
So then the critical question is not "do we have enough time to reproduce the learning process that led to θ2?" but rather "can we directly learn Hθ2 as an approximation to fθ1?" If we are able to do this in any way, then we can use that to help compress θ2. In the other proposal, we can use it to help estimate the specialness of θ2 in order to determine how many bits we get back—it's starting to feel like these things aren't so different anyway.
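To spell out the accounting this paragraph and problems 2 and 4 are gesturing at, here is a back-of-the-envelope sketch in my own notation (treating description lengths as idealized code lengths, and writing A for the learning process and ˜θ2 for its initialization; this is an illustration, not something pinned down by the proposal). A plain two-part code for the joint hypothesis costs about |θ1| + |θ2| bits; if we can learn a distribution q over θ2 by requiring Hθ2 to approximate fθ1 (without looking at θ2 directly), we can code θ2 under q instead:

$$\text{cost} \approx |\theta_1| + \big(-\log_2 q(\theta_2)\big), \qquad \text{bits back} \approx |\theta_2| - \big(-\log_2 q(\theta_2)\big).$$

And the budget from problems 2 and 4 is just the chain rule for mutual information: if θ2 = A(˜θ2, θ1, data), then

$$I(\theta_2;\tilde{\theta}_2) + I(\theta_2; A \mid \tilde{\theta}_2) + I(\theta_2; \theta_1 \mid \tilde{\theta}_2, A) = I(\theta_2; \tilde{\theta}_2, A, \theta_1) \le H(\theta_2) \le |\theta_2|,$$

so whatever bits we hope to get back from the initialization, the learning process, and θ1 all have to fit inside the same |θ2|-bit budget.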
Fully learning the whole human-model seems impossible—after all, humans may have learned things that are more sophisticated than what we can learn with SGD (even if SGD learned a policy with "enough bits" to represent θ2, so that it could memorize them one by one if it saw the brain scans or whatever).
So we could try to do something like "learning just the part of the human policy that is about answering questions." But it's not clear to me how you could disentangle this from all the rest of the complicated stuff going on for the human.
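If it helps to picture what "just the question-answering part" could even mean, here is a toy sketch (the trunk/head factorization is entirely a hypothetical assumption of mine; the worry in the paragraph above is precisely that the real θ2 does not come pre-factored like this). We freeze everything else the human-model does and fit only a small question-answering head from (question, answer) data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical decomposition: theta2 = (trunk, qa_head). The frozen trunk
# stands in for "all the rest of the complicated stuff going on for the
# human"; we learn only the question-answering head.
D, H = 32, 8
trunk = rng.normal(size=(D, H)) / np.sqrt(D)   # frozen shared representation
qa_head = np.zeros(H)                          # the part we try to learn

def features(question_vec):
    return np.tanh(question_vec @ trunk)       # what the frozen trunk exposes

def answer(question_vec, head):
    return features(question_vec) @ head

# Toy supervision: answers come from a "true" head we never observe directly.
true_head = rng.normal(size=H)
questions = rng.normal(size=(512, D))
answers = np.array([answer(q, true_head) for q in questions])

lr = 0.2
for _ in range(500):
    idx = rng.choice(len(questions), size=32)
    feats = np.tanh(questions[idx] @ trunk)
    grad = feats.T @ (feats @ qa_head - answers[idx]) / len(idx)
    qa_head -= lr * grad                       # only the QA head is updated

print(abs(answer(questions[0], qa_head) - answers[0]))   # small if it worked
```

The disentangling worry is that in the real case there is no clean trunk/head boundary to freeze, so "the part about answering questions" is not something we can isolate and train separately.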
Overall this seems like a pretty tricky case. The high-level summary is something like: “The model is able to learn to imitate humans by making detailed observations about humans, but we are not able to learn a similarly-good human model from scratch given data about what the human is ‘trying’ to do or how they interpret language.” Under these conditions it seems particularly challenging to either jointly represent Hθ2 and fθ1, or to compute how many bits you should “get back” based on a consistency condition between them. I expect it’s going to be reasonably obvious what to do in this case (likely exploiting the presumed limitation of our learning process), which is what I’ll be thinking about now.