Here’s an example I’ve been thinking about today to investigate the phenomenon of re-using human models.
Suppose that the “right” way to answer questions is fθ1. And suppose that a human is a learned model Hθ2 trained by gradient descent to approximate fθ1 (subject to architectural and computational constraints). This model is very good on distribution, but we expect it to fail off distribution. We want to train a new neural network to approximate fθ1, without inheriting the human’s off-distribution failures (though the new network may have off-distribution failures of its own).
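To make the setup concrete, here’s a minimal toy version (my own illustration, with made-up functions and numbers): fθ1 is a fixed nonlinear “right” answerer, and Hθ2 is a smaller model fit to it by gradient descent, so it matches well on the training distribution and diverges off it.

```python
# A toy version of this setup, assuming f_theta1 is the fixed "right" answerer and the
# human H_theta2 is a smaller model fit to it by gradient descent. Everything here
# (the functions, the numbers, the distributions) is illustrative.
import numpy as np

theta1 = np.array([2.0, 0.3, -0.5])              # parameters of the "right" answerer

def f_theta1(x):
    # the "right" way to answer question x: some fixed nonlinear function
    return np.tanh(theta1[0] * x) + theta1[1] * x + theta1[2]

def H(theta2, x):
    # the human: a smaller (here, linear) model with parameters theta2
    return theta2[0] * x + theta2[1]

# Train H_theta2 by gradient descent to match f_theta1 on the training distribution
# (questions near 0, where the nonlinearity barely matters).
rng = np.random.default_rng(0)
theta2 = np.zeros(2)
for _ in range(2000):
    x = rng.uniform(-0.2, 0.2, size=64)          # on-distribution questions
    err = H(theta2, x) - f_theta1(x)
    theta2 -= 0.5 * np.array([(err * x).mean(), err.mean()])

# Very good on distribution, but fails off distribution, where the approximation breaks.
print("on-distribution error:", abs(H(theta2, 0.2) - f_theta1(0.2)))   # ~0.01
print("off-distribution error:", abs(H(theta2, 5.0) - f_theta1(5.0)))  # ~9
```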
The problem is that our model needs to learn the exact parameters θ2 of the human model in order to predict other aspects of human behavior. The simplest case is that we sometimes open up human brains and observe θ2 directly.
Once we’ve learned θ2 it is very easy to learn the question-answering policy Hθ2. So we’re worried that our model will do that rather than learning the additional parameters θ1 to implement fθ1.
Intuitively there is a strong connection between fθ1 and Hθ2. After all, θ2 is optimized to make them nearly equal on the training distribution. If you understood the dynamics of neural network training, it would likely be possible to essentially reconstruct fθ1 from θ2, i.e. the complexity of specifying both θ1 and θ2 is essentially the same as the complexity of specifying only θ2.
But it’s completely unclear how to jointly represent fθ1 and Hθ2 using some parameters θ3 of similar size to θ2. So prima facie there is a strong temptation to just reuse Hθ2.
How much hope is there for jointly representing fθ1 and Hθ2?
The most obvious representation in this case is to first specify fθ1, and then actually model the process of gradient descent that produces Hθ2. This runs into a few problems:
1. Actually running gradient descent to find θ2 is too expensive to do at every datapoint—instead we learn a hypothesis that does a lot of its work “up front” (shared across all the datapoints). I don’t know what that would look like. The naive ways of doing it (redoing the shared initial computation for every batch) only work for very large batch sizes, which may well be above the critical batch size. If this were the only problem I’d feel pretty optimistic about spending a bunch of time thinking about it.
2. Specifying “the process that produced Hθ2” requires specifying the initialization ˜θ2, which is as big as θ2. That said, the fact that the learned θ2 also contains a bunch of information about θ1 means that it can’t contain perfect information about the initialization ˜θ2, i.e. multiple initializations lead to exactly the same final state. So that suggests a possible out: we can start with an “initial” initialization ˜θ2⁰, and then learn ˜θ2 by gradient descent. The fact that many different values of ˜θ2 would work suggests that it should be easier to find one of them; intuitively, if we set up training just right, it seems like we may be able to get all the bits back.
3. Running the gradient descent to find θ2 even a single time may be much more expensive than the rest of training. That is, human learning (perhaps extended over biological or cultural evolution) may take much more time than machine learning. If this is the case, then any approach that relies on reproducing that learning is completely doomed.
4. Similar to problem 2, the mutual information between θ2 and the learning process that produced θ2 must also be kind of low—there are only |θ2| bits of mutual information to go around between the learning process, its initialization, and θ1. But exploiting this structure seems really hard if there aren’t actually any fast learning processes that lead to the same conclusion.
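To make the “most obvious representation” concrete in the toy setting, here’s a sketch in which the hypothesis stores θ1 plus a training recipe (a seed standing in for the initialization ˜θ2, a learning rate, a step count) and reconstructs θ2 by rerunning the human’s gradient descent. The rerun inside the hypothesis is exactly the cost that problems 1 and 3 point at; everything below is illustrative.

```python
# Sketch of the "most obvious representation" in the toy setting above: describe
# (f_theta1, H_theta2) by storing theta1 plus a training recipe, and recover theta2 by
# rerunning the human's gradient descent rather than storing theta2 itself.
import numpy as np

theta1 = np.array([2.0, 0.3, -0.5])

def f_theta1(x):
    return np.tanh(theta1[0] * x) + theta1[1] * x + theta1[2]

def reconstruct_theta2(f, seed=0, steps=2000, lr=0.5):
    """Rerun the human's training against f to recover theta2 from the recipe alone."""
    rng = np.random.default_rng(seed)
    theta2 = 0.01 * rng.normal(size=2)           # the initialization ~theta2
    for _ in range(steps):
        x = rng.uniform(-0.2, 0.2, size=64)
        err = theta2[0] * x + theta2[1] - f(x)
        theta2 -= lr * np.array([(err * x).mean(), err.mean()])
    return theta2

# Rerunning the same recipe gives the same theta2 every time, so (theta1, recipe) pins
# down H_theta2 without spending ~|theta2| extra bits on the parameters themselves.
print(reconstruct_theta2(f_theta1))
print(reconstruct_theta2(f_theta1))
```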
My current take is that even in the case where θ2 was actually produced by something like SGD, we can’t actually exploit that fact to produce a direct, causally-accurate representation of θ2.
That’s kind of similar to what happens in my current proposal though: instead we use the learning process embedded inside the broader world-model learning. (Or a new learning process that we create from scratch to estimate the specialness of θ2, as remarked in the sibling comment.)
So then the critical question is not “do we have enough time to reproduce the learning process that led to θ2?” but “can we directly learn Hθ2 as an approximation to fθ1?” If we are able to do this in any way, then we can use that to help compress θ2. In the other proposal, we can use it to help estimate the specialness of θ2 in order to determine how many bits we get back—it’s starting to feel like these things aren’t so different anyway.
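As a toy illustration of what “using that to help compress θ2” could mean (made-up priors and precisions, just for the comparison): if our own fit to fθ1 already pins down roughly where θ2 must lie, then writing down θ2 only costs the bits needed for the residual.

```python
# Illustrative only: compare the cost of encoding theta2 from scratch against encoding
# it as a small residual around our own fit theta2_hat to f_theta1. The prior widths
# and precision are arbitrary placeholders.
import numpy as np

def uniform_code_bits(width, precision=1e-3):
    # bits to write down one real parameter to `precision` under a uniform prior of `width`
    return np.log2(width / precision)

n_params = 2
bits_from_scratch = n_params * uniform_code_bits(width=10.0)  # knowing nothing about theta2
bits_via_residual = n_params * uniform_code_bits(width=0.1)   # theta2 is close to our own fit
print(bits_from_scratch, bits_via_residual)                   # ~26.6 vs ~13.3 bits
```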
Fully learning the whole human-model seems impossible—after all, humans may have learned things that are more sophisticated than what we can learn with SGD (even if SGD learned a policy with “enough bits” to represent θ2, so that it could memorize them one by one if it saw the brain scans or whatever).
So we could try to do something like “learning just the part of the human policy that is about answering questions.” But it’s not clear to me how you could disentangle this from all the rest of the complicated stuff going on for the human.
Overall this seems like a pretty tricky case. The high-level summary is something like: “The model is able to learn to imitate humans by making detailed observations about humans, but we are not able to learn a similarly-good human model from scratch given data about what the human is ‘trying’ to do or how they interpret language.” Under these conditions it seems particularly challenging to either jointly represent Hθ2 and fθ1, or to compute how many bits you should “get back” based on a consistency condition between them. I expect it’s going to be reasonably obvious what to do in this case (likely exploiting the presumed limitation of our learning process), which is what I’ll be thinking about now.
The difficulty of jointly representing fθ1 and Hθ2 motivates my recent proposal, which avoids any such explicit representation. Instead it separately specifies θ1 and θ2, and then “gets back” bits by imposing a consistency condition that would have been satisfied only for a very small fraction of possible θ2’s (roughly exp(−|θ1|) of them).
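With made-up numbers, the intended accounting looks something like this:

```python
# Placeholder bookkeeping for the "get bits back" accounting; the numbers are invented
# and only the way the terms combine matters.
theta1_cost = 1e6    # bits to specify theta1 separately
theta2_cost = 1e8    # bits to specify theta2 separately

# The consistency condition holds for only ~exp(-|theta1|) of candidate theta2's, so
# conditioning on it refunds roughly |theta1| bits (ignoring the nats-vs-bits constant).
bits_back = theta1_cost
total_cost = theta1_cost + theta2_cost - bits_back
print(total_cost)    # roughly theta2_cost: no more expensive than specifying theta2 alone
```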
But thinking about this neural network case also makes it easy to talk about why my recent proposal could run into severe computational problems:
In order to calculate this loss function we need to evaluate how “special” θ2 is, i.e. how small is the fraction of θ2’s that are consistent with θ1.
In order to evaluate how special θ2 is, we basically need to do the same process of SGD that produces θ2—then we can compare the actual iterates to all of the places that it could have gone in a different direction, and conclude that almost all of the different settings of the parameters would have been much less consistent with θ1.
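Here’s one crude way that comparison could be operationalized in the toy setting (my own sketch, not the actual proposal): rerun SGD on θ2 and, at each iterate, score randomly perturbed alternative parameter settings by their consistency with fθ1, accumulating how rarely an alternative does as well as the actual iterate.

```python
# A crude specialness estimate in the toy setting: rerun SGD on theta2 and, at each
# iterate, compare it against random alternative directions it could have gone, scoring
# each by its consistency with f_theta1. All scales and thresholds are arbitrary.
import numpy as np

theta1 = np.array([2.0, 0.3, -0.5])

def f_theta1(x):
    return np.tanh(theta1[0] * x) + theta1[1] * x + theta1[2]

def consistency(theta2, x):
    # how consistent a candidate human-model is with theta1 on questions x
    return -np.mean((theta2[0] * x + theta2[1] - f_theta1(x)) ** 2)

rng = np.random.default_rng(0)
theta2 = np.zeros(2)
specialness_bits = 0.0
for step in range(500):
    x = rng.uniform(-0.2, 0.2, size=64)
    err = theta2[0] * x + theta2[1] - f_theta1(x)
    theta2 -= 0.5 * np.array([(err * x).mean(), err.mean()])

    # places the iterate "could have gone in a different direction"
    alternatives = theta2 + rng.normal(scale=0.3, size=(256, 2))
    frac_as_good = np.mean([consistency(a, x) >= consistency(theta2, x) for a in alternatives])
    specialness_bits += -np.log2(max(frac_as_good, 1 / 256))

print("crude specialness estimate (bits):", specialness_bits)
```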
The implicit hope of my proposal is that the outer neural network is learning its human model using something like SGD, and so it can do this specialness-calculation for free—it will be considering lots of different human-models, and it can observe that almost all of them are much less consistent with θ1.
But the outer neural network could learn to model humans in a very different way, which may not involve representing a series of iterates of “plausible alternative human models.” For example, suppose that in each datapoint we observe a few of the bits of θ2 directly (e.g. by looking at a brain scan), and we fill in much of θ2 in this way before we ever start making good predictions about human behavior. Then we never need to consider any other plausible human-models.
So in order to salvage a proposal like this, it seems like (at a minimum) the “specialness evaluation” needs to take place separately from the main learning of the human model, using a very different process (where we consider lots of different human models and see that it’s actually quite hard to find one that is similarly-consistent with θ1). This would take place at the point where the outer model started actually using its human model Hθ2 in order to answer questions.
I don’t really know what that would look like or if it’s possible to make anything like that work.