The difficulty of jointly representing fθ1 and Hθ2 motivates my recent proposal, which avoids any such explicit representation. Instead it separately specifies θ1 and θ2, and then “gets back” bits by imposing a consistency condition that would have been satisfied only for a very small fraction of possible θ2’s (roughly exp(−|θ1|) of them).
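To make the “gets back bits” arithmetic explicit (this gloss is mine; it just unpacks the fraction quoted above, reading |θ1| as a description length):

$$-\log\bigl(\text{fraction of } \theta_2\text{'s satisfying the condition}\bigr) \;\approx\; -\log e^{-|\theta_1|} \;=\; |\theta_1|,$$

so the consistency condition recovers roughly |θ1| bits.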
But thinking about this neural network case also makes it easy to talk about why my recent proposal could run into severe computational problems:
In order to calculate this loss function we need to evaluate how “special” θ2 is, i.e. how small a fraction of the possible θ2’s are consistent with θ1.
In order to evaluate how special θ2 is, we basically need to run the same process of SGD that produced θ2; then we can compare the actual iterates to all of the places the process could have gone in a different direction, and conclude that almost all of the alternative parameter settings would have been much less consistent with θ1.
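Concretely, the computation this seems to require looks something like the toy sketch below. Everything in it (the consistency function, the toy human-prediction loss, the scoring rule) is a made-up stand-in; the point is only the shape and cost of the computation, which re-runs the whole SGD process and scores many alternative steps at every iterate.

```python
# Toy sketch (my illustration, not the original proposal): replay the SGD run that
# produced theta2 and, at each step, ask what fraction of same-sized steps in random
# directions would have been at least as consistent with theta1.
import numpy as np

rng = np.random.default_rng(0)

D = 20                         # dimension of the (toy) human-model parameters theta2
theta1 = rng.normal(size=D)    # stand-in for the world-model parameters

def consistency(theta1, theta2):
    """Toy stand-in for the consistency condition between f_theta1 and H_theta2.
    Higher = more consistent."""
    return -np.sum((theta2 - theta1) ** 2)

def human_loss_grad(theta2):
    """Gradient of a toy 'predict human behavior' loss. In this toy it happens to
    pull theta2 toward theta1, so SGD ends up at a highly consistent human model."""
    return 2 * (theta2 - theta1) + 0.1 * rng.normal(size=D)

def random_unit_vectors(n, d):
    v = rng.normal(size=(n, d))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def specialness_of_sgd_run(steps=200, lr=0.05, n_alternatives=64):
    """Accumulate -log(fraction of alternative steps at least as consistent with theta1)
    over the whole trajectory, as a crude 'specialness' estimate."""
    theta2 = rng.normal(size=D)
    total = 0.0
    for _ in range(steps):
        step = -lr * human_loss_grad(theta2)
        actual = theta2 + step
        # Alternative places this step "could have gone": random directions, same length.
        alternatives = theta2 + np.linalg.norm(step) * random_unit_vectors(n_alternatives, D)
        frac = np.mean([consistency(theta1, a) >= consistency(theta1, actual)
                        for a in alternatives])
        total += -np.log(max(frac, 1.0 / n_alternatives))   # avoid log(0)
        theta2 = actual
    return total, theta2

specialness, theta2 = specialness_of_sgd_run()
print(f"estimated specialness of theta2 (in nats): {specialness:.1f}")
```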
The implicit hope of my proposal is that the outer neural network is learning its human model using something like SGD, and so it can do this specialness-calculation for free—it will be considering lots of different human-models, and it can observe that almost all of them are much less consistent with θ1.
But the outer neural network could learn to model humans in a very different way, one that may not involve representing a series of iterates of “plausible alternative human models.” For example, suppose that in each datapoint we observe a few of the bits of θ2 directly (e.g. by looking at a brain scan), and we fill in much of θ2 in this way before we ever start making good predictions about human behavior. Then we never need to consider any other plausible human-models.
So in order to salvage a proposal like this, it seems like (at a minimum) the “specialness evaluation” needs to take place separately from the main learning of the human model, using a very different process (one where we consider lots of different human models and see that it’s actually quite hard to find one that is similarly consistent with θ1). This would take place at the point where the outer model starts actually using its human model Hθ2 in order to answer questions.
I don’t really know what that would look like or if it’s possible to make anything like that work.
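One very naive picture of that separate evaluation, just to make the shape concrete (the functions are the same kind of made-up stand-ins as in the earlier sketch, and it’s not clear this captures what’s actually needed): refit many human models from scratch on the same human-behavior data, without looking at θ1, and check how rare it is to land on one as consistent with θ1 as the Hθ2 the model actually uses.

```python
# Toy sketch of a separate specialness evaluation run when the model starts
# using H_theta2 to answer questions (again, made-up stand-ins throughout).
import numpy as np

rng = np.random.default_rng(1)

D = 20
theta1 = rng.normal(size=D)                    # the outer model's world-model parameters
theta2 = theta1 + 0.01 * rng.normal(size=D)    # the human model the outer model actually uses

def consistency(theta1, cand):
    """Same toy stand-in for the consistency condition as before."""
    return -np.sum((cand - theta1) ** 2)

def refit_human_model():
    """Stand-in for independently re-learning a human model from the same human-behavior
    data, without looking at theta1 (random restart plus a noisy toy fitting loop)."""
    cand = rng.normal(size=D)
    for _ in range(50):
        cand -= 0.1 * (cand - theta2)          # toy: fit the behavior that theta2 explains
        cand += 0.05 * rng.normal(size=D)      # noise: restarts land near, but not at, theta2
    return cand

n_restarts = 200
candidates = [refit_human_model() for _ in range(n_restarts)]
frac = np.mean([consistency(theta1, c) >= consistency(theta1, theta2) for c in candidates])
print(f"fraction of refit human models at least as consistent with theta1: {frac:.3f}")
```

If almost every refit model is much less consistent with θ1, that is the kind of evidence of specialness the loss function needs; but it again costs many full training runs, which is the computational problem all over again.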