The most fundamental reason that I don’t expect this to work is that it gives up on “sharing parameters” between the extractor and the human model. But in many cases it seems possible to share those parameters, and giving up on that feels extremely unstable, since it’s trying to push against competitiveness (i.e. the model will want to find some way to save those parameters, and you don’t want your intended solution to involve subverting that natural pressure).
Intuitively, I can imagine three kinds of approaches to doing this parameter sharing:
1. Introduce some latent structure L (e.g. the semantics of natural language, what a cat “actually is”) that is used to represent both humans and the intended question-answering policy. This is the diagram H←L→f+.
2. Introduce some consistency check f? between H and f+. This is the diagram H→f?←f+.
3. Somehow extract f+ from H, or build it out of pieces derived from H. This is the diagram H→f+. This is kind of like a special case of 1, but it feels pretty different. (All three structures are sketched in toy code below.)
(You could imagine having slightly more general diagrams corresponding to any sort of d-connection between H and f+.)
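To make the three diagrams concrete, here is a minimal toy sketch in code. Everything in it is hypothetical (the names, the linear maps, the dimensions); the only point is which parameter blocks are shared between the human model H and the intended question-answerer f+.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_latent, d_out = 8, 4, 2

# Approach 1: H <- L -> f+  (H and f+ share the latent structure L)
theta_L = rng.normal(size=(d_in, d_latent))    # shared parameters ("world model" / semantics)
theta_H = rng.normal(size=(d_latent, d_out))   # human-model head
theta_f = rng.normal(size=(d_latent, d_out))   # f+ head

def approach_1(x):
    L = x @ theta_L                            # compute the shared latent once
    return L @ theta_H, L @ theta_f            # (human answer, honest answer)

# Approach 2: H -> f? <- f+  (separate models tied together by a consistency check)
theta_H2 = rng.normal(size=(d_in, d_out))
theta_f2 = rng.normal(size=(d_in, d_out))

def approach_2(x):
    h, f = x @ theta_H2, x @ theta_f2
    penalty = float(np.sum((h - f) ** 2))      # crude stand-in for the check f?
    return h, f, penalty

# Approach 3: H -> f+  (f+ built out of pieces derived from H)
theta_readout = rng.normal(size=(d_out, d_out))

def approach_3(x):
    return (x @ theta_H2) @ theta_readout      # f+ reuses H's parameters wholesale

x = rng.normal(size=(d_in,))
print(approach_1(x))
print(approach_2(x))
print(approach_3(x))
```

In this cartoon, approach 1 puts most of the bits into the shared θ_L with cheap heads on top, approach 2 shares nothing and leans entirely on the consistency penalty, and approach 3 pays for almost nothing beyond H itself.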
Approach 1 is the most intuitive, and it seems appealing because we can basically leave it up to the model to introduce the factorization (and it feels like there is a good chance that it will happen completely automatically). There are basically two challenges with this approach:
It’s not clear that we can actually jointly compress H and f+. For example, what if we represent H in an extremely low-level way, as a bunch of neurons firing; the neurons are connected in a complicated and messy way that learned to implement something like f+, but need not have any simple representation in terms of f+. Even if such a factorization is possible, it’s completely unclear how to argue about how hard it is to learn. This is a lot of what motivates the compression-based approaches: we can just say “H is some mess, but you can count on it basically computing f+” and then make simple arguments about competitiveness (it’s basically just as hard as separately learning H and f+; see the toy arithmetic after this list).
If you overcame that difficulty, you’d still have to actually incentivize this kind of factorization in the model (rather than sharing parameters in the unintended way). It’s unclear how to do that (maybe you’re back to thinking about something speed-prior-like, and this is just a way to address my concern about the speed-prior-like proposals; see the second sketch after this list), but this feels more tractable than the first problem.
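As a crude illustration of the competitiveness argument above (all numbers are invented; nothing here is a real estimate), the point is just that the intended factorization should cost no more than learning H and f+ separately:

```python
# Toy description-length accounting (numbers are made up for illustration).
# Separate learning: pay for H and f+ independently.
bits_H, bits_f_plus = 1_000_000, 400_000
separate_cost = bits_H + bits_f_plus                      # 1,400,000 bits

# Intended factorization: a shared latent L plus cheap heads for H and f+.
bits_L, bits_H_given_L, bits_f_given_L = 900_000, 100_000, 50_000
shared_cost = bits_L + bits_H_given_L + bits_f_given_L    # 1,050,000 bits

# The compression-style argument only needs the weak direction: "H which
# happens to compute f+" is no harder than learning H and f+ separately.
assert shared_cost <= separate_cost
```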
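And a similarly toy illustration of how a speed-prior-like pressure could in principle favor the intended factorization (again, entirely hypothetical; this is just an additive compute penalty, not a claim about what the actual proposal looks like):

```python
# If f+ can either reuse the latent already computed for H (cheap) or run its
# own separate pathway (expensive), a compute penalty makes sharing the better
# deal, all else equal. All costs below are invented.

def total_cost(task_loss, compute_units, lam=0.01):
    # speed-prior-like objective: task loss plus a penalty on compute
    return task_loss + lam * compute_units

COST_LATENT, COST_HEAD, COST_SEPARATE_PATHWAY = 100, 5, 90

shared_route   = total_cost(task_loss=1.00, compute_units=COST_LATENT + 2 * COST_HEAD)
separate_route = total_cost(task_loss=1.00, compute_units=COST_LATENT + COST_HEAD + COST_SEPARATE_PATHWAY)

print(shared_route < separate_route)   # True: the penalty favours sharing the latent
```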
I’ve been thinking about approach 2 over the last 2 months. My biggest concern is that it feels like you have to pay the bits of H back “as you learn them” with SGD, but you may learn them in such a way that you don’t really get a useful consistency update until you’ve basically specified all of H. (E.g. suppose you are exposed to brain scans of humans for a long time before you learn to answer questions in a human-like way. Then at the end you want to use that to pay back the bits of the brain scans, but in order to do so you need to imagine lots of different ways the brain scans could have looked. But there’s no tractable way to do that, because you have to fill in the full brain scan before it really tells you about whether your consistency condition holds.)
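Here is a toy numerical picture of that worry (the curve is made up; only its shape is the point): the cost of H’s bits accrues throughout training, but the consistency check only starts paying anything back once H is almost fully specified, so SGD never sees a useful signal along the way.

```python
import numpy as np

# Made-up numbers; only the shape of the curves matters.
frac_of_H_specified = np.linspace(0.0, 1.0, 11)
bits_paid = 1_000_000 * frac_of_H_specified                  # cost accrues as H is learned
bits_paid_back = np.where(frac_of_H_specified < 0.95,        # consistency says nothing...
                          0.0, 600_000.0)                    # ...until H is nearly complete

for frac, paid, back in zip(frac_of_H_specified, bits_paid, bits_paid_back):
    print(f"H {frac:4.0%} specified: paid {paid:9,.0f} bits, consistency pays back {back:9,.0f}")
```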
Approach 3 is in some sense the most direct. I think this naturally looks like imitative generalization, where you use a richer set of human answers to basically build f+ on top of your model. I don’t see how to make this kind of thing work totally on its own, but I’m probably going to spend a bit of time thinking about how to combine it with approaches 1 and 2.
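To gesture at what the imitative-generalization flavour of approach 3 could look like mechanically, here is a very rough sketch (every name and number is hypothetical, and the “base model” is just a random feature map): fit a small head standing in for f+ on top of the model’s frozen features, supervised by a richer set of human answers, then reuse it where direct human answers aren’t available.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the learned model: a fixed random feature map (hypothetical).
W_base = rng.normal(size=(8, 16))

def base_model_features(x):
    return np.tanh(x @ W_base)

# A richer human-labelled dataset (entirely made up).
X_human = rng.normal(size=(100, 8))
y_human = (X_human.sum(axis=1) > 0).astype(float)   # pretend "human answers"

# Build f+ as a least-squares head on top of the frozen features.
Phi = base_model_features(X_human)
w_f_plus, *_ = np.linalg.lstsq(Phi, y_human, rcond=None)

# Reuse f+ on new inputs where no direct human answers are available.
X_new = rng.normal(size=(5, 8))
print(base_model_features(X_new) @ w_f_plus)
```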