Here’s another approach to “shortest circuit” that is designed to avoid this problem:
1. Learn a circuit C(X) that outputs an entire set of beliefs. (Or maybe some different architecture, but with ~0 weight sharing, so that computational complexity = description complexity.)
2. Impose a consistency requirement on those beliefs, even in cases where a human can’t tell the right answer.
3. Require C(X)’s beliefs about Y to match Fθ(X). We hope that this makes C an explication of “Fθ’s beliefs.”
4. Optimize some combination of (complexity) vs (usefulness), or chart the whole Pareto frontier, or whatever. I’m a bit confused about how this step would work, but the other proposals in this genre face similar difficulties, so it’s exciting if this proposal even gets to that final step. (A toy sketch of how steps 1-4 might fit together follows this list.)
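To make steps 1-4 concrete, here is a minimal sketch in PyTorch. Everything in it is an illustrative stand-in rather than part of the proposal: `BeliefCircuit` is just an MLP with no weight sharing (so parameter count tracks compute per forward pass), `f_theta` is a frozen random map standing in for Fθ, the implication pairs are invented, and an L1 norm is a crude differentiable proxy for circuit complexity.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
N_IN, N_HIDDEN, N_BELIEFS, Y_INDEX = 8, 16, 5, 0

class BeliefCircuit(nn.Module):
    """Step 1: map X to a whole vector of belief probabilities.
    No weights are shared or reused, so description complexity
    (parameter count) tracks compute per forward pass."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_IN, N_HIDDEN), nn.ReLU(),
            nn.Linear(N_HIDDEN, N_BELIEFS), nn.Sigmoid())

    def forward(self, x):
        return self.net(x)

# Invented logical structure: belief i implies belief j, so any
# coherent belief set needs P(i) <= P(j).
IMPLICATIONS = [(1, 2), (3, 4)]

def consistency_loss(b):
    """Step 2: penalize incoherent belief sets; no human label needed."""
    return sum(torch.relu(b[:, i] - b[:, j]).mean() for i, j in IMPLICATIONS)

def matching_loss(b, f_out):
    """Step 3: C's belief about Y has to agree with F_theta(X)."""
    return ((b[:, Y_INDEX] - f_out) ** 2).mean()

def complexity(model):
    """Crude differentiable proxy for circuit size (the 'complexity' axis)."""
    return sum(p.abs().sum() for p in model.parameters())

# Frozen stand-in for the predictor F_theta: here just a fixed random map.
f_theta = nn.Sequential(nn.Linear(N_IN, 1), nn.Sigmoid())
for p in f_theta.parameters():
    p.requires_grad_(False)

X = torch.randn(256, N_IN)
with torch.no_grad():
    F_OUT = f_theta(X).squeeze(-1)

# Step 4: sweep the tradeoff weight to trace a (complexity, usefulness)
# frontier instead of committing to one exchange rate.
for lam in [0.0, 1e-4, 1e-3, 1e-2]:
    C = BeliefCircuit()
    opt = torch.optim.Adam(C.parameters(), lr=1e-2)
    for _ in range(500):
        beliefs = C(X)
        loss = (matching_loss(beliefs, F_OUT)
                + consistency_loss(beliefs)
                + lam * complexity(C))
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        beliefs = C(X)
        print(f"lam={lam:g}  match={matching_loss(beliefs, F_OUT).item():.4f}  "
              f"consistency={consistency_loss(beliefs).item():.4f}  "
              f"size={complexity(C).item():.1f}")
```

The λ sweep in the last loop is one way to read step 4: rather than picking a single exchange rate between complexity and usefulness, train at several tradeoff weights and look at the resulting frontier.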
The “intended” circuit C just follows along with the computation done by Fθ and then translates its internal state into natural language.
What about the problem case where Fθ computes some reasonable beliefs (e.g. using the instrumental policy, where the simplicity prior makes us skeptical about their generalization) that C could just read off? I’ll imagine those being written down somewhere on a slip of paper inside of Fθ’s model of the world.
Suppose that the slip of paper is not relevant to predicting Fθ(X), i.e. it’s a spandrel from the weight sharing. Then the simplest circuit C just wants to cut it out. Whatever computation was done to write things down on the slip of paper can be done directly by C, so it seems like we’re in business.
So suppose instead that the slip of paper is relevant for predicting Fθ(X), e.g. because someone looks at the slip of paper and then takes an action that affects Y. If (the correct) Y is itself depicted on the slip of paper, then we can again cut out the slip of paper and just run the same computation that was done by whoever wrote on it. Otherwise, the answers produced by C still have to contain both the items on the slip of paper and some facts that are causally downstream of the slip (as well as, hopefully, some facts about the slip of paper itself). At that point it seems like we have a pretty good chance of getting a consistency violation out of C (a toy version of such a check follows below).
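To make “consistency violation” concrete, here is a toy check in Python. All of the belief names and the scenario are hypothetical; the only substantive content is the Fréchet-style coherence bound that if A and B jointly imply C, any coherent belief set needs P(C) >= P(A) + P(B) - 1.

```python
def violated(beliefs, rules, tol=0.05):
    """Flag rules ((a, b), c), read as 'a and b jointly imply c', that a
    reported belief set violates: probabilistic coherence requires
    P(c) >= P(a) + P(b) - 1 (a Frechet-style bound)."""
    return [((a, b), c) for (a, b), c in rules
            if beliefs[c] < beliefs[a] + beliefs[b] - 1 - tol]

# Hypothetical report read off from C: it copies the slip's claim and also
# has to report facts causally downstream of people acting on the slip.
beliefs = {
    "slip_says_rain": 0.95,       # an item written on the slip of paper
    "planner_trusts_slip": 0.90,  # someone acts on the slip, affecting Y
    "event_moved_indoors": 0.10,  # downstream fact C also has to report
}
rules = [(("slip_says_rain", "planner_trusts_slip"), "event_moved_indoors")]
print(violated(beliefs, rules))
# -> [(('slip_says_rain', 'planner_trusts_slip'), 'event_moved_indoors')]
#    since 0.10 < 0.95 + 0.90 - 1: a violation the consistency check catches.
```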
Probably nothing like this can work, but I now feel like there are two live proposals for capturing the optimistic minimal circuits intuition: the one in this current comment, and the one in this other comment. I still feel like the aggressive speed penalization is doing something, and probably we can either find a working proposal in that space or else come up with a clearer counterexample.