paulfchristiano comments on paulfchristiano’s Shortform

paulfchristiano 15 Jul 2021 18:45 UTC
Here’s a slightly more formal algorithm along these lines:
- Assume that both the human’s model $W_{H}$ and the AI’s model $W_{AI}$ are Bayesian networks, where the probability distribution over a node $v$’s value is computed from the values of its parents $\mathrm{pa}(v)$. I’ll write $\mathrm{Values}(v)$ for the set of values that a node $v$ can take on (in either model), and $\mathrm{Values}(S)$ for the joint values of a set of nodes $S$.
- A correspondence tells you how to compute the value of each node $v$ in the human’s model. It consists of (i) a neighborhood $N_{AI}(v)$ in the AI’s model which suffices to determine $v$, and (ii) a function $f_{v} : \mathrm{Values}(N_{AI}(v)) \to \mathrm{Values}(v)$.
- Both the AI’s model and the human model contain some distinguished observation nodes. $f$ must be the identity on these nodes.
- An “explanation” of a correspondence consists of a set of nodes $E_{AI}(v)$ in the AI’s model for each node $v$ in the human’s model. The intuition is that we can run a simulation involving only these nodes in order to reproduce the probability distribution of $v$ given its parents’ values.
- In particular, $N_{AI}(v) \subseteq E_{AI}(v)$, and $N_{AI}(u) \subseteq E_{AI}(v)$ for all $u \in \mathrm{pa}(v)$. To check whether $E_{AI}(v)$ reproduces the right distribution, we first sample values uniformly at random for every node in $E_{AI}(v)$ that has at least one parent outside $E_{AI}(v)$. Then we sample values for the remaining nodes from their conditional distributions. We can use $f$ to compute the corresponding values for $v$ and all of its parents, and from those we can compute the conditional distribution of $v$ given each set of values for its parents.
- We require that the explanations exactly reproduce the conditional probability over $\mathrm{Values}(v)$ given $\mathrm{Values}(\mathrm{pa}(v))$.
- The “cost” of the explanation of $v$ is the total compute required to sample all the nodes in $E_{AI}(v)$. The “cost” of the correspondence $f_{v}$ is the compute required to evaluate it.
- We search for the set of correspondences and explanations for which the total cost is minimized.
- (Maybe we also have some requirement where the correspondence $f_{v}$ agrees with some training data about $v$. I’m not really sure about that.)
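As a toy illustration of the check described above, here is a minimal Python sketch. Everything concrete in it is an assumption made for the example rather than part of the proposal: the three-node AI model (`a`, `b`, `c`), the two-node human model (`x -> y`), the unit sampling costs, and all helper names. It enumerates the joint distribution exactly instead of sampling, treats any node with a parent outside the explanation as uniform, and keeps the cheapest candidate explanation that exactly reproduces the human’s conditional distribution.

```python
from fractions import Fraction
from itertools import product

# Hypothetical AI model: binary nodes listed in topological order. "p1" gives
# P(node = 1 | parent values); "cost" is a made-up compute cost for sampling.
AI_MODEL = {
    "a": {"parents": (), "p1": lambda: Fraction(1, 2), "cost": 1},
    "b": {"parents": (), "p1": lambda: Fraction(1, 5), "cost": 1},
    "c": {"parents": ("a", "b"), "p1": lambda a, b: Fraction(a ^ b), "cost": 1},
}

# Hypothetical human model x -> y, stored as P(y = 1 | x).
HUMAN_P_Y1_GIVEN_X = {0: Fraction(1, 5), 1: Fraction(4, 5)}

# Hypothetical correspondence: N_AI(x) = {a}, N_AI(y) = {c}, both read off directly.
F = {"x": lambda vals: vals["a"], "y": lambda vals: vals["c"]}


def joint_over(explanation):
    """Exact joint distribution over the AI nodes in `explanation`."""
    nodes = [n for n in AI_MODEL if n in explanation]
    dist = {}
    for values in product((0, 1), repeat=len(nodes)):
        vals = dict(zip(nodes, values))
        prob = Fraction(1)
        for n in nodes:
            spec = AI_MODEL[n]
            if all(p in explanation for p in spec["parents"]):
                p1 = spec["p1"](*(vals[p] for p in spec["parents"]))
            else:
                p1 = Fraction(1, 2)  # a parent is missing: sample uniformly
            prob *= p1 if vals[n] == 1 else 1 - p1
        dist[values] = prob
    return nodes, dist


def induced_conditional(explanation):
    """P(y = 1 | x) obtained by pushing the explanation's joint through f."""
    nodes, dist = joint_over(explanation)
    joint = {(x, y): Fraction(0) for x in (0, 1) for y in (0, 1)}
    for values, prob in dist.items():
        vals = dict(zip(nodes, values))
        joint[(F["x"](vals), F["y"](vals))] += prob
    return {x: joint[(x, 1)] / (joint[(x, 0)] + joint[(x, 1)]) for x in (0, 1)}


def cost(explanation):
    """Total (made-up) compute needed to sample every node in the explanation."""
    return sum(AI_MODEL[n]["cost"] for n in explanation)


# Both candidates contain N_AI(y) = {c} and N_AI(x) = {a}, as required.
candidates = [frozenset("ac"), frozenset("abc")]
valid = [e for e in candidates if induced_conditional(e) == HUMAN_P_Y1_GIVEN_X]
best = min(valid, key=cost)
print(sorted(best), cost(best))  # ['a', 'b', 'c'] 3
```

In this toy case the smaller candidate `{a, c}` fails the check: `c` has a parent outside the explanation, so it is sampled uniformly and the induced $P(y \mid x)$ comes out as 1/2 rather than the human model’s 1/5 and 4/5. Only the explanation that also includes `b` is valid, so it is the one that gets paid for.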
Reviewing how this behaves in each of the bad cases from the parent:
- It’s very bad to define $f_{v}$ by computing the observation and then using the human’s inference algorithm. The entire motivation for defining it this way was to save on description complexity, but $f_{v}$ is only being penalized for computational complexity. (This also forces every single $E_{AI}(v)$ to include the entire process that generates the observations, which seems like it should be an even bigger deal. But this feels less like the “right” reason and I think it might change for a future version of the scheme.)
- If there is a human in the simulation who knows the value of $v$, it’s extremely bad to define $N_{AI}(v)$ to be that human. This is because the explanation $E_{AI}(v)$ will need to include the entire process by which the human computes that value. This is guaranteed to be more expensive than just performing that compute in the correspondence $f_{v}$ itself.
- We could define $f_{v}$ to compute the value of $v$ from scratch using the initial values of the simulation, using some more-efficient simulation. But we can’t do this for the observation node $\mathrm{Obs}$, since it is required to match up with the observation node in the AI’s model. So in order to explain the observation node we will need to have $E_{AI}(\mathrm{Obs})$ include the AI’s entire model, or at least the parts leading up to the observation. This means we are basically paying for two full copies of the computation, and so we’re not really benefiting from the fact that the second copy is more efficient. (We are also potentially overlapping a lot of computation between the different $f_{v}$’s, but again I’m a bit less sure how robust that is and whether it will hold up in different formalizations.)
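The three comparisons above all score candidates against the same total cost. One way to write it down, as a sketch: the symbols $\mathrm{cost}_f$ and $\mathrm{cost}_s$ (the compute to evaluate a correspondence and to sample a node, respectively) and $P_{E_{AI}(v)}$ (the distribution induced by simulating only the nodes in $E_{AI}(v)$) are shorthand introduced here, not notation from the proposal itself.

$$
\begin{aligned}
\min_{\{N_{AI}(v),\, f_v,\, E_{AI}(v)\}_{v \in W_H}} \quad & \sum_{v \in W_H} \Big[ \mathrm{cost}_f(f_v) + \sum_{u \in E_{AI}(v)} \mathrm{cost}_s(u) \Big] \\
\text{subject to} \quad & N_{AI}(v) \subseteq E_{AI}(v), \qquad N_{AI}(u) \subseteq E_{AI}(v) \ \text{ for all } u \in \mathrm{pa}(v), \\
& f_v = \mathrm{id} \ \text{ on observation nodes}, \\
& P_{E_{AI}(v)}\big(v \mid \mathrm{pa}(v)\big) = P_{W_H}\big(v \mid \mathrm{pa}(v)\big).
\end{aligned}
$$

Because each $E_{AI}(v)$ is summed separately, a node that appears in several explanations is paid for once per explanation. Under this reading, that repeated charge is what makes the “human in the simulation” and “recompute from scratch” options expensive in the cases above.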
There are a lot of problems and missing details in this proposal:
- This requires exactly reproducing the conditional probabilities in the human’s model. But that’s totally unrealistic for anything short of a full simulation—and even for the full simulation it wouldn’t reproduce the probabilities since the human model isn’t accurate. So we need some way to choose a good enough explanation, i.e. a way of balancing the computational complexity of the explanation against the quality of the conditional probabilities that come out.
- We’re sampling the inputs to $E_{AI}(v)$ uniformly at random. This seems unlikely to work in general. We could easily sample each node from its marginal, but most of the action is in the correlation. Allowing arbitrary correlations causes problems (since you could just specify the “human is accurate” correlation and then read off the correct answers from there). So I think probably some more flexible system is needed here; there are a lot of options but it’s tricky.
- There is something problematic about the overlapping explanations $E_{AI}(v)$. If they overlap you need to pay for all of them, but for the intended mapping there will often be quite significant overlap. This isn’t inherently a problem, but I’m scared that it’s going to introduce a lot of pressure towards some different correspondence that is able to avoid that problem. We need to penalize overlap because of the case where the training data is embedded in the model: the main problem with that model is that you need to separately explain every way in which the human is correct, using highly overlapping explanations. If you didn’t penalize overlap, you might just end up with the embedded explanations (for which $f_{v}$ is extremely cheap).
- There is something tricky about uniformity in the model and in the implementations of $f_{v}$.
- I’m still scared about the “recompute everything from scratch” failure mode. The model does need to have a single explanation $E_{AI}(\mathrm{Obs})$ that needs to include the whole model. But (i) it doesn’t have to reproduce work, (ii) it can cut out all the stuff not on the path to the observation. So the obvious reason that this one loses is by the duplicated work in $f_{v}$. Hopefully that’s actually robust.
- We are making really strong structural assumptions on the models and the correspondence between them. We get some things for free (because humans actually do have extra structure in our beliefs about the world that is properly part of the problem statement, and the AI’s model is constrained by its architecture) but not nearly this much.
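The first item above asks for some way to trade explanation cost against how well the conditional probabilities are reproduced. Purely as an illustration of where such a trade-off could enter (the proposal does not specify one), the exact-match requirement could be replaced by a penalty term. Here $D$ is an unspecified divergence between distributions, $\lambda > 0$ is a hypothetical weight, and $\mathrm{cost}_f$, $\mathrm{cost}_s$, and $P_{E_{AI}(v)}$ denote the compute to evaluate a correspondence, the compute to sample a node, and the distribution induced by simulating only $E_{AI}(v)$. All of these are assumptions introduced for the sketch.

$$
\min \; \sum_{v} \Big[ \mathrm{cost}_f(f_v) + \sum_{u \in E_{AI}(v)} \mathrm{cost}_s(u) + \lambda \, D\Big( P_{W_H}\big(v \mid \mathrm{pa}(v)\big) \,\Big\|\, P_{E_{AI}(v)}\big(v \mid \mathrm{pa}(v)\big) \Big) \Big]
$$

How to choose $D$ and $\lambda$ is exactly the open question; the sketch only shows the shape such an objective might take.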
Overall I’m becoming significantly more optimistic that something like this will work (though still less likely than not). Trying to step back and see the big picture, it seems like there are three key active ingredients:
- Using “speed” instead of “simplicity” as the ~only requirement for these correspondences.
- Having separate correspondences for separate properties and not allowing them to share tons of computation with each other (to prevent re-running the whole simulation).
- Forcing the model to explain correlations, so that using an “embedded” copy of the answers (like a simulation of the data-generating process) forces you to reproduce the computation that produced that answer.
My next step would probably be looking at cases where these high-level ingredients aren’t sufficient (e.g. are there cases where “generate obs then do inference in the human model” is actually cheaper?). If they look pretty good, then I’ll spend some more time trying to fill in the details in a more plausible way.
- paulfchristiano 23 Jul 2021 0:52 UTC
  We might be able to get similar advantages with a more general proposal like:
  Fit a function f to a (Q, A) dataset with lots of questions about latent structure. Minimize the sum of some typical QA objective and the computational cost of verifying that f is consistent.
  Then the idea is that matching the conditional probabilities from the human’s model (or at least being consistent with what the human believes strongly about those conditional probabilities) essentially falls out of a consistency condition.
  It’s not clear how to actually formulate that consistency condition, but it seems like an improvement over the prior situation (which was just baking in the obviously-untenable requirement of exactly matching). It’s also not clear what happens if this consistency condition is soft.
  It’s not clear what “verify that the consistency conditions are met” means. You can always do the same proposal as in the parent, though it’s not really clear if that’s a convincing verification. But I think that’s a fundamental philosophical problem that both of these proposals need to confront.
  It’s not clear how to balance computational cost and the QA objective. But you are able to avoid most of the bad properties just by being on the Pareto frontier, and I don’t think this is worse than the prior proposal.
  Overall this approach seems like it could avoid making such strong structural assumptions about the underlying model. It also helps a lot with the overlapping explanations + uniformity problem. And it generally seems to be inching towards feeling plausible.
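To make the Pareto-frontier point above concrete, here is a minimal Python sketch that filters candidate functions by two objectives, QA loss and verification cost, where lower is better for both. The candidate names and all of the numbers are invented for illustration; nothing here measures a real model, and `pareto_frontier` is a hypothetical helper rather than part of the proposal.

```python
def pareto_frontier(candidates):
    """Return names of candidates not dominated in (qa_loss, verify_cost).

    A candidate is dominated if some other candidate is at least as good on
    both objectives and strictly better on at least one.
    """
    frontier = []
    for name, loss, cost in candidates:
        dominated = any(
            other_loss <= loss
            and other_cost <= cost
            and (other_loss < loss or other_cost < cost)
            for _, other_loss, other_cost in candidates
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Hypothetical (name, qa_loss, verify_cost) triples; all values are made up.
candidates = [
    ("intended_mapping", 0.10, 5.0),
    ("obs_then_human_inference", 0.10, 9.0),
    ("cheap_but_inconsistent", 0.40, 1.0),
    ("slow_and_inconsistent", 0.50, 8.0),
]

print(pareto_frontier(candidates))
# ['intended_mapping', 'cheap_but_inconsistent']
```

With these made-up numbers, the candidate that matches the intended mapping on QA loss but costs more to verify is dominated and drops out without any explicit weighting between the two objectives. Whether real candidates are ordered this way is the open question discussed above.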