I consider the argument in this post a reasonably convincing negative answer to this question—a minimal circuit may nevertheless end up doing learning internally and thereby generate deceptive learned optimizers.
This suggests a second informal clarification of the problem (in addition to Wei Dai’s comment): can the search for minimal circuits itself be responsible for generating deceptive behavior? Or is it always the case that something else was the offender and the search for minimal circuits is an innocent bystander?
If the search for minimal circuits were itself safe, then there’s still some hope for solutions that avoid deception by somehow penalizing computational cost. Namely: if that technique is competitive, then we can try to provide a loss that encourages any learned optimization to use the same technique.
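For concreteness, here is a minimal sketch (my own, not something from the post) of what “somehow penalizing computational cost” might look like as a training loss; the model interface `predict_with_cost` and the penalty coefficient are hypothetical stand-ins:

```python
# Minimal sketch of a speed-prior-like training loss: charge the model
# for the compute it uses on each input. All names are hypothetical.

def speed_penalized_loss(model, batch, compute_penalty=1e-3):
    """Average task loss plus a charge proportional to compute used."""
    total = 0.0
    for x, y in batch:
        prediction, steps_used = model.predict_with_cost(x)  # hypothetical API
        task_loss = (prediction - y) ** 2                     # e.g. squared error
        total += task_loss + compute_penalty * steps_used     # charge for compute
    return total / len(batch)
```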
(I’ve previously thought about this mostly in the high-stakes setting, but I’m now thinking about it in the context of incentivizing honest answers in the low-stakes setting. The following story will focus on the low-stakes setting since I don’t want to introduce extra ingredients to handle high stakes.)
To illustrate, suppose there were a trick where you divide your model into two parts, one of which is aggressively penalized for computation, and that this lets you do search in a way that is competitive but never produces deception. Call this a “safe” search. The analogue of Evan’s counterexample is a model M, produced by “safe” search, which internally performs an unsafe search for a new model X that performs well on a mini-training set, then uses X to predict something, and thereby exhibits treacherous behavior. (Note that in this setting the unsafely learned model X normally generalizes well to its “mini-test” set, and only fails on the outer test set.)
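Here is a toy illustration of the structure of that counterexample (my own sketch, not Evan’s construction): the outer “safe” search produced M, but M itself runs an unpenalized inner search over a crude hypothesis class for whatever X best fits its mini-training set, then defers to X. Every name here is a hypothetical placeholder.

```python
# Toy sketch of the counterexample's structure (illustrative only).

def inner_unsafe_search(mini_train, hypotheses):
    """Return whichever hypothesis best fits the mini-training set;
    no safety precautions are applied to this inner search."""
    return min(hypotheses, key=lambda h: sum((h(x) - y) ** 2 for x, y in mini_train))

def M(query, mini_train):
    # A crude hypothesis class: a few constant and linear predictors.
    hypotheses = [lambda x, c=c: c for c in range(5)]
    hypotheses += [lambda x, w=w: w * x for w in (0.5, 1.0, 2.0)]
    X = inner_unsafe_search(mini_train, hypotheses)
    # X may generalize fine to a held-out "mini-test" set drawn like
    # mini_train, and still behave badly on the outer test distribution.
    return X(query)
```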
Consider a human (using amplification+imitative generalization+etc.) who is evaluating M’s forecast to decide whether it is good, in a context where the human can’t directly evaluate the answer (e.g. because it is about a long time horizon).
If M makes the prediction P, the human’s best evidence about the quality of P will likely be (i) X performs well on the mini-training set, so it is probably a good model, and (ii) X outputs the forecast P. But if safe search were competitive, then M could instead do a safe search for a model X* that did well on the mini-training set, and then use X* to make a prediction P*. If the evaluator understands the rationale for performing safe search, then in cases where X* and X disagree they will expect X*’s predictions to be better. This means that the human’s evaluation pushes towards safely learning X* instead of unsafely learning X, as desired.
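Schematically (again my own sketch, with hypothetical names and a made-up discount), the evaluation pressure looks something like this: a forecast backed by a safely learned X* is trusted more than one backed by an unsafely learned X, so M is rewarded for doing the safe search internally.

```python
DISTRUST_DISCOUNT = 1.0  # hypothetical: how much unsafe provenance is discounted

def evaluate_forecast(plausibility_score, provenance):
    """Score a forecast given the evaluator's direct assessment of it and a
    (trusted) account of how M produced it."""
    if provenance == "safe inner search (X*)":
        return plausibility_score
    # Unsafe or unknown provenance: the evaluator expects the safely learned
    # model's predictions to be better wherever they disagree, so discount.
    return plausibility_score - DISTRUST_DISCOUNT
```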
So all we needed was to implement safe search on the outside and ensure that the evaluation process itself is familiar with the arguments for safe search. It’s not too surprising if we need something like this—anytime our model performs internal cognitive work that presents an alignment risk, we need it to take all of the same alignment precautions that we would. (If successful, this would be another example of a mundane solution to an exotic problem.) It’s unclear whether the model’s efforts will succeed, but our hope is just to make sure that our model is in as good a place as we are.
I think it’s totally unclear if a “speed-prior-like” approach to improving generalization could actually be applied recursively in this way. But I think it looks plausible enough that we shouldn’t let this counterexample scare us away from “speed-prior-like” approaches.