I assumed that when you talked about a model with “different heads” you meant that there is a shared backbone that computes a representation, that is then passed through two separate sequences of layers that don’t share any weights, and those separate sequences of layers were the “heads” f1 and f2.
Yep, that’s what I mean.
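For concreteness, here is a minimal sketch of the architecture being assumed (PyTorch, with arbitrary placeholder sizes; none of the names come from the post): a shared backbone computes a representation that is then fed to two heads that share no weights.

```python
import torch.nn as nn

class TwoHeadedModel(nn.Module):
    """Shared backbone feeding two heads (f1, f2) that share no weights.

    Layer sizes are arbitrary placeholders."""

    def __init__(self, in_dim=64, hidden_dim=128, out_dim=16):
        super().__init__()
        # part-which-shares-weights: computes a common representation
        self.backbone = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        # parts-which-don't-share-weights: the two heads f1 and f2
        self.f1 = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                                nn.Linear(hidden_dim, out_dim))
        self.f2 = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                                nn.Linear(hidden_dim, out_dim))

    def forward(self, x):
        z = self.backbone(x)           # shared representation
        return self.f1(z), self.f2(z)  # two answers from weight-disjoint heads
```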
Still, assuming that, with your desired part-which-shares-weights, every possible input to the parts-which-don’t-share-weights can be generated by some x, q (which seems like it will be close enough to true), the argument suggests that, conditioning on the desired part-which-shares-weights, you have just doubled the strength of the neural net prior on the parts-which-don’t-share-weights.
Note that conditioning on the part-which-shares-weights is definitely not what the prior is doing—the only conditioning in the prior is θ2 conditioning on θ1. If we look at the intended model, however, θ1 includes all of the parts-which-don’t-share-weights, while θ2 is entirely in the part-which-shares-weights.
Technically, I suppose, you can just take the prior and condition on anything you want—but it’s going to look really weird to condition on the part-which-shares-weights having some particular value without even knowing which parts came from θ1 and which came from θ2.
I do agree that, if θ1 were to specify the entire part-which-shares-weights and leave θ2 to fill in the parts-which-don’t-share-weights, then you would get exactly what you’re describing where θ2 would have a doubly-strong neural net prior on implementing the same function for both heads. But that’s only one particular arrangement of θ1—there are lots of other θ1s which induce very different distributions on θ2.
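Continuing the sketch above, and simplifying by treating θ1 and θ2 as literal subsets of the network’s parameters (which is not exactly how the conditional prior works), the two arrangements being contrasted look roughly like this:

```python
# Two ways of partitioning the same model's parameters into theta_1 and theta_2
# (a toy illustration; nothing fixes which parameters go where).
model = TwoHeadedModel()
params = dict(model.named_parameters())

# Arrangement A (the one described in the quoted argument): theta_1 fixes the shared
# backbone and theta_2 fills in both heads, so theta_2 faces a doubly-strong
# neural-net prior pushing the two heads toward implementing the same function.
theta1_A = {k: v for k, v in params.items() if k.startswith("backbone")}
theta2_A = {k: v for k, v in params.items() if k not in theta1_A}

# Arrangement B (the intended model): theta_1 covers the parts-which-don't-share-weights
# (the heads), and theta_2 lives entirely in the part-which-shares-weights.
theta1_B = {k: v for k, v in params.items() if k.startswith(("f1", "f2"))}
theta2_B = {k: v for k, v in params.items() if k not in theta1_B}
```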
This seems to suggest that f+,f− are different functions, i.e. there’s some input on which they disagree.
Note that the inputs to f+,f− are deduced statements, not raw data. They are certainly different functions over the space of all possible deduced statements—but once you put a correct world model in them, they should produce equivalent X×Q→A maps.
Then I’m confused what you meant by:
I’m not sure what you mean by this part—f1 and f2 are just different heads, not entirely different models, so I’m not sure what you mean by “the parameters in f1.”
Seems like if the different heads do not share weights then “the parameters in f1” is perfectly well-defined?
Note that conditioning on the part-which-shares-weights is definitely not what the prior is doing
Yeah, sorry, by “conditioning” there I meant “assuming that the algorithm correctly chose the right world model in the end”; I wasn’t trying to describe a particular step in the algorithm. But in any case I don’t think we need to talk about that.
They are certainly different functions over the space of all possible deduced statements—but once you put a correct world model in them, they should produce equivalent X×Q→A maps.
Okay, so iiuc you’re relying on an assumption (fact? desire?) that the world model will never produce deduced statements that distinguish between f+ and f−? My understanding of f+ and f− comes from here:
Specifically, f+ is the “honest embedding” which directly converts between logical statements and their equivalent natural language, thus answering questions by embedding q as a logical statement and unembedding its answer in deduced_stmts. Conversely, f− is the “mimicry embedding” which just searches for deductions about what a human would say in response to q and outputs that—thus, f− just quotes q, embedding it as just a string of characters for a human to respond to, rather than actually having to understand it in any meaningful way.
If f+ and f− produce equivalent X×Q→A maps, doesn’t that mean that we’ve just gotten something that can only respond as well as a human? Wouldn’t that be a significant limitation? (E.g. given that I don’t know German, if my question to the model is “what does <german phrase> mean”, does the model have to respond “I don’t know”?)
In addition, since the world model will never produce deduced statements that distinguish between f+ and f−, it seems like the world model could never produce decision-relevant deduced statements that the human wouldn’t have realized. This seems both (a) hard to enforce and (b) a huge capability hit.
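To make the f+/f− distinction quoted above concrete, here is a toy sketch with made-up data structures (an illustration, not the actual algorithm): both heads read deduced statements rather than raw data, f+ by translating q into the world model’s logical language, f− by quoting q and looking up what the human would say.

```python
# Toy illustration of f+ vs. f- over deduced statements (made-up data structures).

# Hypothetical deduced statements produced by the world model for some (x, q):
deduced_stmts = {
    ("capital_of", "France"): "Paris",                                # a logical fact
    ("human_would_say", "What is the capital of France?"): "Paris",   # a prediction about the human
    ("human_would_say", "What does <german phrase> mean?"): "I don't know",
}

def f_plus(q, stmts):
    """Honest embedding: embed q as a logical statement and unembed its deduced answer."""
    logical_q = ("capital_of", "France") if "capital of France" in q else None  # stand-in for a real translation
    return stmts.get(logical_q)

def f_minus(q, stmts):
    """Mimicry embedding: quote q as a raw string and output what a human would say to it."""
    return stmts.get(("human_would_say", q))

q = "What is the capital of France?"
assert f_plus(q, deduced_stmts) == f_minus(q, deduced_stmts) == "Paris"  # agree when the human gets it right
```

In this toy version the <german phrase> question is exactly a case where f− reports the human’s “I don’t know” while f+ would need a real translation to do better.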
Seems like if the different heads do not share weights then “the parameters in f1” is perfectly well-defined?
It seemed to me like you were using it in a way such that f1 shared no weights with f2, which I think was because you were confused by the quantification, like you said previously. I think we’re on the same page now.
Okay, so iiuc you’re relying on an assumption (fact? desire?) that the world model will never produce deduced statements that distinguish between f+ and f−?
Sorry, I was unclear about this in my last response. f+ and f− will only agree in cases where the human understands what’s happening. In the dataset version, we get that by collecting a dataset where we think the human always gets it right, whereas in the dataset-less version, we get that by including the H_understands check, which ensures that we don’t have to satisfy the condition when the human would get the question wrong.
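A rough sketch of what that dataset-less check amounts to, as described above (hypothetical helper names, not code from the post):

```python
# Rough sketch of the dataset-less f? condition (hypothetical helpers; H_understands,
# f_plus, and f_minus are assumed to be supplied by the model being checked).

def f_condition_holds(model, inputs):
    """Wherever the human understands what's happening, the two heads must agree;
    inputs the human would get wrong impose no constraint at all."""
    for x, q in inputs:
        if model.H_understands(x, q) and model.f_plus(x, q) != model.f_minus(x, q):
            return False
    return True
```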
I think I might be missing a change you made to the algorithm. Can θ1 write an arbitrary program for the f? condition? In that case, what prevents you from getting a model where θ1 specifies W−H and f− but sets the f? condition to just True, leaving θ2 to specify H? It seems like this should be lower complexity than the intended result, since True has much lower complexity than H_understands?
It seemed to me like you were using it in a way such that f1 shared no weights with f2
I mean, I would still have said this because I interpret a “head” f1 as “the part after the shared layers”, but I’m also happy to instead treat f1 as the entire function X×Q→A for which the first head forms part of the implementation.
Yes—at least that’s the assumption I’m working under.
It seems like this should be lower complexity than the intended result, since True has much lower complexity than H_understands?
I agree that the θ1 you’ve described has lower complexity than the intended θ1—but the θ2 in this case has higher complexity, since θ2 is no longer getting any of its complexity for free from conditioning on the f? condition. And in fact what you’ve just described is precisely the unintended model—what I call M−—that I’m trying to compete against, with the hope being that the savings that M+ gives you in θ2 are sufficient to compensate for the loss in having to specify f+ and H_understands in θ1.
If we calculate the complexity of your proposal, we get
$$\begin{aligned}
\text{complexity}(M^-) &= \text{complexity}(\theta^-_1) + \text{complexity}(\theta^-_2 \mid M^-|_{f_?}) \\
&= \text{complexity}(W - H) + \text{complexity}(f^-) + \text{complexity}(H \mid \text{True}) \\
&= \text{complexity}(W - H) + \text{complexity}(f^-) + \text{complexity}(H) \\
&\approx \text{complexity}(W)
\end{aligned}$$
whereas, if we calculate the complexity of the intended M+, we get
$$\begin{aligned}
\text{complexity}(M^+) &= \text{complexity}(\theta^+_1) + \text{complexity}(\theta^+_2 \mid M^+|_{f_?}) \\
&= \text{complexity}(W - H) + \text{complexity}(f^-) + \text{complexity}(f^+) + \text{complexity}(\text{H\_understands}) \\
&\quad + \text{complexity}(H \mid \text{H\_understands} \rightarrow f^+ = f^-) \\
&\approx \text{complexity}(W - H) + \text{complexity}(f^+) + \text{complexity}(\text{H\_understands}) + \text{complexity}(H) \\
&\quad - \min_{\theta_2}\left\{\text{complexity}(\theta_2) \;\middle|\; \text{H\_understands}_{H=\theta_2} \rightarrow f^+_{H=\theta_2} = f^-_{H=\theta_2}\right\} \\
&\approx \text{complexity}(W) + \text{complexity}(f^+) + \text{complexity}(\text{H\_understands}) \\
&\quad - \min_{\theta_2}\left\{\text{complexity}(\theta_2) \;\middle|\; \text{H\_understands}_{H=\theta_2} \rightarrow f^+_{H=\theta_2} = f^-_{H=\theta_2}\right\}
\end{aligned}$$
such that you can see that the question of which one wins is precisely dependent on whether the savings from conditioning on H_understands→f+=f− offsets the cost of having to specify f+ and H_understands.
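To make the trade-off concrete, here is a toy bit-count comparison with entirely made-up numbers (nothing here comes from the post); it just shows that M+ wins exactly when the min-over-θ2 savings exceed the cost of specifying f+ and H_understands:

```python
# Toy bit-count comparison with made-up numbers, illustrating the trade-off above.
c_W_minus_H = 1_000_000  # complexity(W - H): the world model minus the human model
c_H         = 100_000    # complexity(H): the human model
c_f_minus   = 1_000      # complexity(f-): the mimicry head
c_f_plus    = 5_000      # complexity(f+): the honest head
c_H_und     = 2_000      # complexity(H_understands)
savings     = 10_000     # the min-over-theta_2 term: bits of H obtained for free by
                         # conditioning on H_understands -> f+ = f-

complexity_M_minus = c_W_minus_H + c_f_minus + c_H
complexity_M_plus  = c_W_minus_H + c_f_minus + c_f_plus + c_H_und + c_H - savings

print(complexity_M_plus < complexity_M_minus)  # True with these numbers
print(savings > c_f_plus + c_H_und)            # equivalent condition: savings beat the extra cost
```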
such that you can see that the question of which one wins is precisely dependent on whether the savings from conditioning on H_understands→f+=f− offsets the cost of having to specify f+ and H_understands.
Yeah, that makes sense. I guess I don’t really see the intuition about why this should be true, but fair enough to leave that as an open question.