Seems like if the different heads do not share weights then “the parameters in f1” is perfectly well-defined?
It seemed to me like you were using it in a way such that f1 shared no weights with f2, which I think was because you were confused by the quantification, like you said previously. I think we’re on the same page now.
Okay, so iiuc you’re relying on an assumption (fact? desire?) that the world model will never produce deduced statements that distinguish between f+ and f−?
Sorry, I was unclear about this in my last response. f+ and f− will only agree in cases where the human understands what’s happening. In the dataset version, we get that by collecting a dataset where we think the human always gets it right, whereas in the dataset-less version, we get that by including the H_understands check, which ensures that we don’t have to satisfy the condition when the human would get the question wrong.
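To make the shape of that check concrete, here is a toy sketch (the names h_understands, f_plus, and f_minus are illustrative stand-ins, not the actual machinery of the proposal):

```python
def count_violations(examples, f_plus, f_minus, h_understands):
    """Count cases where f+ and f- are required to agree but don't.

    examples: iterable of (x, q) pairs
    f_plus, f_minus: callables (x, q) -> answer
    h_understands: callable (x, q) -> bool, True when the human would
        answer (x, q) correctly
    """
    violations = 0
    for x, q in examples:
        # The condition is H_understands -> (f+ = f-), so it is vacuously
        # satisfied whenever the human would get the question wrong.
        if h_understands(x, q) and f_plus(x, q) != f_minus(x, q):
            violations += 1
    return violations
```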
It seemed to me like you were using it in a way such that f1 shared no weights with f2
I mean, I would still have said this because I interpret a “head” f1 as “the part after the shared layers”, but I’m also happy to instead treat f1 as the entire function X×Q→A for which the first head forms part of the implementation.
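Concretely, the picture I have in mind is roughly the following (just an illustrative sketch; the names shared, head1, and head2 are mine, not from the post):

```python
# Toy two-headed model: a shared trunk followed by two separate heads.
def shared(x, q):
    """Shared layers, e.g. the world model reused by both heads."""
    ...

def head1(z):
    """First head: just the parameters after the shared layers."""
    ...

def head2(z):
    """Second head: likewise, with its own (unshared) parameters."""
    ...

# On the narrow reading, "f1" is head1 alone; on the broader reading, f1 is
# the whole function X x Q -> A whose implementation the first head is part of:
def f1(x, q):
    return head1(shared(x, q))

def f2(x, q):
    return head2(shared(x, q))
```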
Yes—at least that’s the assumption I’m working under.
I think I might be missing a change you made to the algorithm. Can θ1 write an arbitrary program for f?? In that case, what prevents you from getting θ1 = (W−H, f−, f? = True) with θ2 = H? It seems like this should be lower complexity than the intended result, since True has much lower complexity than H_understands?
It seems like this should be lower complexity than the intended result, since True has much lower complexity than H_understands?
I agree that the θ1 you’ve described has lower complexity than the intended θ1—but the θ2 in this case has higher complexity, since θ2 is no longer getting any of its complexity for free from conditioning on the f? condition. And in fact what you’ve just described is precisely the unintended model—what I call M−—that I’m trying to compete against, with the hope being that the savings that M+ gives you in θ2 are sufficient to compensate for the loss in having to specify f+ and H_understands in θ1.
If we calculate the complexity of your proposal, we get
complexity(M−)
 = complexity(θ−1) + complexity(θ−2 | M−|f?)
 = complexity(W−H) + complexity(f−) + complexity(H | True)
 = complexity(W−H) + complexity(f−) + complexity(H)
 ≈ complexity(W)
whereas, if we calculate the complexity of the intended M+, we get
complexity(M+)
 = complexity(θ+1) + complexity(θ+2 | M+|f?)
 = complexity(W−H) + complexity(f−) + complexity(f+) + complexity(H_understands) + complexity(H | H_understands → f+ = f−)
 ≈ complexity(W−H) + complexity(f+) + complexity(H_understands) + complexity(H) − min_{θ2} {complexity(θ2) | H_understands_{H=θ2} → f+_{H=θ2} = f−_{H=θ2}}
 ≈ complexity(W) + complexity(f+) + complexity(H_understands) − min_{θ2} {complexity(θ2) | H_understands_{H=θ2} → f+_{H=θ2} = f−_{H=θ2}}
such that you can see that the question of which one wins is precisely dependent on whether the savings from conditioning on H_understands→f+=f− offsets the cost of having to specify f+ and H_understands.
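Or, collapsing the two estimates above (this is just the difference of the two calculations, up to the same approximations): M+ wins precisely when
min_{θ2} {complexity(θ2) | H_understands_{H=θ2} → f+_{H=θ2} = f−_{H=θ2}} ≥ complexity(f+) + complexity(H_understands).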
such that you can see that the question of which one wins is precisely dependent on whether the savings from conditioning on H_understands→f+=f− offsets the cost of having to specify f+ and H_understands.
Yeah, that makes sense. I guess I don’t really see the intuition about why this should be true, but fair enough to leave that as an open question.