It seemed to me like you were using it in a way such that f1 shared no weights with f2
I mean, I would still have said this because I interpret a “head” f1 as “the part after the shared layers”, but I’m also happy to instead treat f1 as the entire function X×Q→A for which the first head forms part of the implementation.
Yes—at least that’s the assumption I’m working under.
It seems like this should be lower complexity than the intended result, since True has much lower complexity than H_understands?
I agree that the θ1 you’ve described has lower complexity than the intended θ1—but the θ2 in this case has higher complexity, since θ2 is no longer getting any of its complexity for free from conditioning on the f? condition. And in fact what you’ve just described is precisely the unintended model—what I call M−—that I’m trying to compete against, with the hope being that the savings that M+ gives you in θ2 are sufficient to compensate for the loss in having to specify f+ and H_understands in θ1.
If we calculate the complexity of your proposal, we get
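$$
\begin{aligned}
\text{complexity}(M^-) &= \text{complexity}(\theta_1^-) + \text{complexity}(\theta_2^- \mid M^-|_{f?}) \\
&= \text{complexity}(W - H) + \text{complexity}(f^-) + \text{complexity}(H \mid \text{True}) \\
&= \text{complexity}(W - H) + \text{complexity}(f^-) + \text{complexity}(H) \\
&\approx \text{complexity}(W)
\end{aligned}
$$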
whereas, if we calculate the complexity of the intended M+, we get
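$$
\begin{aligned}
\text{complexity}(M^+) &= \text{complexity}(\theta_1^+) + \text{complexity}(\theta_2^+ \mid M^+|_{f?}) \\
&= \text{complexity}(W - H) + \text{complexity}(f^-) + \text{complexity}(f^+) + \text{complexity}(\text{H\_understands}) + \text{complexity}(H \mid \text{H\_understands} \to f^+ = f^-) \\
&\approx \text{complexity}(W - H) + \text{complexity}(f^+) + \text{complexity}(\text{H\_understands}) + \text{complexity}(H) - \min_{\theta_2}\{\text{complexity}(\theta_2) \mid \text{H\_understands}|_{H=\theta_2} \to f^+|_{H=\theta_2} = f^-|_{H=\theta_2}\} \\
&\approx \text{complexity}(W) + \text{complexity}(f^+) + \text{complexity}(\text{H\_understands}) - \min_{\theta_2}\{\text{complexity}(\theta_2) \mid \text{H\_understands}|_{H=\theta_2} \to f^+|_{H=\theta_2} = f^-|_{H=\theta_2}\}
\end{aligned}
$$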
such that you can see that the question of which one wins is precisely dependent on whether the savings from conditioning on H_understands→f+=f− offsets the cost of having to specify f+ and H_understands.
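As a rough sketch of that comparison, the two approximations can be subtracted from one another. The symbol S is a shorthand introduced here for the savings term and is not notation from the discussion itself; the result inherits the same approximations as the two expressions it is built from, including treating complexity(f−) as negligible.

```latex
\begin{aligned}
S &:= \min_{\theta_2}\{\text{complexity}(\theta_2) \mid \text{H\_understands}|_{H=\theta_2} \to f^+|_{H=\theta_2} = f^-|_{H=\theta_2}\} \\
\text{complexity}(M^+) - \text{complexity}(M^-) &\approx \text{complexity}(f^+) + \text{complexity}(\text{H\_understands}) - S \\
\text{complexity}(M^+) < \text{complexity}(M^-) &\iff S > \text{complexity}(f^+) + \text{complexity}(\text{H\_understands}) \quad \text{(approximately)}
\end{aligned}
```

Read this way, M+ is favored only when the savings S exceed the combined cost of specifying f+ and H_understands. Whether that holds in practice is the open question.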
Yeah, that makes sense. I guess I don’t really see the intuition about why this should be true, but fair enough to leave that as an open question.
I think I might be missing a change you made to the algorithm. Can θ1 write an arbitrary program for f?? In that case, what prevents you from getting
It seems like this should be lower complexity than the intended result, since True has much lower complexity than H_understands?