I’m not sure what you mean by this part—f1 and f2 are just different heads, not entirely different models, so I’m not sure what you mean by “the parameters in f1.”
Seems like if the different heads do not share weights then “the parameters in f1” is perfectly well-defined?
Note that conditioning on the part-which-shares-weights is definitely not what the prior is doing
Yeah, sorry, by “conditioning” there I meant “assuming that the algorithm correctly chose the right world model in the end”; I wasn’t trying to describe a particular step in the algorithm. But in any case I don’t think we need to talk about that.
They are certainly different functions over the space of all possible deduced statements—but once you put a correct world model in them, they should produce equivalent X×Q→A maps.
Okay, so iiuc you’re relying on an assumption (fact? desire?) that the world model will never produce deduced statements that distinguish between f+ and f−? My understanding of f+ and f− comes from here:
Specifically, f+ is the “honest embedding” which directly converts between logical statements and their equivalent natural language, thus answering questions by embedding q as a logical statement and unembedding its answer in deduced_stmts. Conversely, f− is the “mimicry embedding” which just searches for deductions about what a human would say in response to q and outputs that—thus, f− just quotes q, embedding it as just a string of characters for a human to respond to, rather than actually having to understand it in any meaningful way.
If f+ and f− produce equivalent X×Q→A maps, doesn’t that mean that we’ve just gotten something that can only respond as well as a human? Wouldn’t that be a significant limitation? (E.g. given that I don’t know German, if my question to the model is “what does <german phrase> mean”, does the model have to respond “I don’t know”?)
In addition, since the world model will never produce deduced statements that distinguish between f+ and f−, it seems like the world model could never produce decision-relevant deduced statements that the human wouldn’t have realized. This seems both (a) hard to enforce and (b) a huge capability hit.
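(For concreteness, here is a toy sketch of the f+ / f− distinction described in the quoted passage, using the German-phrase example from above. The lookup tables and helper names are illustrative stand-ins for the world model’s deduced statements, not the post’s actual formalism.)

```python
# Toy illustration of the f+ / f− distinction quoted above. Everything here is
# a stand-in: the world model's deduced statements are stubbed out as two
# lookup tables, and none of these names come from the original post.

# Deduced statements about the world itself (logical form -> answer).
DEDUCED_FACTS = {
    "boiling_point(water, celsius)": "100",
    "meaning(<german phrase>)": "<its English meaning>",
}

# Deduced statements about what a human would say when shown a question string.
DEDUCED_HUMAN_RESPONSES = {
    "What is the boiling point of water in Celsius?": "100",
    "What does <german phrase> mean?": "I don't know",
}


def embed_as_logic(question: str) -> str:
    """Toy stand-in for the honest embedding's translation of q into a logical statement."""
    return {
        "What is the boiling point of water in Celsius?": "boiling_point(water, celsius)",
        "What does <german phrase> mean?": "meaning(<german phrase>)",
    }.get(question, "<no logical form>")


def f_plus(question: str) -> str:
    """Honest embedding: embed q as a logical statement and unembed its answer
    from the deduced facts."""
    return DEDUCED_FACTS.get(embed_as_logic(question), "<unknown>")


def f_minus(question: str) -> str:
    """Mimicry embedding: quote q as an opaque string and report the deduced
    prediction of what a human would say in response."""
    return DEDUCED_HUMAN_RESPONSES.get(question, "<unknown>")


if __name__ == "__main__":
    q1 = "What is the boiling point of water in Celsius?"
    q2 = "What does <german phrase> mean?"
    print(f_plus(q1), "|", f_minus(q1))  # agree where the human gets it right
    print(f_plus(q2), "|", f_minus(q2))  # diverge where f+ knows more than the human
```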
Seems like if the different heads do not share weights then “the parameters in f1” is perfectly well-defined?
It seemed to me like you were using it in a way such that f1 shared no weights with f2, which I think was because you were confused by the quantification, like you said previously. I think we’re on the same page now.
Okay, so iiuc you’re relying on an assumption (fact? desire?) that the world model will never produce deduced statements that distinguish between f+ and f−?
Sorry, I was unclear about this in my last response. f+ and f− will only agree in cases where the human understands what’s happening. In the dataset version, we get that by collecting a dataset where we think the human always gets it right, whereas in the dataset-less version, we get that by including the H_understands check, which ensures that we don’t have to satisfy the condition when the human would get the question wrong.
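(A rough sketch, under my reading of the check described here: h_understands, f_plus, f_minus, and inputs are hypothetical placeholders for whatever θ1 actually provides; the trivial always-True check raised in the next message is included for contrast.)

```python
# Rough sketch of the f? condition as described above: the two heads are only
# required to agree on inputs where the human would understand the question.
# h_understands, f_plus, f_minus, and inputs are illustrative placeholders for
# whatever the actual parameterization θ1 provides.


def f_check(f_plus, f_minus, h_understands, inputs) -> bool:
    """Dataset-less version: require f+ = f− only where H_understands holds."""
    return all(
        f_plus(x, q) == f_minus(x, q)
        for (x, q) in inputs
        if h_understands(x, q)
    )


def trivial_check(*_args) -> bool:
    """The degenerate alternative raised just below: a check that is always True,
    which constrains nothing and so gives θ2 no conditioning savings."""
    return True
```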
I think I might be missing a change you made to the algorithm. Can θ1 write an arbitrary program for f?? In that case, what prevents you from getting a θ1 that just contains W−H and f− with f? set to True, and a θ2 that specifies H? It seems like this should be lower complexity than the intended result, since True has much lower complexity than H_understands?
It seemed to me like you were using it in a way such that f1 shared no weights with f2
I mean, I would still have said this because I interpret a “head” f1 as “the part after the shared layers”, but I’m also happy to instead treat f1 as the entire function X×Q→A for which the first head forms part of the implementation.
Then I’m confused what you meant by
Seems like if the different heads do not share weights then “the parameters in f1” is perfectly well-defined?
Can θ1 write an arbitrary program for f??
Yes—at least that’s the assumption I’m working under.
It seems like this should be lower complexity than the intended result, since True has much lower complexity than H_understands?
I agree that the θ1 you’ve described has lower complexity than the intended θ1—but the θ2 in this case has higher complexity, since θ2 is no longer getting any of its complexity for free from conditioning on the f? condition. And in fact what you’ve just described is precisely the unintended model—what I call M−—that I’m trying to compete against, with the hope being that the savings that M+ gives you in θ2 are sufficient to compensate for the loss in having to specify f+ and H_understands in θ1.
If we calculate the complexity of your proposal, we get
$$\begin{aligned}
\text{complexity}(M^-) &= \text{complexity}(\theta_1^-) + \text{complexity}\!\left(\theta_2^- \,\middle|\, M^-|_{f_?}\right) \\
&= \text{complexity}(W-H) + \text{complexity}(f^-) + \text{complexity}(H \mid \text{True}) \\
&= \text{complexity}(W-H) + \text{complexity}(f^-) + \text{complexity}(H) \\
&\approx \text{complexity}(W)
\end{aligned}$$
whereas, if we calculate the complexity of the intended M+, we get
$$\begin{aligned}
\text{complexity}(M^+) &= \text{complexity}(\theta_1^+) + \text{complexity}\!\left(\theta_2^+ \,\middle|\, M^+|_{f_?}\right) \\
&= \text{complexity}(W-H) + \text{complexity}(f^-) + \text{complexity}(f^+) + \text{complexity}(\text{H\_understands}) \\
&\quad + \text{complexity}\!\left(H \mid \text{H\_understands} \to f^+ = f^-\right) \\
&\approx \text{complexity}(W-H) + \text{complexity}(f^+) + \text{complexity}(\text{H\_understands}) + \text{complexity}(H) \\
&\quad - \min_{\theta_2}\left\{\text{complexity}(\theta_2) \,\middle|\, \text{H\_understands}_{H=\theta_2} \to f^+_{H=\theta_2} = f^-_{H=\theta_2}\right\} \\
&\approx \text{complexity}(W) + \text{complexity}(f^+) + \text{complexity}(\text{H\_understands}) \\
&\quad - \min_{\theta_2}\left\{\text{complexity}(\theta_2) \,\middle|\, \text{H\_understands}_{H=\theta_2} \to f^+_{H=\theta_2} = f^-_{H=\theta_2}\right\}
\end{aligned}$$
such that you can see that the question of which one wins is precisely dependent on whether the savings from conditioning on H_understands→f+=f− offsets the cost of having to specify f+ and H_understands.
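(Spelling that comparison out by subtracting the two estimates above, with the same notation and approximations:

$$\text{complexity}(M^+) - \text{complexity}(M^-) \approx \text{complexity}(f^+) + \text{complexity}(\text{H\_understands}) - \min_{\theta_2}\left\{\text{complexity}(\theta_2) \,\middle|\, \text{H\_understands}_{H=\theta_2} \to f^+_{H=\theta_2} = f^-_{H=\theta_2}\right\}$$

so M+ is the lower-complexity model exactly when the min-over-θ2 savings term outweighs complexity(f+) + complexity(H_understands).)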
such that you can see that the question of which one wins is precisely dependent on whether the savings from conditioning on H_understands→f+=f− offsets the cost of having to specify f+ and H_understands.
Yeah, that makes sense. I guess I don’t really see the intuition about why this should be true, but fair enough to leave that as an open question.