It seems like at this point your prior is “generate parameters randomly under the constraint that the two heads are identical”
That’s not what the prior looks like—the prior is more like “generate parameters that specify some condition, then sample parameters that make that condition true.” Thus, you don’t need to pay for the complexity of satisfying the condition, only the complexity of specifying it (as long as you’re content with the simplest possible way to satisfy it). This is why the two-step nature of the algorithm is necessary—the prior you’re describing is what would happen if you used a one-step algorithm rather than a two-step algorithm (which I agree would then not do anything).
Hmm, I’m not thinking about the complexity part at all right now; I’m just thinking mechanically about what is implied by your equations.
the prior is more like “generate parameters that specify some condition, then sample parameters that make that condition true.”
I’m not sure exactly what you mean by the parameters specifying some condition. I thought the condition was specified upfront by the designer (though of course to check the condition you need to look at both parameters, so you can view this as the first set of parameters specifying a condition on the second set of parameters). As far as I can tell, the intended condition is “the two heads are identical” in the dataset-less case. Looking directly at the math, the equations you have are:
θ1∼p(θ1)
θ2∼p(θ2 | θ1)⋅I[∀x∈X. ∀q∈Q. Mθ1,θ2|f?(x,q)]
My interpretation is:
Generate θ1 randomly.
Generate θ2 randomly from θ1, subject to the constraint that the two heads output the same value on all possible inputs.
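(Concretely, the picture I have is something like the following toy rejection-sampling sketch; the parameter space, heads, and numbers are all made up rather than anything from the post:)

```python
import random

# Toy stand-ins: a couple of integer "weights" per parameter block, a tiny
# input space, and two heads that both read both parameter blocks.
PARAM_SPACE = range(5)
INPUTS = [(x, q) for x in range(4) for q in range(4)]  # toy X x Q

def heads(theta1, theta2, x, q):
    shared = (theta1[0] * x + theta1[1] * q) % 5
    f1 = (shared + theta2[0]) % 5
    f2 = (shared + theta2[1]) % 5
    return f1, f2

def condition(theta1, theta2):
    # the indicator I[forall x in X, q in Q: the two heads agree]
    return all(h1 == h2 for h1, h2 in
               (heads(theta1, theta2, x, q) for x, q in INPUTS))

def sample_two_stage_prior():
    # Step 1: theta1 ~ p(theta1), unconditionally.
    theta1 = [random.choice(PARAM_SPACE) for _ in range(2)]
    # Step 2: theta2 ~ p(theta2 | theta1) * I[condition], here by rejection.
    while True:
        theta2 = [random.choice(PARAM_SPACE) for _ in range(2)]
        if condition(theta1, theta2):
            return theta1, theta2

theta1, theta2 = sample_two_stage_prior()
```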
Imagine there was a bijection between model parameters and resulting function. (I’m aware this is not at all true.) In that case it seems like you are enforcing the constraint that the two heads have identical parameters. In which case you could just have generated parameters for the first head, and then copied them over into the second head, rather than go through this complicated setup.
Now, there isn’t actually a bijection between model parameters and resulting function. But it seems like the only difference is that you make it more likely that you sample heads which have lots of different implementations in model parameters, i.e. you’re doubling the strength of the neural net prior (and that’s the only effect). This seems undesirable?
Hmm, I’m not thinking about the complexity part at all right now; I’m just thinking mechanically about what is implied by your equations.
The only difference between this setup and normal ML is the prior/complexity—you still have the ability to learn all the same functions, it’s just that some are more/less likely now.
though of course to check the condition you need to look at both parameters, so you can view this as the first set of parameters specifying a condition on the second set of parameters
Yep, that’s exactly right.
Imagine there was a bijection between model parameters and resulting function. (I’m aware this is not at all true.) In that case it seems like you are enforcing the constraint that the two heads have identical parameters.
That’s definitely not what should happen in that case. Note that there is no relation between θ1 and f1 or θ2 and f2—both sets of parameters contribute equally to both heads. Thus, θ1 can enforce any condition it wants on θ2 by leaving some particular hole in how it computes f1 and f2 and forcing θ2 to fill in that hole in such a way to make θ1’s computation of the two heads come out equal.
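(A degenerate toy of the "hole" idea, purely illustrative: in the real picture both parameter blocks feed both heads, but even this crude split shows how θ1 can pay only for specifying a condition, here the coefficients (3, 2), while θ2's whole lookup table is forced by the condition rather than paid for. All names and numbers are invented:)

```python
import itertools
import random

XS = range(5)

def f1(theta1, theta2, x):
    a, b = theta1
    return (a * x + b) % 7          # theta1 computes head 1 directly

def f2(theta1, theta2, x):
    return theta2[x]                # head 2 routes through the hole left for theta2

def condition(theta1, theta2):
    return all(f1(theta1, theta2, x) == f2(theta1, theta2, x) for x in XS)

theta1 = (3, 2)  # stage 1: theta1 "specifies" the condition (encode 3x + 2 mod 7)
# stage 2: theta2 must fill the hole so the heads agree on every input
valid = [t2 for t2 in itertools.product(range(7), repeat=len(XS))
         if condition(theta1, t2)]
theta2 = random.choice(valid)       # only the table x -> (3x + 2) % 7 survives
```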
The only difference between this setup and normal ML is the prior/complexity—you still have the ability to learn all the same functions, it’s just that some are more/less likely now.
Yeah, sorry, I wasn’t clear here—I meant that, rather than reasoning about the complexity of individual pieces / stages and then adding them all up at the end, I am instead simulating out the equations until both θ1 and θ2 are chosen, and then reasoning about the thing you get afterwards.
Note that there is no relation between θ1 and f1 or θ2 and f2—both sets of parameters contribute equally to both heads. Thus, θ1 can enforce any condition it wants on θ2 by leaving some particular hole in how it computes f1 and f2 and forcing θ2 to fill in that hole in such a way to make θ1’s computation of the two heads come out equal.
Yes, I think I understand that. (I want to note that since θ1 is chosen randomly, it isn’t “choosing” the condition on θ2; rather the wide distribution over θ1 leads to a wide distribution over possible conditions on θ2. But I think that’s what you mean.)
That’s definitely not what should happen in that case.
I think you misunderstood what I was claiming. Let me try again, without using the phrase “enforcing the constraint”, which I think was the problem.
Imagine there was a bijection between model parameters and resulting function. In Stage 1 you sample θ1 randomly. In Stage 2, you sample θ2, such that it fills in the holes in f1 and f2 to make f1 and f2 compute the same function. By our bijection assumption, the parameters in f1 must be identical to the parameters in f2. Thus, we can conclude the following:
If θ1 contained a parameter from f1 and f2 in the same location (e.g. it includes the weight at position (3, 5) in layer 3 in both f1 and f2), then it must have assigned the same value to both of them.
If θ1 contained a parameter from f1 and θ2 contained the corresponding parameter from f2, then θ2 must have set that parameter to the same value as in θ1.
If θ2 contained a parameter from f1 and f2 in the same location, then it must have assigned the same value to both of them.
These constraints are necessary and sufficient to satisfy the overall constraint that f1=f2, and therefore any other parameters in θ2 are completely unconstrained and are set according to the original neural net prior.
So it seems to me that (1) any parameters not in f1 or f2 are set according to the original neural net prior, and (2) parameters in f1 must be identical to the corresponding parameters in f2, but their values are chosen according to the neural net prior.
This seems equivalent to having a single head f1, sampling its parameters from the original prior, and then copying those parameters into f2.
I think you should already be pretty worried by the fact that this seems to give weird results when assuming a bijection between model parameters and resulting functions, but let’s analyze it without the bijection assumption too:
Since f1 and f2 have to be identical on all inputs, it doesn’t matter what input they get, and therefore there is no constraint on the part of the neural net that is generating the inputs. So, we still get (1): any parameters not in f1 or f2 are set according to the original neural net prior. (2) is no longer true, but instead of getting that parameters in f1 are equivalent to parameters in f2, we get that the function implemented by f1 is equivalent to the function implemented by f2. Since ultimately the generating process is “sample parameters until f1=f2”, the probability of getting a particular function f is proportional to the square of the probability of generating parameters for that function Pθ∼OrigPrior(Mθ=f) (since you have to successfully generate the function twice). So, you are doubling the strength of the neural net prior in the heads, and leaving the strength the same in the world model (i.e. all parts except for the head).
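(A tiny simulation of the squaring claim, using a made-up four-point parameter space in which one function has a single implementation and the other has three:)

```python
import random
from collections import Counter

def func_of(w):
    # 4 equally likely parameter settings, but only 2 distinct functions:
    # one has a single implementation, the other has three.
    return "rare_func" if w == 0 else "common_func"

agree = Counter()
for _ in range(200_000):
    w1, w2 = random.randrange(4), random.randrange(4)  # two independent heads
    if func_of(w1) == func_of(w2):                     # keep only f1 == f2
        agree[func_of(w1)] += 1

total = sum(agree.values())
print({f: round(n / total, 3) for f, n in agree.items()})
# the 1:3 prior over functions becomes roughly 1:9 after conditioning on agreement,
# i.e. proportional to the square of the prior.
```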
Yeah, sorry, I wasn’t clear here—I meant that, rather than reasoning about the complexity of individual pieces / stages and then adding them all up at the end, I am instead simulating out the equations
Sure, makes sense—theoretically, that should be isomorphic.
I want to note that since θ1 is chosen randomly, it isn’t “choosing” the condition on θ2; rather the wide distribution over θ1 leads to a wide distribution over possible conditions on θ2. But I think that’s what you mean.
This seems like a case where I’m using the more constructive formulation of simulating out the equations and you’re thinking about it in a more complexity-oriented framing. Of course, again, they should be equivalent.
By our bijection assumption, the parameters in f1 must be identical to the parameters in f2.
I’m not sure what you mean by this part—f1 and f2 are just different heads, not entirely different models, so I’m not sure what you mean by “the parameters in f1.” I don’t think that a bijection assumption between weights and single-head outputs really makes sense in this context. I also definitely would say that if f1 and f2 were separate models such that they couldn’t reuse weights between them, then none of the complexity arguments that I make in the post would go through.
These constraints are necessary and sufficient to satisfy the overall constraint that f1=f2, and therefore any other parameters in θ2 are completely unconstrained and are set according to the original neural net prior.
I’m happy to accept that there are ways of setting θ1 (e.g. just make f1 and f2 identical) such that the rest of the parameters are unconstrained and just use the neural net prior. However, that’s not the only way of setting θ1—and not the most complexity-efficient, I would argue. In the defender’s argument, θ1 sets all the head-specific parameters for both f1 and f2 to enforce that f1 computes f+ and f2 computes f−, and also sets all the shared parameters for everything other than the human model, while leaving the human model to θ2, thus enforcing that θ2 specify a human model that’s correct enough to make f+=f− without having to pay any extra bits to do so.
I’m not sure what you mean by this part—f1 and f2 are just different heads, not entirely different models, so I’m not sure what you mean by “the parameters in f1.” I don’t think that a bijection assumption between weights and single-head outputs really makes sense in this context. I also definitely would say that if f1 and f2 were separate models such that they couldn’t reuse weights between them, then none of the complexity arguments that I make in the post would go through.
I assumed that when you talked about a model with “different heads” you meant that there is a shared backbone that computes a representation, that is then passed through two separate sequences of layers that don’t share any weights, and those separate sequences of layers were the “heads” f1 and f2. (I’m pretty sure that’s how the term is normally used in ML.) I might benefit from an example architecture diagram where you label what θ1,θ2,f1,f2 are.
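(For concreteness, a hypothetical sketch of the kind of two-headed architecture I mean, with invented dimensions and names:)

```python
import torch
import torch.nn as nn

class TwoHeaded(nn.Module):
    """Hypothetical sketch: a shared backbone ("part-which-shares-weights")
    feeding two heads f1 and f2 ("parts-which-don't-share-weights").  Note that
    theta1/theta2 in the discussion are *not* this split: they are two blocks of
    parameters that together fill in all of these weights, with theta1 chosen
    first and theta2 second."""

    def __init__(self, d_in: int = 16, d_hidden: int = 32, d_out: int = 4):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(d_in, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, d_hidden), nn.ReLU(),
        )
        self.head1 = nn.Linear(d_hidden, d_out)  # f1
        self.head2 = nn.Linear(d_hidden, d_out)  # f2

    def forward(self, x_and_q: torch.Tensor):
        z = self.backbone(x_and_q)
        return self.head1(z), self.head2(z)

model = TwoHeaded()
f1_out, f2_out = model(torch.randn(1, 16))
```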
I did realize that I was misinterpreting part of the math—the ∀x,q is quantifying over inputs to the overall neural net, rather than to the parts-which-don’t-share-weights. My argument only goes through if you quantify the constraint over all inputs to the parts-which-don’t-share-weights. Still, assuming that with your desired part-which-shares-weights, every possible input to parts-which-don’t-share-weights can be generated by some x,q (which seems like it will be close enough to true), the argument still suggests that conditioning on the desired part-which-shares-weights, you have just doubled the strength of the neural net prior on the parts-which-don’t-share-weights.
In the defender’s argument, θ1 sets all the head-specific parameters for both f1 and f2 to enforce that f1 computes f+ and f2 computes f−
This seems to suggest that f+ and f− are different functions, i.e. there’s some input on which they disagree. But then θ2 has to make them agree on all possible x,q. So is the idea that there are some inputs to f+, f− that can never be created with any possible x,q? That seems… strange (though not obviously impossible).
I assumed that when you talked about a model with “different heads” you meant that there is a shared backbone that computes a representation, that is then passed through two separate sequences of layers that don’t share any weights, and those separate sequences of layers were the “heads” f1 and f2.
Yep, that’s what I mean.
Still, assuming that with your desired part-which-shares-weights, every possible input to parts-which-don’t-share-weights can be generated by some x,q (which seems like it will be close enough to true), the argument still suggests that conditioning on the desired part-which-shares-weights, you have just doubled the strength of the neural net prior on the parts-which-don’t-share-weights.
Note that conditioning on the part-which-shares-weights is definitely not what the prior is doing—the only conditioning in the prior is θ2 conditioning on θ1. If we look at the intended model, however, θ1 includes all of the parts-which-don’t-share-weights, while θ2 is entirely in the part-which-shares-weights.
Technically, I suppose, you can just take the prior and condition on anything you want—but it’s going to look really weird to condition on the part-which-shares-weights having some particular value without even knowing which parts came from θ1 and which came from θ2.
I do agree that, if θ1 were to specify the entire part-which-shares-weights and leave θ2 to fill in the parts-which-don’t-share-weights, then you would get exactly what you’re describing where θ2 would have a doubly-strong neural net prior on implementing the same function for both heads. But that’s only one particular arrangement of θ1—there are lots of other θ1s which induce very different distributions on θ2.
This seems to suggest that f+,f− are different functions, i.e. there’s some input on which they disagree.
Note that the inputs to f+,f− are deduced statements, not raw data. They are certainly different functions over the space of all possible deduced statements—but once you put a correct world model in them, they should produce equivalent X×Q→A maps.
I’m not sure what you mean by this part—f1 and f2 are just different heads, not entirely different models, so I’m not sure what you mean by “the parameters in f1.”
Seems like if the different heads do not share weights then “the parameters in f1” is perfectly well-defined?
Note that conditioning on the part-which-shares-weights is definitely not what the prior is doing
Yeah, sorry, by “conditioning” there I meant “assuming that the algorithm correctly chose the right world model in the end”; I wasn’t trying to describe a particular step in the algorithm. But in any case I don’t think we need to talk about that.
They are certainly different functions over the space of all possible deduced statements—but once you put a correct world model in them, they should produce equivalent X×Q→A maps.
Okay, so iiuc you’re relying on an assumption (fact? desire?) that the world model will never produce deduced statements that distinguish between f+ and f−? My understanding of f+ and f− comes from here:
Specifically, f+ is the “honest embedding” which directly converts between logical statements and their equivalent natural language, thus answering questions by embedding q as a logical statement and unembedding its answer in deduced_stmts. Conversely, f− is the “mimicry embedding” which just searches for deductions about what a human would say in response to q and outputs that—thus, f− just quotes q, embedding it as just a string of characters for a human to respond to, rather than actually having to understand it in any meaningful way.
If f+ and f− produce equivalent X×Q→A maps, doesn’t that mean that we’ve just gotten something that can only respond as well as a human? Wouldn’t that be a significant limitation? (E.g. given that I don’t know German, if my question to the model is “what does <german phrase> mean”, does the model have to respond “I don’t know”?)
In addition, since the world model will never produce deduced statements that distinguish between f+ and f−, it seems like the world model could never produce decision-relevant deduced statements that the human wouldn’t have realized. This seems both (a) hard to enforce and (b) a huge capability hit.
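(To pin down the reading of f+ and f− from that passage, a toy rendering with an invented data structure and example question:)

```python
# Toy rendering only: the dictionary structure of deduced_stmts and the example
# question are made up for illustration.
def f_plus(deduced_stmts, q):
    # "honest embedding": read q as a logical statement and unembed its answer
    return deduced_stmts["answers"].get(q, "unknown")

def f_minus(deduced_stmts, q):
    # "mimicry embedding": look up what a human shown the string q would say
    return deduced_stmts["human_would_say"].get(q, "unknown")

deduced_stmts = {
    "answers":         {"what does 'Guten Morgen' mean?": "good morning"},
    "human_would_say": {"what does 'Guten Morgen' mean?": "I don't know"},
}
q = "what does 'Guten Morgen' mean?"
print(f_plus(deduced_stmts, q), "|", f_minus(deduced_stmts, q))
# they differ exactly on questions where the human wouldn't get the right answer
```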
Seems like if the different heads do not share weights then “the parameters in f1” is perfectly well-defined?
It seemed to me like you were using it in a way such that f1 shared no weights with f2, which I think was because you were confused by the quantification, like you said previously. I think we’re on the same page now.
Okay, so iiuc you’re relying on an assumption (fact? desire?) that the world model will never produce deduced statements that distinguish between f+ and f−?
Sorry, I was unclear about this in my last response. f+ and f− will only agree in cases where the human understands what’s happening. In the dataset version, we get that by collecting a dataset where we think the human always gets it right, whereas in the dataset-less version, we get that by including the H_understands check, which ensures that we don’t have to satisfy the condition when the human would get the question wrong.
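(In pseudocode, the check being described is roughly the implication H_understands → f+ = f−; a minimal sketch with invented stand-ins:)

```python
def condition_holds(h_understands, f1, f2, inputs):
    # the heads only have to agree where the human would get the question right
    return all((not h_understands(x, q)) or f1(x, q) == f2(x, q)
               for x, q in inputs)

# toy usage with invented stand-ins:
inputs = [("x0", "easy question"), ("x1", "hard question")]
h_understands = lambda x, q: q == "easy question"
f1 = lambda x, q: "yes"
f2 = lambda x, q: "yes" if q == "easy question" else "no"
print(condition_holds(h_understands, f1, f2, inputs))
# True: the heads only disagree on the hard question, which the human
# wouldn't answer correctly anyway
```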
It seems like this should be lower complexity than the intended result, since True has much lower complexity than H_understands?
It seemed to me like you were using it in a way such that f1 shared no weights with f2
I mean, I would still have said this because I interpret a “head” f1 as “the part after the shared layers”, but I’m also happy to instead treat f1 as the entire function X×Q→A for which the first head forms part of the implementation.
Yes—at least that’s the assumption I’m working under.
It seems like this should be lower complexity than the intended result, since True has much lower complexity than H_understands?
I agree that the θ1 you’ve described has lower complexity than the intended θ1—but the θ2 in this case has higher complexity, since θ2 is no longer getting any of its complexity for free from conditioning on the f? condition. And in fact what you’ve just described is precisely the unintended model—what I call M−—that I’m trying to compete against, with the hope being that the savings that M+ gives you in θ2 are sufficient to compensate for the loss in having to specify f+ and H_understands in θ1.
If we calculate the complexity of your proposal, we get
complexity(M−)
= complexity(θ1−) + complexity(θ2− | M−|f?)
= complexity(W−H) + complexity(f−) + complexity(H | True)
= complexity(W−H) + complexity(f−) + complexity(H)
≈ complexity(W)
whereas, if we calculate the complexity of the intended M+, we get
complexity(M+)
= complexity(θ1+) + complexity(θ2+ | M+|f?)
= complexity(W−H) + complexity(f−) + complexity(f+) + complexity(H_understands) + complexity(H | H_understands → f+ = f−)
≈ complexity(W−H) + complexity(f+) + complexity(H_understands) + complexity(H) − min_θ2 { complexity(θ2) | H_understands[H = θ2] → f+[H = θ2] = f−[H = θ2] }
≈ complexity(W) + complexity(f+) + complexity(H_understands) − min_θ2 { complexity(θ2) | H_understands[H = θ2] → f+[H = θ2] = f−[H = θ2] }
such that you can see that the question of which one wins is precisely dependent on whether the savings from conditioning on H_understands→f+=f− offsets the cost of having to specify f+ and H_understands.
such that you can see that the question of which one wins is precisely dependent on whether the savings from conditioning on H_understands→f+=f− offsets the cost of having to specify f+ and H_understands.
Yeah, that makes sense. I guess I don’t really see the intuition about why this should be true, but fair enough to leave that as an open question.
Imagine there was a bijection between model parameters and resulting function. (I’m aware this is not at all true.) In that case it seems like you are enforcing the constraint that the two heads have identical parameters.
AFAIK, I always imagined the idea behind this objective function to be quite similar to contrastive learning, where you have two networks (or equivalently two sets of parameters), and the goal is to maximize agreement for pairs of inputs to each network that have the same ground-truth class/label (and conversely to maximize disagreement for pairs that differ). With that in mind, there are various papers (e.g.) that explore the possibility of “collapsed” solutions like the one you mentioned (where both networks learn the same mapping, such that there’s little benefit to propagating examples through two networks), which is something we want to avoid. In practice, though, this has been found to occur rarely (cf. [1]).
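(For reference, a minimal sketch of the kind of contrastive objective described above, in its classic margin-based form, with made-up tensors:)

```python
import torch
import torch.nn.functional as F

def pairwise_contrastive_loss(z1, z2, same_label, margin=1.0):
    """Margin-based contrastive loss: pull together embedding pairs with the
    same label, push apart pairs with different labels."""
    d = F.pairwise_distance(z1, z2)
    pos = same_label.float() * d.pow(2)
    neg = (1 - same_label.float()) * F.relu(margin - d).pow(2)
    return (pos + neg).mean()

# toy usage: embeddings from the two networks for a batch of paired inputs
z1, z2 = torch.randn(8, 16), torch.randn(8, 16)
same_label = torch.randint(0, 2, (8,))
loss = pairwise_contrastive_loss(z1, z2, same_label)
```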
Nonetheless, since reading Paul’s statement about the problem of the instrumental model, I’ve been thinking about issues that might arise with the proposed solution, even though similar approaches (i.e. the contrastive training objective) have proven effective for robustness in general (e.g. against adversarial perturbations, or in data-limited scenarios). If I were committed to this stance, I would agree somewhat with the desire to explore alternatives, and I have thought about the extent to which some sort of reconstruction loss could be introduced; this is where the goal might instead be to “maximize agreement” with a set of non-trivial observations/facts that are guaranteed to be more “objective” (somehow) than the original training data (one inspiration being that reconstruction losses in vision deep learning papers like this one often turn out to be good regularizers). So far I haven’t had any promising proposals come to light for generative LMs.
I am still holding onto the thought, given the remote possibility that all of my above assumptions are correct, and also because “generative models” might reflect the ideal approach to unsupervised learning, whereas “contrastive learning” is sometimes seen as a sort of compromise since (unlike generative models) it’s amenable to limited compute [2].
I haven’t read the paper, but in contrastive learning, aren’t these solutions prevented by the negative examples?
It makes sense that negative pairs would help to a large extent, but not all contrastive papers used negative examples, like BYOL (ref). Edit: but now I’m realizing that this might no longer fit the definition of contrastive learning (instead just ordinary self supervised learning), so I apologize about the error/confusion in that case.
If memory serves, with BYOL you are using current representations of an input x1 to predict representations of a related input x2, but the representation of x2 comes from an old version of the encoder. So, as long as you start with a non-collapsed initial encoder, the fact that you are predicting a past encoder which is non-collapsed ensures that the current encoder you learn will also be non-collapsed.
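(A heavily stripped-down sketch of that mechanism, not the real BYOL recipe (no augmentations or projector), just the predict-an-older-EMA-target structure:)

```python
import copy
import torch
import torch.nn.functional as F

online = torch.nn.Sequential(torch.nn.Linear(16, 8), torch.nn.ReLU(), torch.nn.Linear(8, 8))
predictor = torch.nn.Linear(8, 8)
target = copy.deepcopy(online)                  # "old version of the encoder"
for p in target.parameters():
    p.requires_grad_(False)

opt = torch.optim.SGD(list(online.parameters()) + list(predictor.parameters()), lr=1e-2)
x1, x2 = torch.randn(32, 16), torch.randn(32, 16)  # stand-ins for two related views

for _ in range(10):
    pred = predictor(online(x1))                # current representations of x1 ...
    with torch.no_grad():
        targ = target(x2)                       # ... must predict old-encoder reps of x2
    loss = 1 - F.cosine_similarity(pred, targ, dim=-1).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():                       # slowly move the target toward the online net
        for pt, po in zip(target.parameters(), online.parameters()):
            pt.mul_(0.99).add_(0.01 * po)
```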
(Mostly my point is that there are specific algorithmic reasons to expect that you don’t get the collapsed solutions, it isn’t just a tendency of neural nets to avoid collapsed solutions.)
but now I’m realizing that this might no longer fit the definition of contrastive learning (instead just ordinary self supervised learning), so I apologize about the error/confusion in that case.
No worries, I think it’s still a relevant example for thinking about “collapsed” solutions.