It sounds like some people have an intuition that the mental algorithms “sample from a conditional generative model” and “search for the argmax / epsilon-close-to-argmax input to a scoring function” are effectively the same. I don’t share that intuition and struggle to communicate across that divide. Like, when I think about it through ML examples (GPT, diffusion models, etc.), those are two very different pieces of code that produce two very different kinds of outputs.
I believe sampling from a conditional distribution is basically equivalent to adding a “cost of action” (where “action” = deviating from the generative model) to argmax search.
If you have time, I think it’d be valuable for you to make a case for that.
Suppose M is your prior distribution, u is your utility function, and you are selecting some policy distribution Π so as to maximize E(u|Π) − KL(Π||M). Here the first term represents the standard utility-maximization objective, whereas the second term represents a cost of action. This expands into ∫(u − log(Π/M)) dΠ, and maximizing it is the same as minimizing ∫ log(Π/(M·e^u)) dΠ, i.e. KL(Π||M·e^u) up to a normalizing constant, which is minimized when Π ∝ M·e^u. (I think; I’m rusty on this math so I might have made a mistake.)
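To gut-check the algebra, here’s a small numerical sketch (mine, not from the thread): on a toy discrete outcome space, the claimed maximizer Π ∝ M·e^u should beat every other distribution on E(u|Π) − KL(Π||M).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy discrete outcome space: a prior M and a utility u over five outcomes.
M = np.array([0.4, 0.3, 0.15, 0.1, 0.05])  # prior distribution M
u = np.array([0.0, 1.0, 3.0, 0.5, 2.0])    # utility u of each outcome

def objective(pi):
    """E(u|pi) - KL(pi||M), the quantity being maximized over distributions pi."""
    return np.dot(pi, u) - np.dot(pi, np.log(pi / M))

# Claimed maximizer: pi proportional to M * e^u.
pi_star = M * np.exp(u)
pi_star /= pi_star.sum()

# No randomly drawn distribution should beat pi_star on the objective.
candidates = rng.dirichlet(np.ones(len(M)), size=10_000)
assert all(objective(p) <= objective(pi_star) + 1e-9 for p in candidates)
print(pi_star, objective(pi_star))
```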
This is not 100% equivalent to letting Π be a Bayesian conditioned version of M because Bayesian conditioning involves multiplying M by an indicator function whereas this involves multiplying M by a strictly positive function, but it seems related and probably shares most of its properties.
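To illustrate that difference, continuing the toy sketch above (with a hypothetical utility threshold of 2 standing in for the conditioned-on event):

```python
# Bayesian conditioning on the event {u >= 2}: reweight M by an indicator,
# which zeroes out every outcome where the event fails.
conditioned = M * (u >= 2.0)
conditioned /= conditioned.sum()

# The Pi ∝ M·e^u solution: reweight M by a strictly positive function, which
# only downweights low-utility outcomes rather than eliminating them.
soft = M * np.exp(u)
soft /= soft.sum()

print(conditioned)  # zeros wherever u < 2
print(soft)         # strictly positive everywhere
```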
The two of us went back and forth in DMs on this for a bit. Based on that conversation, I think a mutually-agreeable translation of the above argument would be “sampling from [the conditional distribution of X-es given the Y label] is the same as sampling from [the distribution that has maximum joint [[closeness to the distribution of X-es] and [prevalence of Y-labeled X-es]]]”. Even if this isn’t exact, I buy that as at least morally true.
However, I don’t think this establishes the claim I’d been struggling with, which was that there’s some near equivalence between drawing a conditional sample and argmax searching over samples (possibly w/ some epsilon tolerance). The above argument establishes that we can view conditioning itself as the solution to a maximization problem over distributions, but not that we can view conditional sampling as the solution to any kind of maximization problem over samples.
I would also add that the key exciting things happen when you condition on an event with extremely low probability / have a utility function with an extremely wide range of available utilities. cfoster0's view is that this will mostly just cause it to fail/output nonsense, because of standard arguments along the lines of the Optimizer’s Curse. I agree that this could happen, but I think it depends on the intelligence of the argmaxer/conditioner, and that another possibility (if we had more capable AI) is that this sort of optimization/conditioning could have a lot of robust effects on reality.
I can’t see a clear mistake in the math here, but it seems fairly straightforward to construct a counterexample to the equivalence the math naively points to.
Suppose we want to use GPT-3 to generate a 600-token essay praising some company X. Here are two ways we might do this (both sketched in code just below):
Prompt GPT-3 to generate the essay, sample 5 continuations, and then use a sentiment classifier to select the completion with the most positive sentiment.
Prompt GPT-3 to generate the essay, then score every possible continuation by the classifier’s sentiment score plus λ times the continuation’s log-probability under the model (the single-sample analogue of the “cost of action” for deviating from it), and take the highest-scoring one.
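To make the contrast concrete, here is a rough sketch of the two procedures. `sample_fn`, `sentiment_fn`, and `logprob_fn` are placeholders for whatever sampling, sentiment-scoring, and log-probability calls you have available (e.g., the GPT-3 API plus some sentiment classifier); they are assumptions, not real library functions.

```python
def best_of_n(prompt, sample_fn, sentiment_fn, n=5):
    """Method 1: sample n continuations, keep the one the classifier rates most positive."""
    continuations = [sample_fn(prompt) for _ in range(n)]
    return max(continuations, key=sentiment_fn)

def penalized_argmax(prompt, candidates, sentiment_fn, logprob_fn, lam=1.0):
    """Method 2: argmax of sentiment + lam * logprob.

    The thread's version ranges over every possible 600-token continuation, which is
    intractable; a finite candidate pool stands in for "every possible continuation" here.
    """
    return max(candidates, key=lambda t: sentiment_fn(t) + lam * logprob_fn(prompt, t))
```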
I expect that the first method will mostly give you reasonable results, assuming you use text-davinci-002. However, I think the second method will tend to give you extremely degenerate solutions such as “good good good good...” for 600 tokens.
One possible reason for this divide is that GPTs aren’t really a prior over language, but a prior over single-token continuations of a given natural-language context. When you try to make one act like a prior over an entire essay, you expose it to inputs that are very OOD relative to the distribution it’s calibrated to model, including inputs whose probabilities it significantly overestimates.
However, I think a “perfect” model of human language might actually assign higher prior probability to a continuation like “good good good...” (or maybe something like “X is good because X is good because X is good...”) than to any particular “natural” continuation, provided the continuations are long enough. The number of possible natural continuations is roughly exponential in the length of the continuation (assuming entropy per character remains ~constant), so any individual natural continuation has probability that shrinks exponentially with length, whereas there are far fewer possible degenerate continuations (their entropy decreases very quickly). While the probability of entering a degenerate continuation may be very low, the reduced branching factor makes up for it.
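A back-of-envelope version of this, with entirely made-up numbers just to show the orders of magnitude involved:

```python
tokens = 600
# Made-up illustrative rates: natural English costs ~4 nats/token under the model,
# while a degenerate loop costs ~15 nats to enter and ~0.1 nats/token thereafter.
logp_one_natural_essay = -4.0 * tokens               # any single natural continuation
logp_degenerate_loop   = -15.0 - 0.1 * (tokens - 1)  # one specific "good good good..." string

print(logp_one_natural_essay)  # -2400.0 nats
print(logp_degenerate_loop)    # -74.9 nats: far more probable than any one natural essay
```

Natural continuations collectively still carry almost all of the probability mass; the point is just that no individual natural continuation can compete with the handful of degenerate ones.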
The error is that the KL-divergence term doesn’t amount to adding a per-continuation cost (or bonus) proportional to the continuation’s log probability. In fact it isn’t expressible at all in terms of argmaxing over a single continuation; it requires argmaxing over a distribution of continuations.
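One way to spell out that distinction (my notation, following the earlier derivation):

```latex
% What the counterexample argmaxes: a score over individual continuations x
\operatorname*{argmax}_{x}\; \big[\, u(x) + \lambda \log M(x) \,\big]

% What the KL-regularized objective actually is: a maximization over distributions \Pi
\operatorname*{argmax}_{\Pi}\; \big[\, \mathbb{E}_{\Pi}[u] - \mathrm{KL}(\Pi \,\|\, M) \,\big]
  \;=\; \operatorname*{argmax}_{\Pi}\; \big[\, \mathbb{E}_{\Pi}[u + \log M] + H(\Pi) \,\big]
```

The entropy term H(Π) is exactly what no per-continuation score can express, and the solution Π ∝ M·e^u is a whole distribution to sample from, not a single best continuation.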
(Haven’t double-checked the math or fully grokked the argument behind it, but strongly upvoted for making a case.)
I would be curious to know if it makes sense to anyone or if anyone agrees/disagrees.
Seems like you can always implement any function f: X → Y as a search process. For any input x from the domain X, just make the search objective assign one to f(x) and zero to everything else, then argmax over this objective.
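For instance (a throwaway sketch, not anyone’s proposed algorithm):

```python
def as_search(f, x, candidates):
    """Implement f by argmax: the objective scores f(x) as 1 and everything else as 0.

    `candidates` is assumed to contain f(x); this is purely illustrative.
    """
    objective = lambda y: 1 if y == f(x) else 0
    return max(candidates, key=objective)

# as_search(lambda n: n * n, 7, range(100)) returns 49.
```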
Yes, but my point translates things differently, in a way that seems to let various standard arguments about argmax carry over to (infect) conditioning, whereas your proposed equivalence doesn’t really give standard argmax arguments any way to transfer.