Markov and general next-token generators work well when conditioned on text. While some models, such as BERT, can predict masked tokens, I'm not aware of models that can generate the most likely sentence to sit between a given pair of start and end prompts.
It’s worth working in the Markov setting to get a grounding for what we’re looking for. The core of a Markov model is the transition matrix $P_{ij}$, which gives the conditional probability of token $j$ immediately following token $i$. The rules of conditional probability allow us to write,
$$p(k \mid j, i) = \frac{p(j, k \mid i)}{p(j \mid i)} = \frac{p(j \mid k)\, p(k \mid i)}{p(j \mid i)}$$
This gives us the probability of a token $k$ occurring immediately between the start/end prompts. In general we’re interested in what happens if we ‘travel’ from the starting token $i$ to the ending token $j$ over $T$ time steps. Say we want to see the distribution of tokens at time step $t < T$. Then we’d write,
$$p_t(k \mid j, i) = \frac{p_{T-t}(j \mid k)\, p_t(k \mid i)}{p_T(j \mid i)} = \frac{\left(e_k^\top P^{T-t} e_j\right)\left(e_i^\top P^{t} e_k\right)}{e_i^\top P^{T} e_j}$$

where $e_i$ is the one-hot basis vector for token $i$, so that $e_i^\top P^t e_k = (P^t)_{ik} = p_t(k \mid i)$.
This shows us that we can break the conditional generation process up into a calculation over transition probabilities, and the same decomposition works for an arbitrary sequence of fixed words with gaps between them. From this perspective we’d be training a model to perform a regression over the words being generated; this is the sense in which we already use outlines to effectively create regression datasets to model arguments.
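To make the bridge formula concrete, here’s a minimal numerical sketch. The vocabulary, the transition matrix, and the choice of prompt tokens are all made up for illustration; the only assumption carried over from above is the row-stochastic convention $P_{ij} = p(j \mid i)$.

```python
import numpy as np

# Toy vocabulary and a made-up row-stochastic transition matrix:
# P[i, j] = p(token j | token i), matching the convention above.
vocab = ["the", "cat", "sat", "mat", "."]
P = np.array([
    [0.05, 0.40, 0.10, 0.40, 0.05],
    [0.10, 0.05, 0.60, 0.15, 0.10],
    [0.50, 0.10, 0.05, 0.25, 0.10],
    [0.10, 0.10, 0.10, 0.05, 0.65],
    [0.70, 0.10, 0.05, 0.10, 0.05],
])

def bridge_distribution(P, i, j, T, t):
    """Distribution over the token at step t, given the chain starts at
    token i at step 0 and is conditioned to end at token j at step T."""
    forward = np.linalg.matrix_power(P, t)[i]          # p_t(k | i) for all k
    backward = np.linalg.matrix_power(P, T - t)[:, j]  # p_{T-t}(j | k) for all k
    total = np.linalg.matrix_power(P, T)[i, j]         # p_T(j | i)
    return forward * backward / total                  # p_t(k | j, i)

# With T=2, t=1 this reduces to the single-token formula p(k|j,i) above.
# Example: distribution two steps after "the", given we must hit "." at step 4.
i, j = vocab.index("the"), vocab.index(".")
dist = bridge_distribution(P, i, j, T=4, t=2)
print(dict(zip(vocab, dist.round(3))), "sums to", dist.sum().round(3))
```

The distribution sums to one by construction, since summing the numerator over $k$ recovers $p_T(j \mid i)$.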
What would be ideal is to find a way to generalize this to a non-Markovian, preferably deep-learning, setting. This is where I’m stuck at the moment, and I’d want to understand where the SOTA is on this. The only options that immediately come to mind are tree-search over tokens or RL. From the regression point of view, it seems like you’d want to fit the ‘training data’ so that the likelihood of the bridged result is as high as possible.
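As a strawman for the tree-search option, here’s a sketch of brute-force search over middle tokens under an arbitrary next-token model, scoring only sequences that land on the required end token. The names `most_likely_bridge`, `next_token_probs`, and the toy bigram table are all hypothetical stand-ins; with a real vocabulary the exhaustive product is intractable, which is exactly why beam search, RL, or a learned bridge model would be needed.

```python
from itertools import product
import math

def most_likely_bridge(next_token_probs, vocab, start, end, length):
    """Brute-force tree search: score every middle sequence of the given
    length under an (assumed) autoregressive model, with the start and end
    tokens fixed, and return the highest-likelihood bridge. A sketch only;
    feasible solely for tiny vocabularies and short lengths."""
    best_seq, best_logp = None, -math.inf
    for middle in product(vocab, repeat=length):
        seq = (start,) + middle + (end,)
        logp = 0.0  # log p(seq[1:] | seq[0]) under the next-token model
        for n in range(1, len(seq)):
            p = next_token_probs(seq[:n]).get(seq[n], 0.0)
            if p == 0.0:
                logp = -math.inf
                break
            logp += math.log(p)
        if logp > best_logp:
            best_seq, best_logp = seq, logp
    return best_seq, best_logp

# Toy bigram "model" with made-up numbers: next_token_probs(prefix) -> {token: prob}.
table = {"a": {"a": 0.2, "b": 0.8}, "b": {"a": 0.6, "b": 0.4}}
def next_token_probs(prefix):
    return table[prefix[-1]]

print(most_likely_bridge(next_token_probs, ["a", "b"], "a", "b", length=2))
```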