If we’re taking seriously the idea that arguments are paths in a topological space, I feel like conditioned language models are going to be really important. We already use outlines to effectively create regression datasets to model arguments. It seems like modifying GPT-2 so that you can condition on start/end prompts would be incredibly helpful here. More speculatively, I think that GPT-2 is near the best we’ll ever get at next-word prediction. Humans use outline-like thinking much more often than is commonly supposed.
I think it’s worth taking a look at what’s out there:
- SpanBERT
  - Uses random spans to do masked pre-training (a minimal sketch of span masking follows this list)
  - Seems to indicate that using longer spans is inherently difficult
- Distillation of BERT Models
- BERT embeddings are hierarchical
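For concreteness, here is a minimal sketch of the span-masking idea mentioned above. The uniform span lengths, 15% masking budget, and `[MASK]` placeholder are simplified assumptions for illustration, not SpanBERT’s exact recipe.

```python
# Rough illustration of span masking: mask contiguous spans of tokens rather
# than masking individual tokens independently. Hyperparameters are made up.
import random

def mask_random_spans(tokens, mask_token="[MASK]", mask_ratio=0.15, max_span=5):
    tokens = list(tokens)
    budget = max(1, int(len(tokens) * mask_ratio))  # rough number of tokens to mask
    masked = 0
    while masked < budget:
        span_len = min(random.randint(1, max_span), budget - masked)
        start = random.randrange(0, len(tokens) - span_len + 1)
        for pos in range(start, start + span_len):
            tokens[pos] = mask_token
        masked += span_len  # spans may overlap; good enough for a sketch
    return tokens

print(mask_random_spans("the quick brown fox jumps over the lazy dog".split()))
```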
Markov and general next-token generators work well when conditioned on text. While some models, such as BERT, can predict masked tokens, I’m not aware of models that can generate the most likely sentence that would sit between a given start/end prompt pair.
It’s worth working in the Markov setting to get a grounding for what we’re looking for. The core of a Markov model is the transition matrix $P_{ij}$, which gives the conditional probability of token $j$ following immediately after token $i$. The rules of conditional probability allow us to write,
$$p(k \mid j, i) = \frac{p(j, k \mid i)}{p(j \mid i)} = \frac{p(j \mid k)\, p(k \mid i)}{p(j \mid i)}$$
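As a quick numerical check of this identity, here is a minimal sketch. The 3-token toy transition matrix, the token indices, and the function name are illustrative assumptions rather than anything from the text; the matrix is row-stochastic with $P_{ij} = p(j \mid i)$, matching the definition above.

```python
# p(k | start=i, end=j) for the length-3 sequence i -> k -> j under a Markov chain.
import numpy as np

def middle_token_distribution(P, i, j):
    joint = P[i, :] * P[:, j]   # element k is p(k | i) * p(j | k)
    return joint / joint.sum()  # joint.sum() is the two-step probability p(j | i)

P = np.array([[0.1, 0.6, 0.3],   # made-up row-stochastic transition matrix
              [0.2, 0.2, 0.6],
              [0.5, 0.3, 0.2]])
print(middle_token_distribution(P, i=0, j=2))  # distribution over the middle token k
```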
This formula gives the probability of a token $k$ occurring immediately between the start/end prompts. In general, we’re interested in what happens if we ‘travel’ from the starting token $i$ to the ending token $j$ over $T$ time steps. Say we want to see the distribution of tokens at time step $t < T$. Then we’d write,
$$p_t(k \mid j, i) = \frac{p_{T-t}(j \mid k)\, p_t(k \mid i)}{p_T(j \mid i)} = \frac{\left(e_k^\top P^{T-t} e_j\right)\left(e_i^\top P^{t} e_k\right)}{e_i^\top P^{T} e_j},$$

where $e_a$ denotes the one-hot indicator vector for token $a$.
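Here is a sketch of the $T$-step version, using matrix powers for the multi-step transition probabilities. The same made-up row-stochastic matrix and names as in the previous sketch are assumptions; with $T=2$, $t=1$ this reduces to the single-middle-token formula above.

```python
# Distribution over tokens at intermediate step t, conditioned on starting at
# token i (step 0) and ending at token j (step T), for a row-stochastic P.
import numpy as np

def bridge_distribution(P, i, j, T, t):
    forward = np.linalg.matrix_power(P, t)[i, :]        # p_t(k | i)
    backward = np.linalg.matrix_power(P, T - t)[:, j]   # p_{T-t}(j | k)
    total = np.linalg.matrix_power(P, T)[i, j]          # p_T(j | i)
    return forward * backward / total

P = np.array([[0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6],
              [0.5, 0.3, 0.2]])
dist = bridge_distribution(P, i=0, j=2, T=5, t=2)
print(dist, dist.sum())  # a proper distribution: sums to 1
```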
This decomposition shows that we can break the conditional generation process into a calculation over transition probabilities. We could write this out for an arbitrary sequence of separated words. From this perspective, we’d be training a model to perform a regression over the words being generated. This is the sense in which we already use outlines to effectively create regression datasets to model arguments.
What would be ideal is a way to generalize this to a non-Markovian, preferably deep-learning, setting. This is where I’m stuck at the moment; I’d want to understand where the SOTA is on this. The only options that immediately come to mind are tree search over tokens or RL. From the regression point of view, it seems like you’d want to fit the ‘training data’ so that the likelihood of the result is as high as possible.
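To make the tree-search/rerank option concrete, here is a hedged sketch (not SOTA, and not a method from this post): sample a few middles from the start prompt with GPT-2, then rerank them by how likely the end prompt is after each one. It assumes the Hugging Face `transformers` and `torch` packages; the prompts, sampling settings, and the rerank-by-end-prompt scoring are illustrative choices of mine.

```python
# Sketch: sample candidate middles from the start prompt, then score each by the
# log-likelihood GPT-2 assigns to the end prompt following it.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

start = "The main reason to take this argument seriously is"   # illustrative prompt
end = " and that is why outlines matter."                       # illustrative end prompt

start_ids = tokenizer(start, return_tensors="pt").input_ids
end_ids = tokenizer(end, return_tensors="pt").input_ids

candidates = model.generate(
    start_ids,
    do_sample=True,
    top_k=50,
    max_new_tokens=20,
    num_return_sequences=8,
    pad_token_id=tokenizer.eos_token_id,
)

def end_prompt_log_likelihood(candidate_ids):
    """Total log-likelihood of the end prompt given start + sampled middle."""
    full = torch.cat([candidate_ids.unsqueeze(0), end_ids], dim=1)
    labels = full.clone()
    labels[:, : candidate_ids.shape[0]] = -100  # -100 labels are ignored, so only end-prompt tokens are scored
    with torch.no_grad():
        loss = model(full, labels=labels).loss   # mean NLL over the end-prompt tokens
    return -loss.item() * end_ids.shape[1]

best = max(candidates, key=end_prompt_log_likelihood)
print(tokenizer.decode(best))
```

This is only a rerank over whole sampled middles; a real tree search would score partial continuations against the end prompt as it expands them.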
I don’t know if this is already known, but you might be interested in the fact that you can currently use start prompts for GPT-2.
I’m aware of this. I’m slowly piecing together what I’m looking for, in case you decide to follow this.