Well, then computing ℓnovel would just take a really long time.
So, it’s not impossible in principle if you trained the loss function as I suggested (train it by reinforcement learning, then use it to train the actual novel-generating model), but it is a totally impractical approach.
If you really wanted to teach an AI to generate good novels, you’d probably start by training an LLM to imitate existing novels through some sort of predictive loss (e.g., categorical cross-entropy on next-token prediction) to give it a good prior. Then train another LLM to predict reader reviews or dissertations written by literature grad students, using the novels they’re based on as inputs, again with a similar predictive loss. (Pretraining both LLMs on a large corpus, as with GPT, would probably help supply the necessary cultural context.) At the same time, use Mechanical Turk to get thousands of people to rate the sentiment of every review/dissertation, then train a third LLM to predict the sentiment scores of all raters (or a low-dimensional projection of all their ratings), taking the reviews/dissertations as input and using something like an MSE loss on the predicted sentiment scores. Then chain these latter two networks together to compute ℓnovel, which acts as the prune to the first network’s babble, and train to convergence.
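Here’s a minimal sketch, in PyTorch-flavored Python, of what that chained scoring step might look like. `review_model`, `sentiment_model`, and the `.generate(...)` interface are hypothetical stand-ins for the two scorer LLMs described above, not real library APIs.

```python
import torch

def l_novel(novel_tokens, review_model, sentiment_model, n_reviews=8):
    """Score a generated novel: predict reviews of it, then predict reader sentiment.

    `review_model` and `sentiment_model` are the two frozen scorer LLMs
    described above (hypothetical interfaces). Higher predicted sentiment
    means a better novel, so we negate it to get a loss.
    """
    with torch.no_grad():  # the scorers stay frozen while the generator trains
        # Sample a handful of plausible reviews/dissertations conditioned on the novel.
        reviews = [review_model.generate(novel_tokens) for _ in range(n_reviews)]
        # Predict the (low-dimensional) sentiment vector for each review.
        sentiments = torch.stack([sentiment_model(review) for review in reviews])
    # Average sentiment across sampled reviews; lower loss = better novel.
    return -sentiments.mean()
```

Note that sampling reviews with `.generate` is non-differentiable, so in practice this score would have to prune the generator’s babble via an RL-style update or a differentiable surrogate rather than plain backpropagation, which is part of why the whole scheme is so impractical.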
Honestly, though, I probably still wouldn’t trust the resulting system to produce good novels (or at least not with internally consistent plots, characterizations, and themes) if the LLMs were based on a Transformer architecture.
Interesting—why is that?
Mostly due to the limited working memory that Transformers typically use (e.g., a context window of only the most recent 512 tokens feeding into the decoder). When humans write novels, they have to keep track of plot points, character sheets, thematic arcs, etc. across tens of thousands of words. You could probably get it to work, though, if you augmented the LLM with content-addressable memory and included a positional encoding that is aware of where in the novel (percentage-wise) each token resides.
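For that last idea, here’s a minimal sketch of a percentage-aware positional encoding, assuming a sinusoidal encoding of each token’s fractional position in the full manuscript (the function name and frequency choice are mine, not anything standard):

```python
import math
import torch

def fractional_position_encoding(token_index, novel_length, dim=16):
    """Encode a token's fractional position in the whole novel as a `dim`-vector.

    This would supplement the usual per-window positional encoding: the model
    gets a signal for "I'm 33% of the way through the book", not just
    "I'm the 207th token in the current window".
    """
    frac = torch.tensor(token_index / novel_length)                 # value in [0, 1]
    freqs = 2.0 ** torch.arange(dim // 2, dtype=torch.float32)      # geometrically spaced frequencies
    angles = 2 * math.pi * freqs * frac
    return torch.cat([torch.sin(angles), torch.cos(angles)])

# e.g. token 40,000 of a 120,000-token novel:
enc = fractional_position_encoding(40_000, 120_000)                 # tensor of shape (16,)
```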