I believe Sam Altman implied that, for “GPT-4”, they’re simply training a GPT-3-variant for significantly longer. The GPT-3 model in prod is nowhere near converged on its training data.
Edit: changed to be less certain. I’m pretty sure this follows from public comments by Sam, but he has not said this exactly.
Say more about the source for this claim? I’m pretty sure he didn’t say that during the Q&A I’m sourcing my info from. And my impression is that they’re doing something more than this, both on priors (scaling laws say that optimal compute usage means you shouldn’t train to convergence; why would they start now?) and based on what he said during that Q&A.
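For the quantitative version of the scaling-laws parenthetical, here is a minimal sketch of the tradeoff, using approximate constants from Kaplan et al. (2020) and the rough C ≈ 6·N·D FLOPs rule of thumb. All the specific numbers below (the exponents, the fitted constants, the GPT-3-scale budget) are illustrative assumptions, not anything OpenAI has stated about GPT-4:

```python
# Rough sketch of the scaling-laws point: at a fixed compute budget, loss is
# minimized by a larger model stopped well short of convergence.
# Constants are approximate values from Kaplan et al. (2020); treat as illustrative.

ALPHA_N, ALPHA_D = 0.076, 0.095   # loss exponents for parameters / tokens
N_C, D_C = 8.8e13, 5.4e13         # fitted constants (parameters, tokens)

def loss(n_params: float, n_tokens: float) -> float:
    """Approximate LM loss L(N, D) from the Kaplan et al. joint fit."""
    return ((N_C / n_params) ** (ALPHA_N / ALPHA_D) + D_C / n_tokens) ** ALPHA_D

def tokens_for_budget(n_params: float, flops: float) -> float:
    """Training tokens affordable at a fixed budget, using C ~= 6 * N * D."""
    return flops / (6 * n_params)

budget = 3.1e23  # roughly GPT-3-scale training compute, in FLOPs
for n_params in (1e9, 1e10, 1.75e11, 1e12):
    d = tokens_for_budget(n_params, budget)
    print(f"N={n_params:.1e}  D={d:.1e}  "
          f"L(N,D)={loss(n_params, d):.3f}  L(N,inf)={loss(n_params, 1e30):.3f}")

# The small models end up essentially converged (L(N,D) ~= L(N,inf)) but with the
# worst loss; the best loss at this budget comes from a big model whose L(N,D) is
# still well above its converged L(N,inf) -- i.e. you stop long before convergence.
```

The point is just that, under these fits, the compute-optimal allocation deliberately leaves the big model far from convergence, so “take an existing model and train it much longer” would be a departure from compute-optimal scaling rather than an application of it.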
This is based on:
The Q&A you mention
GPT-3 not being trained on even one pass of its training dataset (rough arithmetic in the sketch after this list)
“Use way more compute” achieving outsized gains from training longer rather than from most other architectural modifications at a fixed model size (while you’re correct that a bigger model means faster training, you’re trading off against ease of deployment, and models much bigger than GPT-3 become increasingly difficult to serve in prod. Plus, we know it’s about the same size, from the Q&A)
Some experience with undertrained enormous language models underperforming relative to expectation
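As a rough sanity check on the “not even one pass” bullet, here is the back-of-the-envelope arithmetic using the approximate dataset sizes reported in the GPT-3 paper (figures rounded, and the “just keep training the same-sized model” reading of them is my inference, not something OpenAI has confirmed):

```python
# Back-of-the-envelope check on "GPT-3 never saw a full pass of its data",
# using approximate dataset sizes from the GPT-3 paper (Brown et al. 2020).
# Figures are rounded; the mixture up-weighting of smaller sources is ignored here.

corpus_tokens_b = {          # filtered dataset sizes, in billions of tokens
    "Common Crawl": 410,
    "WebText2": 19,
    "Books1": 12,
    "Books2": 55,
    "Wikipedia": 3,
}
training_tokens_b = 300      # tokens actually consumed during training (billions)

total_b = sum(corpus_tokens_b.values())
print(f"corpus ~{total_b}B tokens, trained on ~{training_tokens_b}B tokens")
print(f"=> roughly {training_tokens_b / total_b:.2f} passes over the full mix")

# ~0.6 passes overall; because the higher-quality sources were up-weighted,
# Common Crawl in particular was sampled well under one epoch, so a same-sized
# model has obvious headroom to keep training on more data and compute.
```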
This is not to say that GPT-4 won’t have architectural changes. Sam mentioned a longer context at the least. But these sorts of architectural changes probably qualify as “small” in the parlance of the above conversation.
To be clear: Do you remember Sam Altman saying that “they’re simply training a GPT-3-variant for significantly longer”, or is that an inference from ~”it will use a lot more compute” and ~”it will not be much bigger”?
Because if you remember him saying that, then that contradicts my memory (and, uh, the notes that people took that I remember reading), and I’m confused.
While if it’s an inference: sure, that’s a non-crazy guess, and I take your point that smaller models are easier to deploy. I just want it to be flagged as a claimed deduction, not as a remembered statement.
(And I maintain my impression that something more is going on; especially since I remember Sam generally talking about how models might use more test-time compute in the future, and be able to think for longer on harder questions.)
Honestly, at this point, I don’t remember if it’s inferred or primary-sourced. Edited the above for clarity.