Have you written more than 7 million words yet, @gwern? https://www.google.com/amp/s/blog.google/technology/ai/google-gemini-next-generation-model-february-2024/amp/
Basically, it’s the “lazy wait” calculation: get something to work now, or wait until the 700k-word or 7m-word context window ships.
I may have. Just gwern.net is, I think, somewhere around 2m words, and it’s not comprehensive. Also, for contradictions, I would want to detect contradictions against citations/references as well (detecting miscitations would be more important than self-consistency, IMO), and as a rough ballpark, the current Gwern.net annotation* corpus looks to be approaching 4.3m words, and is also not comprehensive. So, closer than one might think! (Anyway, that doesn’t deal with the cost or latency: as you can see in the demos, we are talking minutes, not seconds, for these million-token calls, and the price is probably going to be in the dollar+ regime per call.)
* which are not fulltext. It would be nice to throw in all of the hosted paper & book & webpage fulltexts, but then that’s probably more like 200m+ words.
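For a rough sense of how those word counts compare to token-denominated context windows, here is a back-of-envelope sketch; the ~1.3 tokens-per-English-word ratio is a common rule of thumb, not an exact figure:

```python
# Back-of-envelope, using the figures from this thread: how close is the
# Gwern.net corpus to a ~7m-word (~10m-token) context window?
TOKENS_PER_WORD = 1.3  # rough rule of thumb for English text, not exact

corpora_words = {
    "gwern.net essays": 2_000_000,   # "somewhere around 2m"
    "annotations":      4_300_000,   # "approaching 4.3m"
}

total_words = sum(corpora_words.values())
print(f"{total_words:,} words ≈ {total_words * TOKENS_PER_WORD:,.0f} tokens")
# -> 6,300,000 words ≈ 8,190,000 tokens: within a 10m-token window,
#    but adding the hosted fulltexts (200m+ words) blows far past it.
```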
There isn’t any clear technical obstruction to getting this time down pretty small with more parallelism.
There may not be any ‘clear’ technical obstruction, but it has failed badly in the past. ‘Add more parallelism’ (particularly hierarchically) is one of the most obvious ways to improve attention, and people have spent the past 5 years failing to come up with efficient attentions that do anything but move along a Pareto frontier from ‘fast but doesn’t work’ to ‘slow and works only as well as the original dense attention’. It’s just inherently difficult to know which tokens you will need across millions of tokens without input from all the other tokens (unless you are psychic); that implies extensive computation of some sort, which makes things inherently serial and costs you latency, even if you are rich enough to spend compute like water. You’ll note that when Claude-2 was demoing its ultra-long attention windows, it too spent a minute or two churning. Meanwhile, the most effective improvements in long-range attention, like FlashAttention or Ring Attention, are just hyperoptimized dense attention, which is inherently limited.
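To make the scaling concrete, here is a minimal sketch of why dense attention over millions of tokens takes minutes rather than seconds; the head dimension and the 1 PFLOP/s figure are illustrative assumptions, not any particular model's or lab's real numbers:

```python
# Dense self-attention computes an n x n score matrix, so its matmul
# FLOPs grow quadratically in sequence length n. All constants below
# are illustrative assumptions for a rough order-of-magnitude estimate.
def attention_matmul_flops(n: int, d: int = 128) -> float:
    """Approximate FLOPs for the QK^T and AV matmuls of one attention layer."""
    return 2 * 2 * (n ** 2) * d  # two n*n*d matmuls, 2 FLOPs per multiply-add

SUSTAINED_FLOPS = 1e15  # assume ~1 PFLOP/s of well-utilized accelerators

for n in (8_000, 1_000_000, 10_000_000):
    seconds = attention_matmul_flops(n) / SUSTAINED_FLOPS
    print(f"n = {n:>10,} tokens: ~{seconds:9.4f} s per layer")
# n=8,000 is negligible; n=10,000,000 is ~51 s for a single layer, and a
# real model stacks dozens of layers -- hence minutes of churn, even
# before the serial dependencies that limit how far parallelism helps.
```

FlashAttention and Ring Attention reduce memory traffic and shard this computation across devices, but the quadratic FLOP count itself is untouched, which is the ‘inherently limited’ part.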