Tomás B. comments on Are we in an AI overhang?

Tomás B. 27 Jul 2020 14:53 UTC
LW: 9 AF: 3
AF
One thing we have to account for is advances in architecture even in a world where Moore’s law is dead, to what extent memory bandwidth is a constraint on model size, etc. You could rephrase this as how much of an “architecture overhang” exists. One frame to view this through is that in the era of Moore’s law we sort of banked a lot of parallel architectural advances, as we lacked a good use case for such things. We now have such a use case. So the question is how much performance is sitting in the bank, waiting to be pulled out in the next 5 years.
I don’t know how seriously to take the AI ASIC people, but they are claiming very large increases in capability, on the order of 100-1000x in the next 10 years. If this is true, it is a multiplier on top of increased investment. See this response from a panel including big-wigs at NVIDIA, Google, and Cerebras about projected capabilities: https://youtu.be/E__85F_vnmU?t=4016. On top of this, one has to account, too, for algorithmic advancement: https://openai.com/blog/ai-and-efficiency/
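To make the "multiplier on top of" point concrete, here is a rough sketch of how independent gains would compound. Only the 100-1000x hardware range comes from the panel claim above. The investment and algorithmic factors are placeholder assumptions chosen for illustration, not estimates, and treating the three factors as independent is itself an assumption.

```python
import math


def effective_compute_multiplier(hardware: float, investment: float, algorithmic: float) -> float:
    """Combined gain if the three factors are independent and simply multiply."""
    return hardware * investment * algorithmic


def orders_of_magnitude(multiplier: float) -> float:
    """Express a multiplier as a power of ten."""
    return math.log10(multiplier)


# Claimed by the ASIC panel: 100-1000x over roughly 10 years.
HARDWARE_RANGE = (100.0, 1000.0)

# Hypothetical placeholders, NOT taken from the comment or its links.
INVESTMENT_GAIN = 10.0
ALGORITHMIC_GAIN = 10.0

low = effective_compute_multiplier(HARDWARE_RANGE[0], INVESTMENT_GAIN, ALGORITHMIC_GAIN)
high = effective_compute_multiplier(HARDWARE_RANGE[1], INVESTMENT_GAIN, ALGORITHMIC_GAIN)

print(f"combined multiplier: {low:,.0f}x to {high:,.0f}x")
print(f"orders of magnitude: {orders_of_magnitude(low):.1f} to {orders_of_magnitude(high):.1f}")
```

Because the factors multiply, even modest placeholder values for investment and algorithms add whole orders of magnitude on top of the hardware claim.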
Another thing to note is that although, by parameter count, the largest modern models are 10000x smaller than the human brain (if one buys the parameter >= synapse idea, which most don’t, but which is not entirely off the table), the temporal resolution is far higher. So once we get human-sized models, they may be trained almost comically faster than human minds are. So on top of an architecture overhang we may have this “temporal resolution overhang”, too, where once models are as powerful as the human brain they will almost certainly be trained much faster. And on top of this there is an “inference overhang”: because inference is much, much cheaper than training, once you are done training an economically useful model, you will almost tautologically have a lot of compute to exploit it with.
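A back-of-the-envelope sketch of the "inference overhang", under assumptions that are not in the comment itself: the common rule of thumb that training a dense Transformer costs about 6 FLOPs per parameter per token, and that generating one token costs about 2 FLOPs per parameter. The GPT-3-like figures (175B parameters, 300B training tokens) and the 30-day training run are illustrative inputs, and the sketch ignores the fact that inference usually gets lower hardware utilization than training.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Rule-of-thumb training cost: ~6 FLOPs per parameter per training token."""
    return 6.0 * n_params * n_tokens


def inference_flops_per_token(n_params: float) -> float:
    """Rule-of-thumb forward-pass cost: ~2 FLOPs per parameter per token."""
    return 2.0 * n_params


def tokens_from_training_budget(n_params: float, n_tokens: float) -> float:
    """Tokens you could generate by spending the training budget on inference."""
    return training_flops(n_params, n_tokens) / inference_flops_per_token(n_params)


# Illustrative GPT-3-like inputs (assumptions for this sketch).
N_PARAMS = 175e9
N_TRAIN_TOKENS = 300e9
TRAINING_DAYS = 30.0  # hypothetical wall-clock duration of the training run

generated = tokens_from_training_budget(N_PARAMS, N_TRAIN_TOKENS)
ratio = generated / N_TRAIN_TOKENS
tokens_per_second = generated / (TRAINING_DAYS * 24 * 3600)

print(f"tokens from one training budget: {generated:.2e}")
print(f"multiple of the training set: {ratio:.1f}x")
print(f"sustained rate on the same cluster: {tokens_per_second:,.0f} tokens/s")
```

Under these rules of thumb the parameter count cancels out: whatever cluster trained the model can generate roughly three times the training set in the same wall-clock time, which is the sense in which the compute to exploit the model is there almost by construction.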
Hopefully I am just being paranoid (I am definitely more of a squib than a wizard in these domains), but I am seeing overhangs everywhere!
- gwern 27 Jul 2020 15:34 UTC
  LW: 27 AF: 8
  AF Parent
  
  As an aside, though it’s not mentioned in the paper, I feel like this could be because the scaling analysis was done on 1024-token sequences. Maybe longer sequences can go further. More likely I’m misunderstanding something.
  
  The GPT architecture isn’t even close to being the best Transformer architecture anyway. As an example, someone benchmarked XLNet (over a year old) last week (which has recurrency, one of the ways to break GPT’s context window bottleneck), and it achieves ~10x better parameter efficiency (a 0.4b-parameter XLNet model ~ 5b GPT-3 model) at the few-shot meta-learning task he tried.
  
  Expanding to 2048 BPEs probably buys GPT-3 more headroom (more useful data to learn from, and more for the meta-learning to condition on), and expanding to efficient attentions/recurrency/memory will enable even better prediction performance, with unknown meta-learning or generalization consequences.
  
  (The problem there is the tradeoff between compute efficiency of training and better architectures. It’s not obvious where you want to go: GShard, for example, takes the POV that even GPT is too fancy and slow and inefficient to train on existing hardware, and goes with the mixture-of-expert small Transformers approach, which is even more drastically parameter-inefficient but efficient to train on GPUs!)
- Veedrac 27 Jul 2020 16:25 UTC
  LW: 21 AF: 7
  AF Parent
  Moore’s Law is not dead. I could rant about the market dynamics that made people think otherwise, but it’s easier just to point to the data.
  https://docs.google.com/spreadsheets/d/1NNOqbJfcISFyMd0EsSrhppW7PT6GCfnrVGhxhLA5PVw
  Moore’s Law might die in the near future, but I’ve yet to hear a convincing argument for when or why. Even if it does die, Cerebras presumably has at least 4 node shrinks left in the short term (16nm→10nm→7nm→5nm→3nm) for a >10x density scaling, and many sister technologies (3D stacking, silicon photonics, new non-volatile memories, cheaper fab tech) are far from exhausted. One can easily imagine a 3nm Cerebras waffle coated with a few layers of Nantero’s NRAM, with a few hundred of these connected together using low-latency silicon photonics. That would easily train quadrillion parameter models, using only technology already on our roadmap.
  Alas, the nature of technology is that while there are many potential avenues for revolutionary improvement, only some small fraction of them win. So it’s probably wrong to look at any specific unproven technology as a given path to 10,000x scaling. But there are a lot of similarly revolutionary technologies, and so it’s much harder to say they will all fail.
  - Tomás B. 11 Mar 2021 15:27 UTC
    4 points
    Parent
    Your estimates of hardware advancement seem higher than most people’s. I’ve enjoyed your comments on such things and think there should be a high-level, full-length post on them, especially with widely respected posts claiming much longer times until human-level hardware. Would be willing to subsidize such a thing if you are interested. Would pay 500 USD to yourself or a charity of your choice for a post on the potential of ASICs, Moore’s law, how quickly we can overcome the memory bandwidth bottlenecks and such things. Would also subsidize a post estimating an answer to this question, too: https://www.lesswrong.com/posts/7htxRA4TkHERiuPYK/parameter-vs-synapse
    - Veedrac 17 Mar 2021 20:38 UTC
      1 point
      Parent
      There’s a lot worth saying on these topics, I’ll give it a go.
      - Tomás B. 3 Apr 2021 15:19 UTC
        1 point
        Parent
        Just posting in case you did not get my PM. It has my email in it.
        Veedrac 7 Apr 2021 22:31 UTC
        2 points
        Parent
        Thanks, I did get the PM.
        Evenflair 24 Aug 2021 22:57 UTC
        3 points
        Parent
        Was this ever posted?
        Tomás B. 11 Dec 2021 17:15 UTC
        4 points
        Parent
        Now posted: https://www.lesswrong.com/posts/aNAFrGbzXddQBMDqh/moore-s-law-ai-and-the-pace-of-progress
        Veedrac 25 Aug 2021 14:55 UTC
        3 points
        Parent
        No, sorry.
        gwern 25 Aug 2021 16:59 UTC
        6 points
        Parent
        Might be worth getting around to it:
        
        https://semianalysis.com/tesla-dojo-ai-super-computer-unique-packaging-and-chip-design-allow-an-order-magnitude-advantage-over-competing-ai-hardware/
        
        https://spectrum.ieee.org/cerebras-ai-computers https://www.servethehome.com/cerebras-wafer-scale-engine-2-wse-2-at-hot-chips-33/
        
        ‘“From talking to OpenAI, GPT-4 will be about 100 trillion parameters,” Feldman says. “That won’t be ready for several years.”’
        
  - maximkazhenkov 27 Jul 2020 18:56 UTC
    4 points
    AF Parent
    Is density even relevant when your computations can be run in parallel? I feel like price-performance will be the only relevant measure, even if that means slower clock cycles.
    - Veedrac 27 Jul 2020 20:24 UTC
      LW: 10 AF: 3
      AF Parent
      Density is important because it affects both price and communication speed. These are the fundamental roadblocks to building larger models. If you scale to clusters of computers that are too large, or rely primarily on high-density off-chip memory, you spend most of your time waiting for data to arrive in the right place.
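As a rough check on the node-shrink arithmetic in Veedrac's 27 Jul 2020 comment above (four shrinks, 16nm→10nm→7nm→5nm→3nm, for >10x density), the sketch below works out what each shrink has to deliver. The "naive" figure assumes density scales with the inverse square of the node name, which is only an idealization: node names are marketing labels rather than literal feature sizes, so treat it as an upper-end illustration, not a prediction.

```python
# Node sequence quoted in the comment, in nanometres.
NODES_NM = [16, 10, 7, 5, 3]

# The density gain the comment claims is reachable (">10x").
TARGET_DENSITY_GAIN = 10.0


def shrink_count(nodes: list[int]) -> int:
    """Number of node-to-node transitions in the sequence."""
    return len(nodes) - 1


def required_gain_per_shrink(total_gain: float, shrinks: int) -> float:
    """Per-shrink density gain needed to reach total_gain, compounded evenly."""
    return total_gain ** (1.0 / shrinks)


def naive_density_gain(first_nm: float, last_nm: float) -> float:
    """Idealized gain if density scaled as 1 / (node name)^2. An assumption."""
    return (first_nm / last_nm) ** 2


shrinks = shrink_count(NODES_NM)
per_shrink = required_gain_per_shrink(TARGET_DENSITY_GAIN, shrinks)
naive = naive_density_gain(NODES_NM[0], NODES_NM[-1])

print(f"shrinks: {shrinks}")
print(f"needed per shrink for 10x: {per_shrink:.2f}x")
print(f"naive ideal scaling 16nm -> 3nm: {naive:.1f}x")
```

On these numbers, each shrink only needs to deliver a bit under 1.8x density to clear 10x overall, well short of the idealized scaling, which is consistent with the comment treating >10x as a conservative figure.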