Networking 500 V100 together is one challenge, but networking 500k V100s is another entirely.
Even if you might have trouble networking a 100x larger system together for training, you can train the smaller network 100x and stitch answers together using ensemble methods, and make decent use of the extra compute. It may not be as good as growing the network that full factor, but if you have extra compute beyond the cap of whatever connected-enough training system size you can muster, there are worse ways to spend it.
I am somewhat more prone to think that more selective attention (e.g. Big Bird’s block-random attention model) could bring down the quadratic cost of the window size quickly enough to be a factor here. Replacing a quadratic term with a linear or n log n or heck even a n^1.85 term goes a long way when billions are on the table.
Even if you might have trouble networking a 100x larger system together for training, you can train the smaller network 100x and stitch answers together using ensemble methods, and make decent use of the extra compute. It may not be as good as growing the network that full factor, but if you have extra compute beyond the cap of whatever connected-enough training system size you can muster, there are worse ways to spend it.
I am somewhat more prone to think that more selective attention (e.g. Big Bird’s block-random attention model) could bring down the quadratic cost of the window size quickly enough to be a factor here. Replacing a quadratic term with a linear or n log n or heck even a n^1.85 term goes a long way when billions are on the table.