Also, are you picturing them gaining money from crimes, then buying compute legitimately? I think the “crimes” part is hard to stop but the “paying for compute” part is relatively easy to stop.
Either legitimately or illegitimately acquiring compute could plausibly be the best route; I’m uncertain.
It doesn’t seem that easy to lock down legitimate compute to me? I think you’d need to shut down people buying/renting 8xH100-style boxes, which seems potentially quite difficult.
The model weights probably don’t fit in VRAM on a single 8xH100 device (the weights might be 5 TB when 4-bit quantized), but you can maybe fit them across 8-12 of these. And you might not need amazing interconnect (e.g. normal 1 GB/s datacenter internet is fine) for somewhat-high-latency (but decently-good-throughput) pipeline-parallel inference. (You only need to send the residual stream per token.) I might do a BOTEC on this later. Unless you’re aware of a BOTEC on this?
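(Rough sketch of the fit estimate, not from the original thread: it just assumes 80 GB of HBM per H100 and ignores KV cache and activation memory, which is roughly why you'd land in the 8-12 box range rather than exactly 8.)

```python
# How many 8xH100 boxes are needed for the weights alone, assuming 80 GB HBM per H100.
weights_tb = 5.0                  # ~5 TB of weights at 4-bit quantization
hbm_per_box_tb = 8 * 0.080        # 8 x 80 GB = 0.64 TB per box

boxes_needed = weights_tb / hbm_per_box_tb
print(f"boxes needed (weights only): {boxes_needed:.1f}")
# ~7.8 boxes; KV cache and activations push this toward the 8-12 range.
```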
I did some BOTECs on this and think 1 GB/s is sort of borderline: it probably works, but not obviously.
E.g., I assumed an MoE model that is ~10 TB at fp8, with a sparsity factor of 4 and a hidden size of 32768.
With 32 kB per token (32768 hidden dims × 1 byte at fp8), you could send at most ~30k tokens/second over a 1 GB/s interconnect. Not quite sure what a realistic link utilization would be, but maybe we halve that to 15k?
If the model was split across 20 8xH100 boxes, then each box might do ~250 GFLOP/token (2 FLOP/parameter × 10T parameters / (4 × 20), where the 4 is the MoE sparsity factor and 20 is the number of boxes), so at 15k tokens/second each box would do at most ~3.75 PFLOP/second, which might be about 20-25% utilization.
This is not bad, but for a model with much more sparsity, GPUs with a different FLOP/s : VRAM ratio, a spottier connection, etc., the bandwidth constraint might become quite harsh.
(The above is somewhat hastily reconstructed from some old sheets; I might have messed something up.)
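(For reference, a minimal Python sketch of that BOTEC, not from the thread; the ~2 PFLOP/s dense fp8 peak per H100 is my assumption for the utilization denominator:)

```python
# Pipeline-parallel inference BOTEC: is a 1 GB/s link between boxes enough?
model_bytes = 10e12        # ~10 TB of weights at fp8 (~10T parameters at 1 byte each)
sparsity = 4               # MoE: ~1/4 of parameters active per token
hidden_size = 32768        # residual stream width
num_boxes = 20             # 8xH100 boxes the model is split across
link_bytes_per_s = 1e9     # 1 GB/s between boxes

# Interconnect bound: each token crosses a link as one fp8 residual stream.
bytes_per_token = hidden_size * 1                        # ~32 kB
peak_tokens_per_s = link_bytes_per_s / bytes_per_token   # ~30k tokens/s
tokens_per_s = peak_tokens_per_s / 2                     # ~15k, guessing ~50% link utilization

# Compute per box at that token rate.
params = model_bytes                                            # 1 byte/param at fp8
flop_per_token_per_box = 2 * params / (sparsity * num_boxes)    # ~250 GFLOP/token
flop_per_s_per_box = tokens_per_s * flop_per_token_per_box      # ~3.75 PFLOP/s

# Utilization vs. an assumed ~2 PFLOP/s dense fp8 peak per H100 (~16 PFLOP/s per box).
peak_flop_per_s_per_box = 8 * 2e15
print(f"tokens/s (link-bound): {tokens_per_s:,.0f}")
print(f"GFLOP/token per box:   {flop_per_token_per_box / 1e9:.0f}")
print(f"PFLOP/s per box:       {flop_per_s_per_box / 1e15:.2f}")
print(f"utilization:           {flop_per_s_per_box / peak_flop_per_s_per_box:.0%}")
```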
For reference, just last week I rented three 8xH100 boxes without any KYC.