“We’ll be building a cluster of around 22,000 H100s. This is approximately three times more compute than what was used to train all of GPT-4.”
This bothers me. It’s a naive way of looking at compute, like confusing watts and watt-hours.
22,000 H100s gives you roughly three times the FLOP/s of the cluster used to train GPT-4, so you could train the same model in about a third of the time, or with a third of the cluster in the same amount of time.
I think this framing invites naive assumptions about what the compute can actually be used for. And FLOP/s isn’t a great unit for everyday discussion once we’re at 10¹⁵ scales. A back-of-the-envelope sketch below makes the distinction concrete.
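A minimal sketch of the watts vs watt-hours point. The per-GPU throughput and the GPT-4 training total used here are illustrative assumptions (rough public ballparks, not confirmed figures); the point is only that tripling the FLOP/s rate cuts wall-clock time by ~3x for a fixed total amount of FLOPs.

```python
# FLOP/s is a rate (like watts); total training FLOPs is an amount (like watt-hours).
# All numbers below are assumptions for illustration, not official figures.

H100_FLOPS = 1e15            # assume ~1 petaFLOP/s of sustained throughput per H100 (precision/utilization dependent)
GPT4_TRAIN_FLOPS = 2e25      # assumed total training compute for GPT-4 (rough public estimate)

def train_days(num_gpus: int, total_flops: float = GPT4_TRAIN_FLOPS) -> float:
    """Days of wall-clock time to accumulate `total_flops` at num_gpus * H100_FLOPS."""
    seconds = total_flops / (num_gpus * H100_FLOPS)
    return seconds / 86_400

baseline = train_days(22_000 // 3)   # roughly the FLOP/s attributed to GPT-4's cluster in the quote
tripled  = train_days(22_000)        # the new cluster: ~3x the rate

print(f"~1/3 cluster: {baseline:.0f} days, full cluster: {tripled:.0f} days")
# Same total FLOPs either way; the bigger cluster just gets there ~3x faster.
# It's only "3x the compute" in the watt-hours sense if you also run it for just as long.
```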
I don’t think this is a good take.