Well, it will scale linearly until it hits the finite node-to-node bandwidth limit… just like all other supercomputers. If you have your model training on n different nodes, you still need to share all your weights with all the other nodes at some point, which is fundamentally an O(n²) operation; it just appears linear while you're spending more time computing your weight updates than you are communicating with other nodes. I don't see this really being a qualitative jump, but it might well be one more point to add to the graph of increasing compute power dedicated to AI.
Hmm, I see how that would happen with other architectures, but I’m a bit confused how this is O(n²) here? Andromeda has the weight updates computed by a single server (MemoryX) and then distributed to all the nodes. Wouldn’t this be a one-to-many broadcast with O(n) transmission time?
You’re completely right; I don’t know how I missed that. I must be more tired than I thought I was.
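For concreteness, here is a minimal back-of-the-envelope sketch of the two cost models being compared in this exchange: naive all-to-all weight sharing versus a single central server collecting gradients and broadcasting updated weights (as described for MemoryX above). The function names, node counts, and model size are illustrative assumptions, not Andromeda's actual figures.

```python
# Rough communication-cost sketch (illustrative assumptions only, not
# Cerebras/Andromeda's real numbers).

def all_to_all_bytes(n_nodes: int, weight_bytes: int) -> int:
    """Naive all-to-all: every node sends its full copy of the weights (or
    gradients) to every other node, so total traffic grows as O(n^2)."""
    return n_nodes * (n_nodes - 1) * weight_bytes

def central_broadcast_bytes(n_nodes: int, weight_bytes: int) -> int:
    """Central parameter server: each node sends gradients up once and
    receives updated weights back once, so total traffic grows as O(n)."""
    return 2 * n_nodes * weight_bytes

if __name__ == "__main__":
    weight_bytes = 4 * 1_000_000_000  # assume a 1B-parameter fp32 model, ~4 GB
    for n in (4, 16, 64):
        print(f"n={n:3d}  all-to-all: {all_to_all_bytes(n, weight_bytes) / 1e12:7.2f} TB"
              f"   broadcast: {central_broadcast_bytes(n, weight_bytes) / 1e12:6.2f} TB")
```

At 64 nodes the assumed all-to-all pattern moves roughly 16 TB per step versus about 0.5 TB for the broadcast pattern, which is the gap the thread converges on.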