Could you say more about why this happens? Even if the parameters or activations are stored in low precision formats, I would think that the same low precision number would be stored every time. Are the differences between forward passes driven by different hardware configurations or software choices in different instances of the model, or what am I missing?
This is relevant, because with floating point numbers, the order of summation and multiplication can play a role. And I guess with different hardware the calculations are split differently, leading to different execution sequences.
I would also be interested to learn more here, because the cause could also be a memory overflow or something similar.
i’m naive to the details of GPT specifically, but it’s easy to accidentally make any reduction non-deterministic when working with floating point numbers — even before hardware variations.
for example, you want to compute the sum over a 1-billion entry vector where each entry is the number 1. in 32-bit IEEE-754, you should get different results by accumulating linearly (1+(1+(1+…))) vs tree-wise (…((1+1) + (1+1))…).
in practice most implementations do some combination of these. i’ve seen someone do this by batching groups of 100,000 numbers to sum linearly, with each batch dispatched to a different compute unit and the 10,000 results then being summed in a first-come/first-serve manner (e.g. a queue, or even a shared accumulator). then you get slightly different results based on how each run is scheduled (well, the all-1’s case is repeatable with this method but it wouldn’t be with real data).
and then yes, bring in different hardware, and the scope broadens. the optimal batching size (which might be exposed as a default somewhere) changes such that even had you avoided that scheduling-dependent pitfall, you would now see different results than on the earlier hardware. however, you can sometimes tell these possibilities apart! if it’s non-deterministic scheduling, the number of different outputs for the same input is likely higher order than if it’s variation strictly due to differing hardware models. if you can generate 10,000 different outputs from the same input, that’s surely greater than the number of HW models, so it would be better explained by non-deterministic scheduling.
This explanation is basically correct, though it doesn’t have to be different hardware—even different batch sizes can often be sufficient to change the order of summation and multiplication.
Could you say more about why this happens? Even if the parameters or activations are stored in low precision formats, I would think that the same low precision number would be stored every time. Are the differences between forward passes driven by different hardware configurations or software choices in different instances of the model, or what am I missing?
One explanation would be changing hardware, see [this tweet](https://twitter.com/OfirPress/status/1542610741668093952).
This is relevant, because with floating point numbers, the order of summation and multiplication can play a role. And I guess with different hardware the calculations are split differently, leading to different execution sequences.
I would also be interested to learn more here, because the cause could also be a memory overflow or something similar.
i’m naive to the details of GPT specifically, but it’s easy to accidentally make any reduction non-deterministic when working with floating point numbers — even before hardware variations.
for example, you want to compute the sum over a 1-billion entry vector where each entry is the number 1. in 32-bit IEEE-754, you should get different results by accumulating linearly (1+(1+(1+…))) vs tree-wise (…((1+1) + (1+1))…).
in practice most implementations do some combination of these. i’ve seen someone do this by batching groups of 100,000 numbers to sum linearly, with each batch dispatched to a different compute unit and the 10,000 results then being summed in a first-come/first-serve manner (e.g. a queue, or even a shared accumulator). then you get slightly different results based on how each run is scheduled (well, the all-1’s case is repeatable with this method but it wouldn’t be with real data).
and then yes, bring in different hardware, and the scope broadens. the optimal batching size (which might be exposed as a default somewhere) changes such that even had you avoided that scheduling-dependent pitfall, you would now see different results than on the earlier hardware. however, you can sometimes tell these possibilities apart! if it’s non-deterministic scheduling, the number of different outputs for the same input is likely higher order than if it’s variation strictly due to differing hardware models. if you can generate 10,000 different outputs from the same input, that’s surely greater than the number of HW models, so it would be better explained by non-deterministic scheduling.
This explanation is basically correct, though it doesn’t have to be different hardware—even different batch sizes can often be sufficient to change the order of summation and multiplication.