It’s extremely unlikely that the two biggest logits have exactly the same value; there are still a lot of distinct floating-point numbers, even in float16!
The reason there’s no determinism is a combination of lower precision and nondeterministic reduce operations (e.g. sums). For example, the order in which terms are accumulated can vary with the batch size, which, for models as large as GPT-3, can make the logits vary by up to 1%.
Oh interesting, I didn’t realise there was so much nondeterminism for sums on GPUs.
I guess I thought that there are only 65k float16 values, and the two highest ones are going to be chosen from a much smaller range of that 65k just because they have to be bigger than everything else.
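To make the reduction-order point concrete, here’s a tiny NumPy sketch (purely illustrative, nothing like the actual GPU kernels): summing the same float16 values in two different orders usually gives slightly different answers, because floating-point addition isn’t associative and float16 has little precision to absorb the rounding error.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.standard_normal(4096).astype(np.float16)

# Sequential accumulation in the original order.
seq_sum = np.float16(0)
for v in values:
    seq_sum = np.float16(seq_sum + v)

# The same values accumulated in a shuffled order, standing in for a
# GPU reduction whose term order depends on batch size / scheduling.
shuffled = values.copy()
rng.shuffle(shuffled)
alt_sum = np.float16(0)
for v in shuffled:
    alt_sum = np.float16(alt_sum + v)

print(seq_sum, alt_sum)  # typically differ in the low bits
```

The variable names and the shuffle are just a stand-in for whatever accumulation order the hardware happens to pick; the point is only that the result depends on that order.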