It also appears to break determinism in the playground at temperature 0, which shouldn’t happen.
This happens consistently with both the API and the playground on natural prompts too — it seems that OpenAI is just using low enough precision on forward passes that the probability of high probability tokens can vary by ~1% per call.
Could you say more about why this happens? Even if the parameters or activations are stored in low precision formats, I would think that the same low precision number would be stored every time. Are the differences between forward passes driven by different hardware configurations or software choices in different instances of the model, or what am I missing?
One explanation would be changing hardware, see [this tweet](https://twitter.com/OfirPress/status/1542610741668093952).
This is relevant because, with floating point numbers, the order of summation and multiplication can change the result. And I guess with different hardware the calculations are split differently, leading to different execution orders.
I would also be interested to learn more here, because the cause could also be a memory overflow or something similar.
i’m naive to the details of GPT specifically, but it’s easy to accidentally make any reduction non-deterministic when working with floating point numbers — even before hardware variations.
for example, say you want to compute the sum over a 1-billion-entry vector where each entry is the number 1. in 32-bit IEEE-754, you get different results by accumulating linearly (1+(1+(1+…))) vs tree-wise (…((1+1) + (1+1))…): the linear running sum stalls at 2^24 = 16,777,216, because past that point adding 1 falls below the float32 rounding precision, while the tree-wise version adds numbers of similar magnitude at every step and comes out essentially exact.
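for concreteness, here’s a small numpy sketch of that effect (numpy, and a 2^25-element stand-in for the billion, are just my own illustrative choices):

```python
import numpy as np

# at magnitude 2**24, float32 can no longer represent the next integer,
# so adding 1.0 is silently rounded away:
big = np.float32(2**24)
print(big + np.float32(1.0) == big)            # True

# stand-in for the billion ones: 2**25 ones, small enough to run quickly
ones = np.ones(1 << 25, dtype=np.float32)

# left-to-right running sum (cumsum keeps a sequential accumulator):
print(np.cumsum(ones, dtype=np.float32)[-1])   # 16777216.0, stalls at 2**24

# np.sum uses pairwise (tree-wise) summation for float arrays, so the
# partial sums stay balanced and the result comes out exact here:
print(ones.sum(dtype=np.float32))              # 33554432.0
```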
in practice most implementations do some combination of these. i’ve seen someone do this by batching groups of 100,000 numbers to sum linearly, with each batch dispatched to a different compute unit and the 10,000 results then being summed in a first-come/first-serve manner (e.g. a queue, or even a shared accumulator). then you get slightly different results based on how each run is scheduled (well, the all-1’s case is repeatable with this method but it wouldn’t be with real data).
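here’s a toy sketch of that first-come/first-serve pattern, with a thread pool and a shared accumulator standing in for whatever the real dispatch mechanism is (the batch counts and data are made up for illustration):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor, as_completed

rng = np.random.default_rng(0)
data = rng.standard_normal(1_000_000).astype(np.float32)
batches = np.array_split(data, 100)              # 100 batches, summed independently

def reduce_once():
    total = np.float32(0.0)
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(np.sum, b, dtype=np.float32) for b in batches]
        # fold partial sums into a shared accumulator in *completion* order,
        # which depends on thread scheduling and can change from run to run
        for f in as_completed(futures):
            total += f.result()
    return float(total)

print({reduce_once() for _ in range(20)})        # often more than one distinct value
```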
and then yes, bring in different hardware and the scope broadens. the optimal batch size (which might be exposed as a default somewhere) changes, so even if you had avoided that scheduling-dependent pitfall, you would now see different results than on the earlier hardware. however, you can sometimes tell these possibilities apart! if it’s non-deterministic scheduling, the number of distinct outputs for the same input is likely far larger than if the variation is strictly due to differing hardware models. if you can generate 10,000 different outputs from the same input, that’s surely more than the number of hardware models in play, so it would be better explained by non-deterministic scheduling.
This explanation is basically correct, though it doesn’t have to be different hardware—even different batch sizes can often be sufficient to change the order of summation and multiplication.
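For example, here is a sketch of how merely regrouping the same numbers into different chunk sizes changes a float32 sum, even though each grouping is perfectly deterministic on its own (the data and chunk sizes are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000).astype(np.float32)

def chunked_sum(v, chunk):
    # sum each chunk, then fold the partial sums together left-to-right
    parts = [v[i:i + chunk].sum(dtype=np.float32) for i in range(0, len(v), chunk)]
    total = np.float32(0.0)
    for p in parts:
        total += p
    return float(total)

# each chunk size is deterministic on its own, but the groupings typically
# disagree with each other in the low-order bits
for chunk in (1_000, 4_096, 100_000):
    print(chunk, chunked_sum(x, chunk))
```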
Good to know. Thanks!
I might be missing something, but why does temperature 0 imply determinism? Neural nets don’t work with real numbers, they work with floating point numbers, so despite temperature 0 implying an argmax, there’s no reason there aren’t just multiple maxima. AFAICT GPT-3 uses half-precision floating point numbers, so there’s quite a lot of space for collisions.
It’s extremely unlikely that the two biggest logits have exactly the same value — there are still a lot of floating point numbers, even with float16!!
The reason there’s no determinism is a combination of low precision and nondeterministic reduce operations (e.g. sums). For example, the order in which terms are accumulated can vary with the batch size, which, for models as large as GPT-3, can make the logits vary by up to 1%.
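For a sense of scale on the float16 point, here is a quick check of how much room half precision actually has (just an illustration, not tied to GPT-3’s actual numerics):

```python
import numpy as np

# count the distinct finite float16 values by viewing every 16-bit pattern
# as a half-precision float: 63,488 of the 65,536 patterns are finite
patterns = np.arange(2**16).astype(np.uint16).view(np.float16)
print(np.isfinite(patterns).sum())        # 63488

# relative spacing (machine epsilon) of float16, and the absolute gap
# between adjacent representable values near a logit of magnitude ~10
eps = np.finfo(np.float16).eps
print(eps, float(eps) * 8)                # ~0.000977 and ~0.0078
```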
Oh interesting, didn’t realise there was so much nondeterminism for sums on GPUs.
I guess I thought that there are only 65k float16s, and the two highest logits are going to be chosen from a much smaller range of that 65k, just because they have to be bigger than everything else.
Can confirm I consistently had non-deterministic temp-0 completions on older davinci models accessed through the API last year.
I noticed this happening with goose.ai’s API as well, using the gpt-neox model, which suggests that the cause of the nondeterminism isn’t unique to OpenAI’s setup.