This means the smallest available positive value must be used, and so if two tokens’ logits are sufficiently close, multiple distinct outputs may be seen if the same prompt is repeated enough times at “zero” temperature.
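To make that failure mode concrete, here is a minimal sketch of temperature sampling with the temperature clamped to a small positive value. This is only an illustration of the mechanism, not OpenAI's actual sampling code; the logits and the clamp value are made up.

```python
import numpy as np

def sample_with_temperature(logits, temperature, rng):
    # Temperature scaling: divide the logits by T before the softmax.
    # T = 0 would be a division by zero, so implementations clamp it
    # to some smallest positive value instead.
    scaled = np.array(logits) / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return rng.choice(len(logits), p=probs)

rng = np.random.default_rng(0)

# Two tokens whose logits are nearly identical (gap of ~1e-7).
logits = [10.0000000, 9.9999999]

# With "zero" temperature clamped to 1e-6, the logit gap is comparable
# to the temperature, so both tokens keep non-negligible probability
# and repeated sampling produces more than one distinct output.
samples = [sample_with_temperature(logits, 1e-6, rng) for _ in range(20)]
print(samples)  # a mix of 0s and 1s, not a single deterministic token
```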
I don’t think this is the cause of OpenAI API nondeterminism at temperature 0.
If I make several API calls at temperature 0 with the same prompt, and inspect the logprobs in the response, I notice that:

- The sampled token is always the one with the highest logprob.
- The logprobs themselves are slightly different between API calls.
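Here is a rough sketch of that experiment, assuming the current openai Python client (v1+), a chat model, and an `OPENAI_API_KEY` in the environment; the model name and prompt are placeholders rather than what I originally used.

```python
from openai import OpenAI

client = OpenAI()

def top_logprobs(prompt: str, model: str = "gpt-4o-mini"):
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=1,
        logprobs=True,
        top_logprobs=5,
    )
    first = resp.choices[0].logprobs.content[0]
    # Return the sampled token plus the top candidates and their logprobs.
    return first.token, [(c.token, c.logprob) for c in first.top_logprobs]

for _ in range(3):
    token, candidates = top_logprobs("Once upon a time,")
    print(token, candidates)
# Expected: the sampled token is the same on every call, but the logprob
# values drift slightly from call to call.
```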
That is, we are deterministically sampling from the model’s output logits in the expected way (matching the limit of temperature sampling as T → 0), but the model’s output logits are not themselves deterministic.
I don’t think we know for sure why the model is non-deterministic. However, it’s not especially surprising: on GPUs there is often a tradeoff between determinism and efficiency, and OpenAI is trying hard to run their models as efficiently as possible.
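One likely ingredient: floating-point addition is not associative, so a parallel reduction whose accumulation order varies from run to run (as it can with atomics or different kernel launch configurations) produces slightly different results. The sketch below shows the underlying numerical effect on the CPU; it is not a claim about OpenAI's stack.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000).astype(np.float32)

in_order = np.sum(x)                    # one summation order
shuffled = np.sum(rng.permutation(x))   # same numbers, different order
print(in_order, shuffled, in_order - shuffled)
# The two sums differ in the low-order bits even though they are
# mathematically identical. Tiny discrepancies like this, accumulated
# across the many reductions in a large model, are enough to perturb
# the output logits from one run to the next.
```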