GPT-2 works by deterministically computing a probability distribution over the next token, then sampling from it. It is plausible that the probability it assigns to 6 is no larger than 80%, but it is simple enough to postprocess the distribution so that any probability above 50% is rounded up to 100%. (This isn't always done, because when completing a list prefix of length 4 the model would then always produce an infinite list: the probability of another , is more than 50%.)
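The postprocessing step can be sketched as follows. This is a minimal illustration, not GPT-2's actual sampling code; the toy next-token distribution and the `postprocess` helper are invented for the example.

```python
import random

def postprocess(probs, threshold=0.5):
    """If any token's probability exceeds the threshold, make it
    certain (probability 1.0); otherwise leave the distribution
    unchanged. `probs` maps token -> probability (summing to 1)."""
    for token, p in probs.items():
        if p > threshold:
            return {token: 1.0}
    return probs

def sample(probs):
    """Draw one token according to the given distribution."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return random.choices(tokens, weights=weights)[0]

# Hypothetical next-token distribution where "6" has probability 0.8:
probs = {"6": 0.8, "8": 0.1, ",": 0.05, "5": 0.05}
print(sample(postprocess(probs)))  # always "6": 0.8 > 0.5, so it is rounded up to 1.0
```

With the threshold at 0.5, at most one token can qualify, so the rounding is unambiguous; this also shows why a >50% probability of another `,` would lock the sampler into emitting commas forever.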