Some of the experiments are pretty easy to replicate, e.g. checking text-davinci-003’s favorite random number:
Seems much closer to base davinci than to text-davinci-002’s mode collapse.
I tried to replicate some of the other experiments, but it turns out that text-davinci-003 doesn't answer the questions the same way as davinci/text-davinci-002, which probably means that the prompts have to be adjusted. For example, on the “roll a d6” test, text-davinci-003 assigns almost no probability to the numbers 1-6, and a lot of probability to things like X and ____. (You can fix this using logit_bias, but I’m not sure we should trust the relative ratios of incredibly unlikely tokens in the first place.)
By contrast, both text-davinci-002 and davinci assign much higher probabilities to the numbers than to other options, and text-davinci-002 even assigns more than 73% probability to the token 6.
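To make the logit_bias caveat concrete, here is a rough sketch of what restricting the answer to the six die faces amounts to: keep only those tokens and renormalize. It doesn't call any API. The `renormalize` helper and the numbers in `example_logprobs` are made up for illustration (they are not measurements from any model), and I'm assuming you already have a token-to-logprob mapping from somewhere.

```python
import math

def renormalize(logprobs, allowed):
    """Restrict a token -> logprob mapping to `allowed` and renormalize.

    Returns (mass, conditional): `mass` is the total probability the
    allowed tokens had before renormalizing, and `conditional` is the
    distribution over them afterwards.
    """
    probs = {}
    for token, lp in logprobs.items():
        face = token.strip()  # treat " 6" and "6" as the same answer
        if face in allowed:
            probs[face] = probs.get(face, 0.0) + math.exp(lp)
    mass = sum(probs.values())
    if mass == 0.0:
        raise ValueError("none of the allowed tokens appear in logprobs")
    return mass, {face: p / mass for face, p in probs.items()}

# Hypothetical values, invented for illustration only.
example_logprobs = {
    "X": math.log(0.40),
    "____": math.log(0.30),
    "1": math.log(0.0002),
    "2": math.log(0.0001),
    "3": math.log(0.0003),
    "4": math.log(0.0001),
    "5": math.log(0.0001),
    "6": math.log(0.0002),
}

DIE_FACES = {"1", "2", "3", "4", "5", "6"}
mass, conditional = renormalize(example_logprobs, DIE_FACES)
print(f"mass on die faces before renormalizing: {mass:.4f}")
for face in sorted(conditional):
    print(face, round(conditional[face], 3))
```

The thing to look at is `mass`: when it is tiny, the renormalized distribution is built entirely out of tokens the model almost never produces, which is why I wouldn't read much into the ratios.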