OK, so what actually happened is that I didn't realize the link provided doesn't go directly to gpt2-chatbot (the front page just pits two random chatbots from the list against each other). After figuring that out, I reran my tests; it reversed lists of 20, 40, and 100 numbers perfectly.
As one more test, it came rather close on reversing 400 numbers:
Given these results, this is clearly a fairly capable model (although Claude Opus was also able to do it perfectly, so it may not be SOTA).
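For concreteness, here is a minimal sketch of the kind of list-reversal probe described above. The prompt wording, the `make_reversal_test` helper, and the position-match scoring are my own illustration, not the exact test that was run:

```python
import random

def make_reversal_test(n, seed=0):
    """Generate n random integers plus a prompt asking a model to reverse them.
    (Hypothetical prompt wording, for illustration only.)"""
    rng = random.Random(seed)
    nums = [rng.randint(0, 999) for _ in range(n)]
    prompt = ("Reverse the following list of numbers. "
              "Output only the reversed list, comma-separated:\n"
              + ", ".join(map(str, nums)))
    return nums, prompt

def check_reversal(nums, model_output):
    """Score a model reply: fraction of positions matching the true reversal."""
    expected = list(reversed(nums))
    try:
        got = [int(tok) for tok in model_output.replace(",", " ").split()]
    except ValueError:
        return 0.0  # reply wasn't a clean list of integers
    matches = sum(a == b for a, b in zip(expected, got))
    return matches / max(len(expected), len(got))

nums, prompt = make_reversal_test(100)
perfect = ", ".join(str(x) for x in reversed(nums))
print(check_reversal(nums, perfect))  # 1.0 for a perfect reversal
```

Scoring by matched positions (rather than pass/fail) is what lets a 400-number attempt register as "rather close" instead of simply wrong.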
Going back to the original question of where this model came from, I have trouble putting the probability that it comes from OpenAI above 50%, mainly because of how it was publicized. Releasing an unannounced model on Chatbot Arena seems like a strange choice, especially without any corresponding registry update on GitHub (which would live in https://github.com/lm-sys/FastChat/blob/851ef88a4c2a5dd5fa3bcadd9150f4a1f9e84af1/fastchat/model/model_registry.py#L228 ). That said, my error margins are still pretty large, given how little information I can find.
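As an illustration of what "checking the registry" could look like, here is a sketch that extracts registered model names from registry-style source text. The `registered_models` helper and the toy `sample` string are hypothetical, though FastChat's model_registry.py does consist of `register_model_info()` calls of roughly this shape:

```python
import re

def registered_models(registry_source):
    """Pull model-name strings out of register_model_info([...], ...) calls.
    Minimal regex-based parser, for illustration only."""
    names = []
    for match in re.finditer(r'register_model_info\(\s*\[([^\]]*)\]', registry_source):
        names += re.findall(r'"([^"]+)"', match.group(1))
    return names

# Toy snippet mimicking one registry entry (not the actual file contents):
sample = '''
register_model_info(
    ["gpt-4-turbo"],
    "GPT-4-Turbo",
    "https://openai.com",
    "GPT-4-Turbo by OpenAI",
)
'''
print("gpt2-chatbot" in registered_models(sample))  # False: not registered
```

Running something like this against the raw file at the pinned commit would confirm that gpt2-chatbot has no registry entry, which is the absence being pointed to above.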
Nah, it’s just a PR stunt. Remember when DeepMind released AlphaGo Master by simply running a ‘Magister’ Go player online which went undefeated?* Everyone knew it was DeepMind simply because who else could it be? And IIRC, didn’t OA also pilot OA5 ‘anonymously’ on Dota 2 ladders? Or how about when Mistral released torrents? (If they had really wanted a blind test, they wouldn’t’ve called it “gpt2”, or they could’ve just rolled it out to a subset of ChatGPT users, who would have no way of knowing the model underneath the interface had been swapped out.)
* One downside of that covert testing: DM AFAIK never released a paper on AG Master, or on all the complicated & interesting things they were trying before they hit upon the AlphaZero approach.
I’ve retracted my previous comments.