I’d like to make a fairly systematic comparison of OpenAI’s chatbot performance in French and English. After a couple of days of trying things out, I feel like it is much weaker in French, which seems logical since it has much less French training data. I would like to explore that theory, so if you have interesting prompts you would like me to test, let me know!
How systematic are we talking here? At research-paper level, BIG-Bench (https://arxiv.org/pdf/2206.04615.pdf) (https://github.com/google/BIG-bench) is a good benchmark, but even testing one of its tasks, let alone a good subset of them (like BIG-Bench Hard), would require a lot of dataset translation, and would also require chain-of-thought prompting to do well. (Admittedly, I would also be curious to see how well the model does when self-translating instructions from English to French or vice versa, then following them. Could GPT actually do better if it translates French to English and then answers the prompt, vs. just doing it in French?)
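For concreteness, a minimal sketch of that self-translation comparison might look something like the following (assuming the OpenAI Python client; the model name and the example prompt are placeholders, not a recommendation):

```python
# Hypothetical sketch: compare answering a French prompt directly vs. translating
# it to English first, answering, and translating back. Assumes the OpenAI
# Python client (>=1.0); the model name and prompt are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
MODEL = "gpt-3.5-turbo"  # placeholder model name

def ask(prompt: str) -> str:
    """Send a single-turn prompt and return the model's reply."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

french_prompt = "Expliquez le paradoxe de Fermi en trois paragraphes."

# Condition 1: answer directly in French.
direct_answer = ask(french_prompt)

# Condition 2: self-translate to English, answer, then translate the answer back.
english_prompt = ask(f"Translate the following instruction into English:\n\n{french_prompt}")
english_answer = ask(english_prompt)
back_translated = ask(f"Traduisez ce texte en français :\n\n{english_answer}")

print("--- Direct (FR) ---\n", direct_answer)
print("--- Via English ---\n", back_translated)
```

You would then compare the two outputs for the same underlying task, which keeps the evaluation informal but at least controls the prompt content across conditions.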
Even if you’re just playing around though, BIG-Bench should give you a lot of ideas.
Sadly, both my time and capacity are limited to “trying out some prompts to get a feel for what the results look like.” I may do more if the results are actually interesting.
One of the first tasks I tested was actually writing essays in English from a prompt in French, which it did very well; I would say better than when it was asked to write the essay in French. I haven’t looked at the inverse task yet, though (a prompt in English asking for an essay in French).
I’ll probably translate the prompts through DeepL with a bit of supervision and analyse the results using the thoroughly scientific method of “my gut feeling,” with maybe some added “my mother’s expertise.”
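In case it’s useful, a rough sketch of the DeepL step with the official Python client could look like this (the auth key and the prompt list are placeholders for illustration):

```python
# Rough sketch: batch-translate English prompts into French so the same task
# can be run in both languages. Assumes the official `deepl` Python package
# and a valid DeepL API key; the prompts here are made-up examples.
import deepl

translator = deepl.Translator("YOUR_DEEPL_AUTH_KEY")  # placeholder key

english_prompts = [
    "Write a short essay on the Fermi paradox.",
    "Summarise the plot of Les Misérables in five sentences.",
]

# target_lang="FR" requests French; source_lang is optional but avoids
# misdetection on very short prompts.
french_prompts = [
    translator.translate_text(p, source_lang="EN", target_lang="FR").text
    for p in english_prompts
]

for en, fr in zip(english_prompts, french_prompts):
    print(f"EN: {en}\nFR: {fr}\n")
```

The “bit of supervision” would then just be reading over the machine translations before using them as prompts.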