InternLM—China’s Best (Unverified)
A technical report on InternLM was released on 6/7. The model has 104 billion parameters, was trained on 1.6 trillion tokens, and was fine-tuned for performance in Chinese.
The authors claim it performed second-best on the Chinese-language benchmark C-Eval, behind only GPT-4. It also performed at the level of GPT-3.5 on one-shot MMLU, and a version fine-tuned for programming performed comparably to GPT-3.5 on coding benchmarks like HumanEval.
Notable takeaways:
Significant effort was put into parallelization to help work around the US chip ban (a toy sketch of what that can mean in practice follows this list). I don’t know how impressive this actually is.
It achieved GPT-3.5-level performance with broadly similar amounts of compute and data. The China-America algorithmic gap is shrinking.
My gut feeling is that the model was fine-tuned very specifically to perform well on standardized tests, especially Chinese ones (GK refers to the Gaokao, the Chinese college entrance exam). It was also consistently bad at math.
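The report is not quoted here on the specifics, so purely as a hypothetical illustration of what “parallelization” can mean: the sketch below simulates column-wise tensor parallelism, where a layer’s weight matrix is split across devices so no single accelerator has to hold or multiply the full matrix. The function name and the NumPy-simulated “devices” are my own for illustration, not anything taken from the InternLM report.

```python
# Hypothetical sketch only: tensor (column) parallelism in miniature, with
# "devices" simulated as plain NumPy arrays rather than real accelerators.
import numpy as np

def column_parallel_linear(x, weight, num_devices):
    """Simulate y = x @ weight with the weight split column-wise across devices."""
    shards = np.split(weight, num_devices, axis=1)      # each device holds one column shard
    partial_outputs = [x @ shard for shard in shards]   # each device does a local matmul
    return np.concatenate(partial_outputs, axis=-1)     # gather the column slices back together

x = np.random.randn(4, 512)           # a small batch of activations
weight = np.random.randn(512, 2048)   # the full (unsharded) weight matrix
y = column_parallel_linear(x, weight, num_devices=4)
assert np.allclose(y, x @ weight)     # matches the unsharded computation
```

The point of the trick is that each device only ever stores and multiplies a quarter of the weight matrix here; real training stacks several such schemes (data, tensor, and pipeline parallelism) to fit large models onto weaker or fewer chips.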