For those curious about the performance: eyeballing the technical report, it roughly performs at the level of Llama-3 70B. It seems to have an inferior parameters-to-performance ratio, likely because it was only trained on 9 trillion tokens, while the Llama-3 models were trained on 15 trillion. It's also trained with a 4k context length as opposed to Llama-3's 8k. Its primary purpose seems to be serving as part of a synthetic data generation pipeline.