I mean, we don’t know all the details, but Qwen2 was explicitly trained on synthetic data from Qwen1.5 + “high-quality multi-task instruction data”. I wouldn’t be surprised if the same were true of Qwen 1.5.
From the Qwen2 report:

> **Quality Enhancement** The filtering algorithm has been refined with additional heuristic and model-based methods, including the use of the Qwen models to filter out low-quality data. Moreover, these models are utilized to synthesize high-quality pre-training data. (Page 5) [...] Similar to previous Qwen models, high-quality multi-task instruction data is integrated into the Qwen2 pre-training process to enhance in-context learning and instruction-following abilities.
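To make "model-based" filtering concrete, here is a minimal sketch of one common approach (perplexity thresholding with a small scorer model). The report does not say which criteria Qwen actually uses, so the scorer model, threshold, and helper names below are placeholders, not their pipeline:

```python
# Hedged sketch: perplexity-based quality filtering with a small causal LM.
# The Qwen2 report only says Qwen models were used to filter low-quality data;
# the scorer model, threshold, and 512-token cap here are illustrative choices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

SCORER = "Qwen/Qwen1.5-0.5B"  # placeholder scorer; any small causal LM works
tokenizer = AutoTokenizer.from_pretrained(SCORER)
model = AutoModelForCausalLM.from_pretrained(SCORER)
model.eval()

def perplexity(text: str) -> float:
    """How 'surprised' the scorer is by a document (lower = more fluent)."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        out = model(**inputs, labels=inputs["input_ids"])
    return torch.exp(out.loss).item()

def keep(text: str, threshold: float = 50.0) -> bool:
    """Keep documents below an (illustrative) perplexity threshold."""
    return perplexity(text) < threshold

corpus = [
    "Gradient descent minimizes a differentiable loss function.",
    "asdf qwer zxcv click here buy now !!1!",
]
filtered = [doc for doc in corpus if keep(doc)]
```

The quote also mentions heuristic filters alongside the model-based ones; the sketch only covers the latter.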
Similarly, Gemma 2 had its pretraining corpus filtered to remove “unwanted or unsafe utterances”. From the Gemma 2 tech report:
> We use the same data filtering techniques as Gemma 1. Specifically, we filter the pretraining dataset to reduce the risk of unwanted or unsafe utterances, filter out certain personal information or other sensitive data, decontaminate evaluation sets from our pre-training data mixture, and reduce the risk of recitation by minimizing the proliferation of sensitive outputs. (Page 3) [...] We undertook considerable safety filtering of our pre-training data to reduce the likelihood of our pre-trained and fine-tuned checkpoints producing harmful content. (Page 10)
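Of the steps listed there, the one with the most standard recipe is decontamination. As an illustration only (the Gemma 2 report doesn't detail its actual procedure), here's a minimal n-gram-overlap check; the n=13 window and function names are my placeholders:

```python
# Hedged sketch of evaluation-set decontamination via n-gram overlap, one common
# way to "decontaminate evaluation sets from the pre-training data mixture".
# Not Gemma's documented method; n=13 is a conventional placeholder window.
def ngrams(text: str, n: int = 13) -> set:
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(document: str, eval_ngrams: set, n: int = 13) -> bool:
    """Flag a pre-training document that shares any n-gram with the eval sets."""
    return bool(ngrams(document, n) & eval_ngrams)

eval_set = ["What is the capital of France? Paris is the capital of France today."]
eval_ngrams = set().union(*(ngrams(example) for example in eval_set))

corpus = ["Some web page about cooking pasta al dente with plenty of salt water."]
clean = [doc for doc in corpus if not is_contaminated(doc, eval_ngrams)]
```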
> Qwen2 was explicitly trained on synthetic data from Qwen1.5
~~Where is the evidence for this claim? (Claude 3.5 Sonnet could also not find evidence on one rollout)~~ EDITED TO ADD: “these [Qwen] models are utilized to synthesize high-quality pre-training data” is clear evidence, I am being silly.
All other techniques mentioned here (e.g., filtering and adding more IT data at the end of training) still sound like models “trained to predict the next word on the internet” (I don’t think whether the training samples are IID across early and late training is an important detail).
I’m not disputing that they were trained with next token prediction log loss (if you read the tech reports they claim to do exactly this) — I’m just disputing the “on the internet” part, due to the use of synthetic data and private instruction following examples.
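For clarity on the part neither of us is disputing, the objective itself is standard next-token prediction with cross-entropy (log) loss, applied to whatever mixture of web, synthetic, and instruction data goes in. A minimal sketch (shapes and names are mine, not from either tech report):

```python
# Hedged sketch of the next-token prediction objective both sides agree on:
# cross-entropy (log loss) over shifted token sequences. Illustrative only.
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
    """logits: (batch, seq_len, vocab); token_ids: (batch, seq_len)."""
    # Predict token t+1 from positions up to t: drop the last logit, the first label.
    shifted_logits = logits[:, :-1, :]
    targets = token_ids[:, 1:]
    return F.cross_entropy(
        shifted_logits.reshape(-1, shifted_logits.size(-1)),
        targets.reshape(-1),
    )

# Tiny example: random "model output" over a 10-token vocabulary.
logits = torch.randn(2, 8, 10)
tokens = torch.randint(0, 10, (2, 8))
print(next_token_loss(logits, tokens))  # scalar log loss
```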