Are these really pure base models? I’ve also noticed this kind of behaviour in so-called base models. My conclusion is that they are not base models in the sense that they have been trained to predict the next word on the internet, but they have undergone some safety fine-tuning before release. We don’t actually know how they were trained, and I am suspicious. It might be best to test on Pythia or some model where we actually know.
I mean, we don’t know all the details, but Qwen2 was explicitly trained on synthetic data from Qwen1.5 + “high-quality multi-task instruction data”. I wouldn’t be surprised if the same were true of Qwen 1.5.
Quality Enhancement The filtering algorithm has been refined with additional heuristic and modelbased methods, including the use of the Qwen models to filter out low-quality data. Moreover, these models are utilized to synthesize high-quality pre-training data. (Page 5) [...] Similar to previous Qwen models, high-quality multi-task instruction data is integrated into the Qwen2 pre-training process to enhance in-context learning and instruction-following abilities.
Similarly, Gemma 2 had its pretraining corpus filtered to remove “unwanted or unsafe utterances”. From the Gemma 2 tech report:
We use the same data filtering techniques as Gemma 1. Specifically, we filter the pretraining dataset to reduce the risk of unwanted or unsafe utterances, filter out certain personal information or other sensitive data, decontaminate evaluation sets from our pre-training data mixture, and reduce the risk of recitation by minimizing the proliferation of sensitive outputs. (Page 3) [...] We undertook considerable safety filtering of our pre-training data to reduce the likelihood of our pre-trained and fine-tuned checkpoints producing harmful content. (page 10)
> Qwen2 was explicitly trained on synthetic data from Qwen1.5
~~Where is the evidence for this claim? (Claude 3.5 Sonnet could also not find evidence on one rollout)~~ EDITED TO ADD: “these [Qwen] models are utilized to synthesize high-quality pre-training data” is clear evidence, I am being silly.
All other techinques mentioned here (e.g. filtering and adding more IT data at end of training) still sound like models “trained to predict the next word on the internet” (I don’t think the training samples being IID early and late in training is an important detail)
I’m not disputing that they were trained with next token prediction log loss (if you read the tech reports they claim to do exactly this) — I’m just disputing the “on the internet” part, due to the use of synthetic data and private instruction following examples.
LLaMA 1 7B definitely seems to be a “pure base model”. I agree that we have less transparency into the pre-training of Gemma 2 and Qwen 1.5, and I’ll add this as a limitation, thanks!
I’ve checked that Pythia 12b deduped (pre-trained on the pile) also refuses harmful requests, although at a lower rate (13%). Here’s an example, using the following prompt template:
“”″User: {instruction}
Assistant:”″”
It’s pretty dumb though, and often just outputs nonsense. When I give it the Vicuna system prompt, it refuses 100% of harmful requests, though it has a bunch of “incompetent refusals”, similar to LLaMA 1 7B:
“”″A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user’s questions.
I wonder why this happens?! I can’t find any examples of this in the pile, and their filtering doesn’t seem to add it. It’s hard to imagine humans generating this particular refusal response. Perhaps it’s a result of filtering, s
it’s quite common for assistants to refuse instructions, especially harmful instructions. so i’m not surprised that base llms systestemically refuse harmful instructions from than harmless ones.
Are these really pure base models? I’ve also noticed this kind of behaviour in so-called base models. My conclusion is that they are not base models in the sense that they have been trained to predict the next word on the internet, but they have undergone some safety fine-tuning before release. We don’t actually know how they were trained, and I am suspicious. It might be best to test on Pythia or some model where we actually know.
I mean, we don’t know all the details, but Qwen2 was explicitly trained on synthetic data from Qwen1.5 + “high-quality multi-task instruction data”. I wouldn’t be surprised if the same were true of Qwen 1.5.
From the Qwen2 report:
Similarly, Gemma 2 had its pretraining corpus filtered to remove “unwanted or unsafe utterances”. From the Gemma 2 tech report:
> Qwen2 was explicitly trained on synthetic data from Qwen1.5
~~Where is the evidence for this claim? (Claude 3.5 Sonnet could also not find evidence on one rollout)~~
EDITED TO ADD: “these [Qwen] models are utilized to synthesize high-quality pre-training data” is clear evidence, I am being silly.
All other techinques mentioned here (e.g. filtering and adding more IT data at end of training) still sound like models “trained to predict the next word on the internet” (I don’t think the training samples being IID early and late in training is an important detail)
I’m not disputing that they were trained with next token prediction log loss (if you read the tech reports they claim to do exactly this) — I’m just disputing the “on the internet” part, due to the use of synthetic data and private instruction following examples.
LLaMA 1 7B definitely seems to be a “pure base model”. I agree that we have less transparency into the pre-training of Gemma 2 and Qwen 1.5, and I’ll add this as a limitation, thanks!
I’ve checked that Pythia 12b deduped (pre-trained on the pile) also refuses harmful requests, although at a lower rate (13%). Here’s an example, using the following prompt template:
“”″User: {instruction}
Assistant:”″”
It’s pretty dumb though, and often just outputs nonsense. When I give it the Vicuna system prompt, it refuses 100% of harmful requests, though it has a bunch of “incompetent refusals”, similar to LLaMA 1 7B:
“”″A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user’s questions.
USER: {instruction}
ASSISTANT:”″”
Interesting, that changes my mind somewhat.
I wonder why this happens?! I can’t find any examples of this in the pile, and their filtering doesn’t seem to add it. It’s hard to imagine humans generating this particular refusal response. Perhaps it’s a result of filtering, s
it’s quite common for assistants to refuse instructions, especially harmful instructions. so i’m not surprised that base llms systestemically refuse harmful instructions from than harmless ones.
Indeed. The base LLM would likely predict a “henchman” to be a lot less scrupulous than an “assistant”.