Most people who are trying CodeLlama 70B are not actually using its unusual prompt templating, because it’s not supported in common open source tools for interacting with models. This is corrupting the results to a degree (though, curiously, the smaller models don’t seem to be affected).
Smaller Qwen models are pretty good, at least.
Local models are chosen for some business applications dealing with sensitive data that you don’t want to send over an API, so Mistral may have decided to make a local model for a customer. Especially with it being tuned from a Llama 70B, this makes sense to me? The customer would need to have (a form of) the model to run inference locally. Miqu responds similarly to Mistral-Medium but is apparently less good at tasks that take more intelligence, which is what you’d expect if it’s a less capable model trained on the same dataset.
Fine-tuning a Llama2-70B for a customer use case that needs local functionality hopefully isn’t that unsafe, because that’s pretty common and reasonable. No one especially wants to send medical records to the cloud. But obviously the person you’re fine-tuning for can leak whatever they get if they want.
Prompt engineering is most helpful when you have an LLM step as part of an algorithm, and you want to get optimal outputs repeatedly.
It’s also helpful when working with different kinds of model, though the techniques are model-specific: getting capability out of 7B models benefits from examples in one way, slightly larger models in another, Mixtral in another, and smarter models like GPT-4 or Mistral-Medium in yet another; some prompting techniques that work with one kind actively impede another. Few-shot prompting is great for completion models; a single one-shot example greatly helped the 7B and 10.7B models I know, but additional examples actively impeded them in the use case I was testing; telling the model not to do something helped the 7B but impeded the 10.7B; Mixtral is better at just following counterfactual instructions, while smart anti-hallucinating models desperately want to correct you …
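For concreteness, here is a minimal sketch of what the zero-/one-/few-shot variants of the same prompt look like; the extraction task, wording, and example sentences are all invented for illustration, not taken from the use case above:

```python
# Illustrative only: the same task phrased with a configurable number of
# worked examples. Per the comment above, a single example tended to help
# the 7B/10.7B models, while piling on more examples could actively hurt.

INSTRUCTION = "Extract the city name from the sentence. Answer with the city only."

EXAMPLES = [
    ("I flew into Lisbon late on Tuesday.", "Lisbon"),
    ("The conference was held in Osaka this year.", "Osaka"),
    ("She grew up just outside of Nairobi.", "Nairobi"),
]

def build_prompt(sentence: str, n_shots: int = 0) -> str:
    """Assemble a plain-text prompt with n_shots worked examples."""
    parts = [INSTRUCTION, ""]
    for text, answer in EXAMPLES[:n_shots]:
        parts += [f"Sentence: {text}", f"City: {answer}", ""]
    parts += [f"Sentence: {sentence}", "City:"]
    return "\n".join(parts)

if __name__ == "__main__":
    # one-shot variant; pass n_shots=0 or n_shots=3 to compare behaviour
    print(build_prompt("We finally made it to Reykjavik.", n_shots=1))
```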
Based on Mensch’s response, Miqu is probably continued pretraining starting from Llama2-70B, a process similar to how CodeLlama or Llemma were trained. (Training on large datasets comparable in size to the original pretraining dataset is usually not called fine-tuning.)
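As a rough illustration of what “continued pretraining” means mechanically (not Mistral’s actual recipe): you load an already-pretrained checkpoint and resume ordinary causal-LM training on a large new corpus, with every weight trainable. A toy single-process sketch with the Hugging Face Trainer, using a placeholder base model and a hypothetical corpus path; a real run at this scale would use a distributed training framework:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

BASE = "meta-llama/Llama-2-7b-hf"  # small stand-in for Llama2-70B

tokenizer = AutoTokenizer.from_pretrained(BASE)
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(BASE)  # warm start from pretrained weights

# hypothetical new corpus; continued pretraining uses data on a scale
# comparable to the original pretraining set, not a small instruction set
raw = load_dataset("text", data_files={"train": "new_corpus/*.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

train = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,  # all parameters receive gradients, unlike a LoRA-style run
    args=TrainingArguments(output_dir="continued-pretrain", per_device_train_batch_size=1),
    train_dataset=train,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # same next-token objective as pretraining, just resumed from trained weights
```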
less capable model trained on the same dataset
If Miqu underwent continued pretraining from Llama2-70B, its dataset won’t be quite the same as Mistral-Medium’s, unless Mistral-Medium is also pretrained on top of Llama2-70B (in which case it won’t be released under Apache 2).
Hmm. Not sure how relevant here, but do we currently have any good terms to distinguish full-tuning the model in line with the original method of pretraining, and full layer LoRA adaptations that ‘effectively’ continue pretraining but are done in a different manner? I’ve seen that LoRA can be used for continued pretraining as well as fine-tuning, but I don’t know if I’d actually call it a full tune, and I don’t think it has the same costs. I’m honestly unsure what distinguishes a pretraining LoRA from a fine-tuning LoRA.
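For reference, a minimal PyTorch sketch of what a LoRA adapter actually trains, whichever name the run goes by: the pretrained weight stays frozen and only a low-rank pair of matrices gets gradients. This is illustrative, not any particular library’s implementation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a pretrained nn.Linear so that only a low-rank update is trained."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # frozen pretrained weight
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        # A is small Gaussian, B is zero, so the initial update is exactly zero
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = base(x) + scale * x A^T B^T  (low-rank correction on top of the frozen layer)
        return self.base(x) + self.scale * (x @ self.lora_A.T @ self.lora_B.T)

# "Full layer" LoRA in the sense above would wrap every linear layer in every
# block this way; either way only the A/B factors, a tiny fraction of the
# parameter count, receive gradients.
```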
Even if the dataset is a bit different between Miqu and Mistral-Medium, they apparently have quite similar policies, and, to my understanding, continued pretraining would push the model even further toward the new dataset than fine-tuning would.
Right, the probable way of doing continued pretraining could just as well be called “full-tuning”, or just “tuning” (which is what you said, not “fine-tuning”), as opposed to “fine-tuning” that trains fewer weights. Though people seem unsure whether “fine-tuning” implies that it’s not full-tuning, which leads to terms like “dense fine-tuning” to mean full-tuning.
good terms to distinguish full-tuning the model in line with the original method of pretraining, and full layer LoRA adaptations that ‘effectively’ continue pretraining but are done in a different manner
You mean like ReLoRA, where full-rank pretraining is followed by many batches of LoRA that get baked in? Fine-pretraining :-) Feels like a sparsity-themed training efficiency technique, which doesn’t lose any centrality points for being used for “pretraining”. To my mind, tuning is cheaper adaptation, things that use OOMs less data than pretraining (even if it’s full-tuning). So maybe the terms tuning/pretraining should be defined by the role those parts of the training play in the overall process rather than by the algorithms involved? This makes fine-tuning an unnecessarily specific term, claiming both that it’s tuning and that it trains fewer weights.
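A rough sketch of that “baked in” step (simplified; the actual ReLoRA method also resets optimizer state and re-warms the learning rate): after a chunk of low-rank training, the adapter is merged into the full-rank weight and a fresh adapter is started, so repeated cheap low-rank updates can accumulate into a full-rank change. Tensor names and shapes are assumptions for illustration:

```python
import torch

@torch.no_grad()
def merge_and_restart(W: torch.Tensor, A: torch.Tensor, B: torch.Tensor, scale: float) -> None:
    """Fold the low-rank update into the full-rank weight, then reset the adapter.

    Assumed shapes: W is (out, in), B is (out, rank), A is (rank, in).
    """
    W += scale * (B @ A)                # bake the current adapter into the base weight
    torch.nn.init.normal_(A, std=0.01)  # fresh low-rank factors for the next chunk
    torch.nn.init.zeros_(B)             # B = 0, so the merged model's behaviour is unchanged
```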
If you wanted a term which would be less confusing than calling continued pretraining ‘full-tuning’ or ‘fine-tuning’, I would suggest either ‘warmstarting’ or ‘continual learning’. ‘Warmstarting’ is the closest term, I think: you take a ‘fully’ trained model, and then you train it again to the extent of a ‘fully’ trained model, possibly on the same dataset, but just as often, on a new but similarish dataset.