I suspect fine-tuning specialized models is just squeezing out a bit more performance in a particular direction, and not nearly as useful as developing the next-gen model. Complex reasoning takes more steps and tighter coherence among them (the o1 models are a step in this direction). You can try to get a toddler to study philosophy, but it won't really work until their brain matures more.
For raw IQ, sure. I just mean “conversational flavor”.
If system prompts aren't enough but fine-tuning is, this should be doable with different adapters that can be loaded at inference time, without needing to distill into separate models.
Yes, I agree that's an alternative. Then you'd need the primary model to be less heavily RLHF'd and less narrowly focused. A more raw model should be capable, with an adapter, of expressing a wider variety of behaviors.
I still think that distilling down from specialized large teacher models would likely give the best result, but that's just a hunch.
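
For concreteness, here's a minimal sketch of the adapter-swapping idea from a few comments up, assuming Hugging Face's peft library; the base model and adapter paths are hypothetical placeholders, not real checkpoints:

    # Sketch: swapping LoRA adapters on one shared base model at inference
    # time with Hugging Face peft. "base-model" and the "adapters/..."
    # paths are hypothetical placeholders.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    base = AutoModelForCausalLM.from_pretrained("base-model")
    tokenizer = AutoTokenizer.from_pretrained("base-model")

    # Base weights load once; each adapter adds only small LoRA deltas.
    model = PeftModel.from_pretrained(base, "adapters/socratic", adapter_name="socratic")
    model.load_adapter("adapters/casual", adapter_name="casual")

    def generate(prompt: str, flavor: str) -> str:
        model.set_adapter(flavor)  # pick the conversational flavor per request
        inputs = tokenizer(prompt, return_tensors="pt")
        out = model.generate(**inputs, max_new_tokens=100)
        return tokenizer.decode(out[0], skip_special_tokens=True)

    print(generate("What is the good life?", flavor="socratic"))

Since every flavor shares one set of base weights, serving several of them this way costs far less memory than hosting separately distilled models.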