Interesting automated AI safety R&D demo:
"In this release:
We propose and run an LLM-driven discovery process to synthesize novel preference optimization algorithms.
We use this pipeline to discover multiple high-performing preference optimization losses. One such loss, which we call Discovered Preference Optimization (DiscoPOP), achieves state-of-the-art performance across multiple held-out evaluation tasks, outperforming Direct Preference Optimization (DPO) and other existing methods.
We provide an initial analysis of DiscoPOP, to discover surprising and counterintuitive features.
We open-source not only the tuned model checkpoints and discovered objective functions but also the codebase for running the discovery process itself on GitHub and HuggingFace."
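For context on what a "preference optimization loss" means here: the DPO baseline that the discovered objectives are compared against maps the policy's and reference model's chosen-vs-rejected log-probability ratios to a scalar loss. A minimal PyTorch sketch (the function and argument names are mine, not from the released code):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization: -log sigmoid of the scaled gap between
    the policy's and the reference model's chosen/rejected log-ratios."""
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    logits = pi_logratios - ref_logratios
    return -F.logsigmoid(beta * logits).mean()
```

The paper describes DiscoPOP as keeping this interface but replacing the -logsigmoid mapping with a discovered function (an adaptive blend of logistic and exponential losses); see the arXiv link below for the exact form.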
"Our general LLM-driven discovery approach can be grouped into three steps:
First, we prompt an LLM with an initial task and problem description. We optionally add examples or previous evaluations and their recorded performance to the initial prompt.
The LLM is tasked with outputting a hypothesis, a name for the new method, and a code implementation of their method. We use the code in an inner loop training run and store the downstream performance of interest.
Finally, we update the context of the LLM with the results from the newly evaluated performance."
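The three steps above amount to an outer loop wrapped around an ordinary training run. A hypothetical sketch of that loop, with llm_propose, train_with_loss, and evaluate standing in for the LLM call, the inner training run, and the downstream benchmark (placeholder names, not the released API):

```python
def discovery_loop(llm_propose, train_with_loss, evaluate,
                   task_description, n_generations=20):
    """Outer discovery loop: propose a loss, train with it, score it,
    and feed the result back into the LLM's context."""
    history = []  # (method_name, loss_code, score) tuples shown to the LLM
    for _ in range(n_generations):
        # Step 1: prompt the LLM with the task plus previously evaluated candidates.
        proposal = llm_propose(task_description, history)
        # Step 2: inner loop -- train with the candidate loss, measure downstream performance.
        model = train_with_loss(proposal["code"])
        score = evaluate(model)
        # Step 3: update the LLM's context with the new result.
        history.append((proposal["name"], proposal["code"], score))
    # Return the best-performing discovered objective.
    return max(history, key=lambda record: record[2])
```

Note that, per step three above, feedback to the LLM flows only through the prompt context; each new proposal is conditioned on the names, code, and scores of previously evaluated candidates.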
https://sakana.ai/llm-squared/
https://arxiv.org/abs/2406.08414