I’ve created an ensemble model that combines multiple LLMs and employs techniques like multi-step reasoning; I’d argue it represents the real current state of the art in LLMs. It substantially outperforms the highest-scoring individual models and subjectively feels smarter:
MMLU-Pro 0-shot CoT: 78.2 vs 75.6 for GPT-4o
NYT Connections, 436 questions: 34.9 vs 26.5 for GPT-4o
GPQA 0-shot CoT: 56.0 vs 52.5 for Claude 3.5 Sonnet
I might make it publicly accessible if there’s enough interest. Of course, there are expected tradeoffs: it’s slower and more expensive to run.
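To give a rough idea of the general approach, here’s a minimal sketch of one way to ensemble several models with chain-of-thought prompting and majority voting over final answers. This is illustrative only, not my production pipeline: the `ensemble_answer` helper, the answer-extraction step, and the stub “models” are all placeholders standing in for real API calls.

```python
from collections import Counter
from typing import Callable, List

def ensemble_answer(
    question: str,
    models: List[Callable[[str], str]],  # each callable wraps one LLM API call
    extract: Callable[[str], str],       # pulls the final answer out of a CoT transcript
) -> str:
    """Ask every model to reason step by step, then majority-vote on the
    extracted final answers (self-consistency across models)."""
    cot_prompt = (
        f"{question}\n\n"
        "Think step by step, then give your final answer on the last line."
    )
    answers = [extract(model(cot_prompt)) for model in models]
    # Majority vote; ties resolved by first-seen order.
    return Counter(answers).most_common(1)[0][0]

if __name__ == "__main__":
    # Stub "models" standing in for real LLM calls, purely for demonstration.
    stubs = [
        lambda p: "Reasoning...\nFinal answer: 42",
        lambda p: "Reasoning...\nFinal answer: 42",
        lambda p: "Reasoning...\nFinal answer: 41",
    ]
    last_line = lambda text: text.strip().splitlines()[-1]
    print(ensemble_answer("What is 6 * 7?", stubs, extract=last_line))
```

The tradeoffs mentioned above fall directly out of this structure: every question costs several model calls plus longer chain-of-thought completions, so latency and spend scale with the number of ensemble members.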