Grok-2 and Llama-3-405b seem better at opinion/essay writing (with Llama able to decide to be brief when appropriate), while Claude-3.5-Sonnet thinks most clearly (closely keeping track of the intended plausible position). GPT-4o doesn’t seem exceptional on any relevant dimension. Gemini-1.5-Pro can occasionally notice some non-trivial implication that all the others would miss. Weaker chatbots are even more pointlessly verbose and non-committal, so I expect this problem will be gone in a year, as the strengths of the current frontier models become robust with additional scale.