Ann comments on Surprising LLM reasoning failures make me think we still need qualitative breakthroughs for AGI

Ann 16 Apr 2025 16:03 UTC
3 points
DeepSeek-R1 is currently the best model at creative writing as judged by Sonnet 3.7 (https://eqbench.com/creative_writing.html). This doesn’t necessarily correlate with human preferences, including coherence preferences. Having interacted with DeepSeek-V3 (original flavor), DeepSeek-R1-Zero, and DeepSeek-R1, though, I personally think R1’s unique flavor in creative outputs slipped in when the thinking process got RL’d for legibility. This isn’t a particularly intuitive way to solve for creative writing with reasoning capability, but it gestures at the potential in “solving for writing”, given that some feedback on writing style (even orthogonal feedback) seems to have a significant impact on creative tasks.

Edit: Another (cheaper to run) comparison for creative capability in reasoning models is QwQ-32B vs Qwen2.5-32B (the base model) and Qwen2.5-32B-Instruct (the original instruct tune; it is not clear whether it is in the ancestry of QwQ). Basically, I do not currently consider Sonnet 3.7 a “reasoning” model at the same fundamental level as R1 or QwQ, even though it has learned to make use of reasoning better than it would have without training on it, so evidence from it about reasoning models is weaker.
- lemon10 17 Apr 2025 20:00 UTC
  4 points
  >DeepSeek-R1 is currently the best model at creative writing as judged by Sonnet 3.7 (https://eqbench.com/creative_writing.html). This doesn’t necessarily correlate with human preferences, including coherence preferences.
  It should be noted that “best at creative writing” is very different from “best at multi-turn writing and roleplaying in collaboration with humans”. I haven’t used R1 since its first major version (maybe it’s gotten better?), but it had some massive issues with instruction following, resulting in laser-focusing on irrelevant minor details (What’s that? The character has anger issues? Better write them breaking or damaging something literally every reply) and generally being extremely hard to guide into actually writing what you want.
  So in theory, sure, it’s great at writing stories (and it is; it has a very unique voice compared to other AI). But in multi-turn discussions (most practical uses, such as using it to help you write a story), getting it to follow the spirit of the prompt and write in line with what you want feels like pulling teeth.