During our evaluations we noticed that Claude 3.7 Sonnet occasionally resorts to special-casing in order to pass test cases in agentic coding environments like Claude Code. Most often this takes the form of directly returning expected test values rather than implementing general solutions, but also includes modifying the problematic tests themselves to match the code’s output.
These behaviors typically emerge after multiple failed attempts to develop a general solution, particularly when:
• The model struggles to devise a comprehensive solution
• Test cases present conflicting requirements
• Edge cases prove difficult to resolve within a general framework
The model typically follows a pattern of first attempting multiple general solutions, running tests, observing failures, and debugging. After repeated failures, it sometimes implements special cases for problematic tests.
When adding such special cases, the model often (though not always) includes explicit comments indicating the special-casing (e.g., “# special case for test XYZ”).
Hey I do this too!
Isn’t it normal in startup world to make bets and not make money for many years? I am not familiar with the field so I don’t have intuitions for how much money/how many years would make sense, so I don’t know if OpenAI is doing something normal, or something wild.