Someone will create an agent that gets 80%+ on SWE-bench within six months.
I think 80% is probably above the effective cap on the current implementation of SWE-bench (where the agent can’t see the test cases), because the test cases are often specific to the reference implementation.
E.g. the test cases assume that a given method has a particular name even though the task description doesn’t specify one.
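
To make the naming problem concrete, here is a toy sketch. It is not taken from any real SWE-bench task: the class names, method names, and the `hidden_test` function are all made up for illustration. The assumed task text only says "add a helper that collapses runs of whitespace" and never says what the helper should be called.

```python
# Hypothetical setup: nothing here comes from a real SWE-bench instance.

class ReferenceCleaner:
    """Stand-in for the maintainers' patch; they happened to pick `squeeze`."""

    def squeeze(self, text: str) -> str:
        return " ".join(text.split())


class AgentCleaner:
    """Stand-in for an agent's patch: same behavior, different method name."""

    def collapse_whitespace(self, text: str) -> str:
        return " ".join(text.split())


def hidden_test(cleaner) -> bool:
    """Stand-in for a held-out test written against the reference patch."""
    try:
        return cleaner.squeeze("a   b \n c") == "a b c"
    except AttributeError:
        # The method name the test expects does not exist on this patch.
        return False


reference_passes = hidden_test(ReferenceCleaner())
agent_passes = hidden_test(AgentCleaner())
behavior_matches = AgentCleaner().collapse_whitespace("a   b \n c") == "a b c"

print(reference_passes, agent_passes, behavior_matches)
```

In this sketch the agent's patch does what the task asked, but it still counts as a failure because the test looks for a name the agent had no way to know.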