The Natural Plan paper has an insane number of errors in it. Reading it feels like I’m going crazy.
This meeting planning task seems unsolvable:
The solution requires traveling from SOMA to Nob Hill in 10 minutes, but the text doesn’t mention the travel time between SOMA and Nob Hill. Also, the solution doesn’t mention meeting Andrew at all, even though that was part of the requirements.
Here’s an example of the trip planning task:
The trip is supposed to be 14 days, but requires visiting Bucharest for 5 days, London for 4 days, and Reykjavik for 7 days, which adds up to 16. I guess the point is that a travel day counts toward both cities, but that doesn’t match an intuitive understanding of what it means to “spend N days” in a city. Also, by that logic you could spend a total of 28 days in different cities by commuting every day, which contradicts the authors’ claim that each problem has only one solution.
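For what it’s worth, here is a quick sanity check of that arithmetic. It is only a sketch under my own assumptions: each city is visited exactly once, and a flight day counts toward both the departure and the arrival city. The variable names are mine, not the paper’s.

```python
# Stay lengths and trip length from the example task quoted above.
stays = {"Bucharest": 5, "London": 4, "Reykjavik": 7}
trip_days = 14

# If every day belongs to exactly one city, the stays do not fit in the trip.
naive_total = sum(stays.values())

# Assumption: each city is visited once, so there is one flight between
# consecutive cities, and each flight day is counted for two cities.
transfers = len(stays) - 1
overlap_total = naive_total - transfers

# Under the same counting rule, flying every single day would credit
# two cities per calendar day.
max_city_days = 2 * trip_days

print(naive_total, overlap_total, max_city_days)
```

Under the double-counting reading the numbers do line up with the 14-day trip, but the same rule also allows the 28 city-days mentioned above, which is why the “one solution” claim looks shaky to me.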
Thanks for evaluating it in detail. I assumed that they at least hadn’t screwed up the problems! Editing the piece to note that the paper has problems.
Disappointingly, a significant number of existing benchmarks & evals have problems like that IIRC.
Thanks for writing this post!