Most of the NLP (natural language processing, not the other NLP) research I do is loosely validated by a shared task and competitive performance on held-out test data. There is not much chance that a bug leads to higher task accuracy (usually agreement with human judgments). But it’s true that if you have a great idea that could lead to a huge improvement, but it only shows a small one because your implementation has bugs, you may conclude the idea wasn’t really that effective and never hunt for the bugs. Whereas if performance is actually worse than before the change, and you know the change to be a good idea, you will hunt very hard for the bugs.
I suppose that wherever progress is most incremental (everyone is using the same basic systems, with the novel part being some small addition), there’s also a real risk of bad or buggy small changes being published because they happen, by pure chance, to give slightly higher accuracy on the standard test data. But such random-noise additions will in fact do worse on any different test set, and there is some chance of detecting them with significance tests.
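To make that concrete, here is a minimal sketch of the kind of paired significance test I mean: an approximate-randomization (permutation) test over two systems’ predictions on the same test set. The function and argument names are just illustrative, not from any particular toolkit.

```python
import random

def paired_permutation_test(gold, sys_a, sys_b, trials=10_000, seed=0):
    """Two-sided p-value for the accuracy difference between two systems
    scored against the same gold labels (approximate randomization)."""
    rng = random.Random(seed)
    n = len(gold)

    def acc_diff(a, b):
        correct_a = sum(x == g for x, g in zip(a, gold))
        correct_b = sum(x == g for x, g in zip(b, gold))
        return (correct_a - correct_b) / n

    observed = abs(acc_diff(sys_a, sys_b))
    exceed = 0
    for _ in range(trials):
        # Under the null hypothesis the two systems are interchangeable,
        # so randomly swap their outputs on each example and re-score.
        shuf_a, shuf_b = [], []
        for a, b in zip(sys_a, sys_b):
            if rng.random() < 0.5:
                a, b = b, a
            shuf_a.append(a)
            shuf_b.append(b)
        if abs(acc_diff(shuf_a, shuf_b)) >= observed:
            exceed += 1
    return (exceed + 1) / (trials + 1)
```

If the p-value comes out large, the apparent gain on that particular test set could easily be noise.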
I guess it is a problem for the field that new test sets (from a different domain, or simply freshly generated) aren’t automatically used to score all published past systems, which would detect that effect. This should in principle be possible if open-source implementations were required for published results.