I would start with a dataset of reasoning errors. Generate 100,000 texts with GPT-2, put them on Mechanical Turk so workers can mark the reasoning errors, and then train another neural net on this dataset to find logical and other types of errors.
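As a rough illustration of the last step, here is a minimal sketch of training an error detector on labeled texts. Everything in it is an assumption for illustration: the four toy sentences stand in for crowd-labeled GPT-2 outputs, and a bag-of-words logistic regression (effectively a one-layer network) stands in for the "other neural net", which in practice would likely be a much larger model.

```python
import math
from collections import Counter

# Hypothetical toy data standing in for crowd-labeled GPT-2 outputs.
# Label 1 means annotators marked a reasoning error, 0 means no error.
LABELED = [
    ("all birds fly so penguins fly", 1),
    ("it rained therefore the ground is wet", 0),
    ("he is rich so he must be right", 1),
    ("two is even so two is divisible by two", 0),
]

def featurize(text):
    """Bag-of-words token counts (a deliberately crude representation)."""
    return Counter(text.lower().split())

def predict_proba(weights, bias, feats):
    """Probability that the text contains a reasoning error."""
    z = bias + sum(weights.get(tok, 0.0) * n for tok, n in feats.items())
    return 1.0 / (1.0 + math.exp(-z))

def train(examples, epochs=300, lr=0.1):
    """Plain stochastic gradient descent on the logistic loss."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for text, label in examples:
            feats = featurize(text)
            err = predict_proba(weights, bias, feats) - label
            bias -= lr * err
            for tok, n in feats.items():
                weights[tok] = weights.get(tok, 0.0) - lr * err * n
    return weights, bias

weights, bias = train(LABELED)
# Predictions on the training texts only; this says nothing about generalization.
train_preds = [
    int(predict_proba(weights, bias, featurize(text)) > 0.5)
    for text, _ in LABELED
]
```

A word-count model like this can only memorize surface cues, so it mainly shows the shape of the pipeline: labeled texts in, a classifier that flags suspected errors out.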