I have only very limited knowledge in this area, so I could be misreading you. But doesn’t “in training data sets” mean that the process had been developed using that specific data? That could mean that you have a program really good at reconstructing that piece of mouse brain, but not at reconstructing mouse brain in general. We had this problem in the last research project I worked on, where we’d use a gene expression data set to predict mood in bipolar subjects. We had to test the predictions on a separate data set from the one used in development to make sure it wasn’t overfit to the training data. Is the same thing the case for your work, or am I misunderstanding your use of “training data”?
It is a good insight to notice that this is a potential problem, generally referred to as generalization error. If you train a classifier or compute a regression on some data, there is always a chance that it will perform poorly on new data because of larger-scale patterns that were poorly represented in the training data.
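For intuition, here is a minimal sketch in Python with scikit-learn (the data and model are made up for illustration, not taken from the connectomics or gene-expression work): an unconstrained model can score almost perfectly on the data it was trained on and still drop sharply on data it has never seen, which is exactly the gap you are worried about.

```python
# Hypothetical demonstration of generalization error: near-perfect training
# accuracy with noticeably worse accuracy on held-out data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))                               # 200 samples, 20 noisy features
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)   # weak signal in one feature

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# An unconstrained decision tree effectively memorizes its training set.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

print("train accuracy:", model.score(X_train, y_train))   # ~1.0
print("test accuracy: ", model.score(X_test, y_test))     # noticeably lower
```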
However, the scientists performing this work are also aware of this. It is part of why algorithmic learning theory, and the machine learning methods built on it, are so successful: you can derive tight bounds on generalization error. The process you refer to with the gene expression data (testing on additional labeled data to check that you are not overfitting and that your parameters give good predictive power) is called cross-validation, and it's definitely a huge part of the connectomics project.
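Concretely, k-fold cross-validation splits the labeled data into k folds and lets each fold take a turn as the held-out set while the model is fit on the rest. A small sketch (again with made-up data and a stand-in model, not the actual pipelines being discussed):

```python
# Hypothetical 5-fold cross-validation: each fold is scored by a model
# trained only on the other four folds.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 20))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("per-fold accuracy:", np.round(scores, 3))
print("mean accuracy:    ", scores.mean())
```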
You might enjoy this paper by Leo Breiman, which talks about this exact distinction between merely fitting data vs. algorithmic data analysis. Many statisticians are still stuck believing that it is good to assume underlying analytic models for nature and then use goodness-of-fit tests to determine which underlying models are best. This is a categorically bad way to analyze data except in some special cases. Algorithmic data analysis instead uses cross-validation to measure accuracy and seeks to model the data formation process algorithmically rather than generatively.
Most computer scientists are not even aware of this distinction because the algorithmic approach (usually through machine learning) is the only one they have ever been taught.
Thanks for the response and the paper link. I’m confident that the connectomics project does use cross-validation. I’m just wondering, is the 95+% accuracy you mentioned on the training data or the test data?
It is from cross-validation: the training data is used to build the procedure, which is then applied to test data that was kept separate from the data used for training.
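In sketch form, the protocol looks like this (hypothetical data and model again, just to show the separation): cross-validation happens only inside the training portion, and the test set is touched exactly once at the end.

```python
# Hypothetical protocol: tune with cross-validation on the training data,
# then report accuracy on a test set kept out of development entirely.
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 20))
y = (X[:, :3].sum(axis=1) + 0.5 * rng.normal(size=300) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Model selection via cross-validation, using the training data only.
search = GridSearchCV(SVC(), {"C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

print("cross-validated accuracy (training data):", search.best_score_)
print("accuracy on held-out test data:          ", search.score(X_test, y_test))
```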
I see. Good for them! Thanks for the info.