Good question.
I’m not an expert in machine learning either, but here is what I meant.
If you’re running an algorithm such as linear or logistic regression, there are two relevant dimensions: the number of data points, and the number of features (i.e., the number of parameters). In the design matrix of the regression, the data points are the rows and the features/parameters are the columns.
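To make the shapes concrete, here is a minimal sketch in Python with NumPy; the sizes and variable names are made up purely for illustration:

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration: many more rows than columns.
n_points = 10_000   # number of data points  -> rows of the design matrix
n_features = 100    # number of features/parameters -> columns

rng = np.random.default_rng(0)
X = rng.normal(size=(n_points, n_features))  # the design matrix
y = rng.normal(size=n_points)                # the targets

print(X.shape)  # (10000, 100): rows = data points, columns = features
```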
Holding the number of parameters constant, it’s true that once you increase the number of data points beyond a certain amount, you can get most of the value through subsampling. And even if not, more data points are not such a big issue.
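A rough sketch of the subsampling point, on synthetic data with made-up sizes: when the number of points is much larger than the number of features, an ordinary least-squares fit on a 10% random subsample comes out very close to the full-data fit.

```python
import numpy as np

rng = np.random.default_rng(1)
n_points, n_features = 100_000, 50
X = rng.normal(size=(n_points, n_features))
true_beta = rng.normal(size=n_features)
y = X @ true_beta + rng.normal(scale=0.5, size=n_points)

# Least-squares fit on the full data set.
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)

# Least-squares fit on a 10% random subsample of the rows.
idx = rng.choice(n_points, size=n_points // 10, replace=False)
beta_sub, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)

# With n much larger than p, the two coefficient estimates are very close.
print(np.max(np.abs(beta_full - beta_sub)))
```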
But the main advantage of having more data is lost if you keep using the same (small) number of features. Generally, when you have more data, you’d use that additional data to fit a model with more features. The number of features would still be less than the number of data points; I’d say that in many cases it’s about 1% of the number of data points.
Of course, you could still use the model with the smaller number of features. In that case, you’re just not putting the new data to much good use, which is fine, but not an effective use of the enlarged data set. (There may be cases where, even with more data, adding more features is of no use, because the model has already reached the limits of its predictive power.)
For linear regression, solving the problem exactly via the normal equations takes time that is cubic in the number of parameters (if you use the naive inverse). Although matrix inversion can in principle be done faster than cubic, it can’t be faster than quadratic, which is a general lower bound. Iterative algorithms aren’t quite cubic, but they’re still more than linear.
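For reference, here is a minimal normal-equations sketch on a tiny made-up example. Forming the Gram matrix XᵀX costs on the order of n·p², and solving the resulting p×p system costs on the order of p³ with a standard dense factorization, which is where the cubic dependence on the number of parameters comes from.

```python
import numpy as np

def normal_equations_fit(X, y):
    """Exact least-squares fit via the normal equations (X^T X) beta = X^T y.

    Forming X.T @ X is O(n * p^2); solving the p-by-p system is O(p^3)
    with a dense factorization, hence cubic in the number of parameters.
    Using solve() avoids explicitly inverting the matrix.
    """
    gram = X.T @ X   # p x p Gram matrix
    rhs = X.T @ y    # p-dimensional right-hand side
    return np.linalg.solve(gram, rhs)

# Tiny made-up example, for illustration only.
rng = np.random.default_rng(2)
X = rng.normal(size=(1_000, 20))
y = X @ rng.normal(size=20) + rng.normal(scale=0.1, size=1_000)
beta_hat = normal_equations_fit(X, y)
print(beta_hat.shape)  # (20,): one coefficient per feature
```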
That makes sense. And based on what I’ve seen, having more data to feed into your model really is a pretty big asset when it comes to machine learning (I think I’ve seen this article referenced).