In my bioinformatics studies we did PCA in statistics 102 and I got the impression that it’s a commonly used statistical tool. Do you think people generally do PCA wrong, or is your complaint that it’s not used enough?
I know that many researchers know something about PCA. I do think that it’s not applied nearly enough (cf. Sarah’s remarks about Asperger’s Syndrome, which was removed from the DSM a few years after she made her post). The main issue to my mind is that when people apply it in psychology they seem to come into it with preconceived notions about what they might find, rather than collecting large and diverse datasets, letting the data speak for themselves, and then trying to interpret what the principal components mean in human terms.
Consider the construct of conscientiousness. It’s very suspicious that it maps onto a preexisting notion, and it’s just not that predictive. I got lots of C’s and D’s in school, but worked 90 hours a week for 12 weeks on my speed dating project. Am I conscientious? ;-) As far as I can tell, they came up with questions based on preconceived notions, then did factor analysis, and came up with a construct that has some meaning, while being very far from carving reality at its joints.
The DSM is a mess, but I think the problem isn’t that there aren’t people who understand PCA.
There are political reasons inside of the American Psychiatric Association that led it to use definitions that aren’t data driven.
It seems to me that to make major contributions to human knowledge you need to do a lot more than say: “Hey PCA is really great”. You actually have to understand the reasons why people aren’t using it and fix those reasons.
PCA is over a hundred years old, and a lot of people have thought that it should be used more. I think I have argued for PCR at various times in the past. The last time was when talking about the design of http://www.omnilibrium.com/ and how it should find factors for political labels via PCR instead of just using the left-right framework. I think I made the same argument for LW census political labels.
It’s very suspicious that it maps onto a preexisting notion, and it’s just not that predictive. I got lots of C’s and D’s in school, but worked 90 hours a week for 12 weeks on my speed dating project.
You say that it’s not predictive for the preexisting notion. That doesn’t mean that various things haven’t been predicted with it. Big Five ratings have been predicted by analysing Facebook posts.
It seems to me that to make major contributions to human knowledge you need to do a lot more than say: “Hey PCA is really great”. You actually have to understand the reasons why people aren’t using it and fix those reasons.
Have you read my speed dating project posts? I haven’t yet written up the most important one on demographics (I can do that soon, just many conflicting priorities), but the one on individual variation in revealed preferences for attractiveness vs intelligence and sincerity starts to get at what I’m talking about.
My project gives a proof of concept for what I’m talking about in the context of social psychology. I’ve never seen such an application. So no, it’s not just the realization that it could be applied, it’s also giving a proof of concept: that’s why it took ~1500 hours rather than ~10 hours.
As far as I can tell, the situation is simply that deep knowledge of the technique hasn’t yet percolated into the social psychology community, and people who do have the relevant background knowledge haven’t actually tried doing social psychology research. All you need is to notice something that’s been missed. There are many such things (see Peter Thiel’s discussion, in his book “Zero to One,” of how there are still secrets).
If I recall correctly, Freeman Dyson has indicated that his demonstration of the equivalence of the two different formulations of quantum electrodynamics isn’t as amazing as people believe, but was largely a function of him being one of the first people to learn both formulations! :-)
So I’d strongly encourage you to pursue your ideas more. I’ve been looking some at the General Social Survey data, where I haven’t yet found something highly nontrivial (maybe I’m looking at the data the wrong way, or maybe it’s just not a good dataset for this). I’d be happy to share my code with you / a cleaned form of the data, if you’re interested in exploring factors for political labels.
It might be that I have gotten too cynical, but if you measure 6 variables it’s more likely that one of them gets a statistically significant result than if you first turn those 6 variables into 2 variables via PCA.
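To put rough numbers on that comparison, here is a minimal sketch. It assumes independent tests at a 0.05 significance level with no real effect present; both are simplifying assumptions for illustration (variables in one dataset are usually correlated), and the function name is hypothetical.

```python
def familywise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across independent tests.

    Assumes every null hypothesis is true and the tests are independent,
    which is an idealization rather than a property of any real dataset.
    """
    return 1.0 - (1.0 - alpha) ** num_tests


# Compare testing 6 raw variables with testing 2 principal components.
rate_six = familywise_error_rate(6)
rate_two = familywise_error_rate(2)
print(f"6 tests: {rate_six:.1%}, 2 tests: {rate_two:.1%}")
```

Under these assumptions, testing 6 variables gives roughly a 26% chance of at least one spurious "significant" result, against roughly 10% for 2 components.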
My project gives a proof of concept for what I’m talking about in the context of social psychology. I’ve never seen such an application. So no, it’s not just the realization that it could be applied, it’s also giving a proof of concept: that’s why it took ~1500 hours rather than ~10 hours.
That’s probably where there’s something I don’t understand. I don’t understand why the analysis took ~1500 hours. Spending that much time with a dataset also instinctively triggers “fishing expedition” in my head. I don’t know to what extent that’s warranted.
I’m not sure you have shown that it makes more sense to interpret that individual-preference factor as being about intelligence and sincerity than as being about the value of fun.
As far as I can see, it could also be that fun and physical attractiveness are simply more valued.
So I’d strongly encourage you to pursue your ideas more. I’ve been looking some at the General Social Survey data, where I haven’t yet found something highly nontrivial (maybe I’m looking at the data the wrong way, or maybe it’s just not a good dataset for this). I’d be happy to share my code with you / a cleaned form of the data, if you’re interested in exploring factors for political labels.
In the case of spending effort on the GSS, I can’t envision what success looks like. It’s straightforward to find PCR factors but I don’t know how to put them to good use.
A more interesting project would be to explore LW’s ideological landscape.
I would be very interested in how various rationalist beliefs interact with each other.
Does seeing yourself as an “aspiring rationalist” correlate with beliefs on UFAI risk?
A project that searches for the main dimensions of disagreement in this community would be valuable. Maybe 300 questions that are answered on a Likert scale. Maybe 150 rationality questions, 100 Big 5 questions and 50 autism questions.
It might be that I have gotten too cynical, but if you measure 6 variables it’s more likely that one of them gets a statistically significant result than if you first turn those 6 variables into 2 variables via PCA.
That’s probably where there’s something I don’t understand. I don’t understand why the analysis took ~1500 hours. Spending that much time with a dataset also instinctively triggers “fishing expedition” in my head. I don’t know to what extent that’s warranted.
The issue of multiple hypothesis testing is precisely why it took 1500 hours :-). I was dealing with the general question “how can you find the most interesting generalizable patterns in a human interpretable data set?” It’ll take me a long time to externalize what I learned.
For now I’ll just remark that dimensionality reduction reduces concerns around multiple hypothesis testing. If you have a cluster of variables A and a cluster of variables B and you suspect that there’s some relationship between the variables in A and the variables in B, you can do PCA on the two clusters separately, then look at correlations between the first few principal components rather than looking at all pairwise correlations between variables in A and variables in B.
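The procedure described above can be sketched roughly as follows. This is an illustration, not the code used in the speed dating project: the data is synthetic, the function names are hypothetical, and PCA is done via NumPy's SVD on standardized columns (one of several reasonable choices).

```python
import numpy as np


def first_components(data: np.ndarray, n_components: int) -> np.ndarray:
    """Scores on the first n_components principal components.

    Columns are standardized first, so this is PCA on the correlation matrix.
    """
    z = (data - data.mean(axis=0)) / data.std(axis=0)
    # Rows of vt are the principal axes, ordered by explained variance.
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    return z @ vt[:n_components].T


def component_correlations(a: np.ndarray, b: np.ndarray, n_components: int = 2) -> np.ndarray:
    """Correlations between the leading components of cluster A and cluster B."""
    pcs_a = first_components(a, n_components)
    pcs_b = first_components(b, n_components)
    # corrcoef stacks the columns of both inputs; the off-diagonal block
    # holds the A-versus-B correlations.
    full = np.corrcoef(pcs_a, pcs_b, rowvar=False)
    return full[:n_components, n_components:]


# Synthetic example: one shared latent trait drives both clusters (an assumption).
rng = np.random.default_rng(0)
n = 500
latent = rng.normal(size=n)
cluster_a = latent[:, None] + rng.normal(size=(n, 6))  # 6 variables in A
cluster_b = latent[:, None] + rng.normal(size=(n, 5))  # 5 variables in B

corr = component_correlations(cluster_a, cluster_b)
print(corr.shape)  # (2, 2): 4 comparisons instead of 6 * 5 = 30
```

The sign of each principal component is arbitrary, so only the magnitude of each correlation is meaningful here.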
A more interesting project would be to explore LW’s ideological landscape. I would be very interested in how various rationalist beliefs interact with each other. Does seeing yourself as an “aspiring rationalist” correlate with beliefs on UFAI risk?
There is the 2014 LW survey data, which is interesting, even if less substantive than what you have in mind. I have an unfinished project that I’m doing with it (got bogged down in cleaning it to make it nicely readable).
Consider the construct of conscientiousness. It’s very suspicious that it maps onto a preexisting notion...
Is it? We’ve been modeling each other as long as language has existed. Conscientiousness might not correspond to a single well-defined causal system in the brain, but it would be no surprise to me at all to find common words in most languages for close empirical clusters in personality-space. And the Big 5 factors are very much empirical constructs, not causal.
Ok, I guess what I mean is that it’s suspicious that it maps onto a preexisting notion held by the general population, in the same way that it would be suspicious for psychology research to apparently show the existence of demon possession (which humans have in fact believed in). I wouldn’t find it suspicious if it mapped onto a notion of someone with demonstrated exceptional ability to read and connect with people (e.g. Bill Clinton).
The way scientific progress occurs is by developing progressively more refined understandings of what’s going on: for example, passing from the Ptolemaic model of the stars and planets to the Copernican model to the Newtonian model to Einstein’s theory of general relativity. One can’t hope to understand reality if one isn’t flexible enough to recognize that things might be very different from how they initially appear.
Yes, this is the point :-)