There’s a system (I think maintained by NASA) called AutoClass, which is fairly easy to use. As I understand it, it takes input “points” (which here would be people) and outputs clusters of similar points (people).
In order to predict using AutoClass, I think you would model unanswered questions as “missing values”, and then predict based on observed frequencies from the same cluster.
There’s some ad-hoc-ish-ness about the way AutoClass decides how many clusters there should be, but it’s a solid, existing technology that has been used successfully in many applications.
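To make the idea concrete, here’s a rough sketch of that cluster-then-predict recipe in Python. It doesn’t use AutoClass itself (I won’t try to reproduce its input/output formats from memory); it just illustrates the general approach: cluster people on the questions they did answer, then fill in a missing answer using the observed answers within that person’s cluster. The data and numbers are made up.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy opinion matrix: rows are people, columns are questions.
# Answers are +1 (agree), -1 (disagree), and np.nan for "not answered yet".
opinions = np.array([
    [ 1,  1, -1, np.nan],
    [ 1,  1, -1,  1    ],
    [-1, -1,  1, -1    ],
    [-1, -1,  1, np.nan],
    [ 1, -1, -1,  1    ],
], dtype=float)

# Crude missing-value handling so a generic clusterer can run: replace NaNs
# with the column mean. (AutoClass handles missing values natively.)
col_means = np.nanmean(opinions, axis=0)
filled = np.where(np.isnan(opinions), col_means, opinions)

# Cluster people into groups with similar answer patterns.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(filled)

def predict(person, question):
    """Predict a missing answer from observed frequencies in the person's cluster."""
    peers = (labels == labels[person])
    peer_answers = opinions[peers, question]
    peer_answers = peer_answers[~np.isnan(peer_answers)]
    if peer_answers.size == 0:
        return None  # nobody in the cluster has answered this question
    # Majority answer among cluster peers (ties fall to "agree" here).
    return 1.0 if np.mean(peer_answers) >= 0 else -1.0

print(predict(0, 3))  # guess person 0's answer to question 3
```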
If a collaborative filter algorithm is accurate, that’s all that really matters to the consumers of the algorithm. It’s primarily the designers of the algorithm who care about the scientific basis as to why the algorithm works.
A decent overview of the various CF algorithms:
http://www.hindawi.com/journals/aai/2009/421425.html
I find it both amusing and disturbing that pioneers in this field have been busy optimizing guesses about movie preferences (recall the famous $1 million Netflix Prize), when we could use these techniques to predict stuff that IMHO “really matters”. It’s yet another interesting data point about what people really care about.
Rather, it’s a reminder that more effort is spent on projects that can be immediately profitable, however trivial they may be in scope. If there were a pay market for intellectual content as robust as the one for movies currently is, we’d have seen this done already. (Alas, I don’t see how that could happen in the near future.)
I’ve been considering the question: why has collaborative filtering been applied first to predicting the “trivial” opinions about movies rather than the “important” opinions on political, social, economic, and other issues?
On reflection, I don’t actually think it’s because people care more about the former than the latter. Would you rather have a prediction of your opinion about whether the movie Titanic is good, or of your opinion about whether there’s a housing bubble?
I think the answer is that opinions about products are naturally schematized, and hence easy to collate. Products are already tracked everywhere in databases, so it’s easy to extend that model with opinions about those products. In contrast, opinions about issues, although often even more passionate than opinions about products, are not as naturally schematizable, and hence harder to collate. Even in terms of representing the identity of an issue, it’s not like we have the equivalent of an ISBN for each issue. So opinions about issues are not adequately schematized, and we can’t collate them into the nice big datasets we’d want in order to make predictions. Obviously websites like TakeOnIt are trying to change that. Each question ID is analogous to an ISBN for that issue, if you will.
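To illustrate what “schematized” means here, a minimal sketch of an opinion record keyed by a question ID, in the spirit of TakeOnIt’s question IDs (the field names are my invention, not TakeOnIt’s actual schema):

```python
from dataclasses import dataclass

@dataclass
class Opinion:
    person_id: int     # who holds the opinion
    question_id: int   # plays the role of an "ISBN" for the issue
    stance: int        # e.g. +1 agree, -1 disagree, 0 neutral

# Once opinions are keyed by (person_id, question_id), they collate into
# exactly the kind of person-by-question matrix that collaborative
# filtering expects -- the same shape as a person-by-movie ratings matrix.
opinions = [
    Opinion(person_id=1, question_id=42, stance=+1),
    Opinion(person_id=2, question_id=42, stance=-1),
    Opinion(person_id=1, question_id=7,  stance=-1),
]

matrix = {}  # (person_id, question_id) -> stance
for o in opinions:
    matrix[(o.person_id, o.question_id)] = o.stance
```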
Yes, I agree that opinions about products can help sell products, so predicting opinions about products comes with an immediately obvious monetization strategy. But if money rides on correctly predicting the answer to a question, then the potential to monetize predictions of those opinions is there too.
Well, there’s a subset of questions on TakeOnIt where getting the answer right has a financial reward/impact. An example of such a question was “Is there a housing bubble in the United States?”. These types of questions overlap with the kind seen on prediction markets (which are a nice model for monetizing intellectual content). I’d be curious about the relative accuracy of prediction markets versus collaborative filtering applied to expert predictions.
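One way to settle that curiosity, at least in principle, would be a backtest on questions whose answers are now known: score both kinds of forecast against the realized outcomes, e.g. with a Brier score. A minimal sketch, with entirely made-up numbers standing in for real market prices and CF outputs:

```python
import numpy as np

# Hypothetical backtest: for questions whose outcomes are now known
# (1 = yes, 0 = no), compare two forecasts via Brier score (lower is better).
outcomes = np.array([1, 0, 1, 1, 0], dtype=float)

# Probability of "yes" from a prediction market (made-up prices).
market_prob = np.array([0.80, 0.35, 0.60, 0.70, 0.20])

# Probability of "yes" from collaborative filtering over expert opinions
# (e.g. a weighted vote of experts in a relevant cluster; made-up values).
cf_prob = np.array([0.75, 0.25, 0.70, 0.55, 0.30])

def brier(prob, outcome):
    """Mean squared error between forecast probability and realized outcome."""
    return float(np.mean((prob - outcome) ** 2))

print("prediction market:       ", brier(market_prob, outcomes))
print("collaborative filtering: ", brier(cf_prob, outcomes))
```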