How can you possibly get off the ground if you have no information about any of the Print Shops, much less how many there are? GIGO.
We have minimal information about Print Shops. I wouldn’t say the existing data are garbage, just mostly unquantified.
Have you considered googling for previous work?
Yes, but thanks to you I know the shibboleth of “Bayesian stylometry.” Makes sense, and I’ve already read some books in a similar vein, but there are some problems. Most fundamentally, I have trouble translating the methods to a different type of data: from textual data like word length to the aforementioned Marks. Otherwise, my understanding of most stylometric analysis was that it favors frequentist methods. Can you clear any of this up?
EDIT: I have a follow-up question regarding GIGO: How can you tell what data are garbage? Are the degrees of certainty based on significant digits of measurement, or what?
Most fundamentally, I have trouble translating the methods to a different type of data: from textual data like word length to the aforementioned Marks.
Have to define your features somehow.
Otherwise, my understanding of most stylometric analysis was that it favors frequentist methods.
Really? I was under the opposite impression, that stylometry was, since the ’60s or so with the Bayesian investigation of Mosteller & Wallace into the Federalist papers, one of the areas of triumph for Bayesianism.
I have a follow-up question regarding GIGO: How can you tell what data are garbage? Are the degrees of certainty based on significant digits of measurement, or what?
No, not really. I think I would describe GIGO in this context as ‘data which is equally consistent with all theories’.
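To make "equally consistent with all theories" concrete, here is a minimal sketch in Python. The three shops, the prior, and the Mark probabilities are all invented for illustration; nothing here comes from the actual Print Shop data. The point is only that when a Mark is equally likely under every hypothesis, Bayes' rule hands back the prior unchanged.

```python
def posterior(prior, likelihood):
    """Bayes' rule over a discrete set of hypotheses (here, made-up print shops)."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: p / total for h, p in unnormalized.items()}

# Hypothetical prior over three invented shops.
prior = {"A": 0.5, "B": 0.3, "C": 0.2}

# A Mark that is equally probable whichever shop printed the book:
# "garbage" in the sense above, because it cannot discriminate.
garbage_mark = {"A": 0.1, "B": 0.1, "C": 0.1}

# A Mark that shop B produces far more often than the others: informative.
useful_mark = {"A": 0.05, "B": 0.6, "C": 0.05}

post_garbage = posterior(prior, garbage_mark)  # same as the prior
post_useful = posterior(prior, useful_mark)    # mass shifts toward B
```

So the test is not how many significant digits a measurement has, but whether observing it would change the relative probabilities of the candidate shops at all.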
I have just such a thing, referred to as “Marks.” I haven’t yet included that in the code, because I wanted to explore the viability of the method first. So to retreat to the earlier question, why does my proposal strike you as a GIGO situation?
So to retreat to the earlier question, why does my proposal strike you as a GIGO situation?
You claimed to not know what printers there were, how many there were, and what connection they had to ‘Marks’. In such a situation, what on earth do you think you can infer at all? You have to start somewhere: ‘we have good reason to believe there were not more than 20 printers, and we think the London printer usually messed up the last page. Now, from this we can start constructing these phylogenetic trees indicating the most likely printers for our sample of books...’ There is no view from nowhere, you cannot pick yourself up by your bootstraps, all observation is theory-laden, etc.
This all sounds good to me. In fact, I believe that researchers in the humanities are especially (perhaps overly) sensitive to the reciprocal relationship between theory and observation.
I may have overstated the ignorance of the current situation. The scholarly community has already made some claims connecting the Big Book to Print Shops [x,y,z]. The problem is that those claims are either made on non-quantitative bases (e.g., “This mark seems characteristic of this Print Shop’s status.”) or on a very naive frequentist basis (e.g., “This mark comes up N times, and that’s a big number, so it must be from Print Shop X”). My project would take these existing claims as priors. Is that valid?
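For concreteness, one way "existing claims as priors" could be written down is a Beta-binomial sketch like the following. Everything in it is an assumption made for illustration: the choice of a Beta prior, the pseudo-counts standing in for a qualitative claim, and the observed counts are all invented, and whether this encoding suits the real material is a separate question.

```python
def beta_posterior(prior_a, prior_b, hits, trials):
    """Conjugate update: Beta(a, b) prior plus binomial data gives Beta(a + hits, b + misses)."""
    return prior_a + hits, prior_b + (trials - hits)

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

# Hypothetical encoding of a claim like "this Mark seems characteristic of
# Print Shop X": a belief that about 70% of X's sheets carry the Mark,
# held with the weight of only 10 pseudo-observations.
prior_a, prior_b = 7.0, 3.0

# Invented counts: the Mark appears on 2 of 20 sheets securely tied to shop X.
hits, trials = 2, 20

post_a, post_b = beta_posterior(prior_a, prior_b, hits, trials)
prior_mean = beta_mean(prior_a, prior_b)  # what the claim alone implies
post_mean = beta_mean(post_a, post_b)     # claim tempered by the counts
raw_frequency = hits / trials             # the "naive" estimate, ignoring the claim
```

The posterior mean lands between the raw frequency and the prior mean, and how far it moves depends on how many pseudo-observations the qualitative claim is judged to be worth.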
I have no idea. If you want answers like that, you should probably go talk to a statistician at sufficient length to convey the domain-specific knowledge involved or learn statistics yourself.
Interesting feedback.
Ha, I wish. No, it’s more specific to literature.
Have to define your features somehow.
I don’t understand what this means. Can you say more?
http://en.wikipedia.org/wiki/Feature_%28machine_learning%29 A specific concrete variable you can code up, like ‘total number of commas’.
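As a rough sketch of what "a variable you can code up" might look like, in Python: the first function is the comma example, and the second does the same kind of counting for Marks. The Mark names and the idea of recording each book as a list of Mark labels are hypothetical, chosen only to show the shape of the thing.

```python
from collections import Counter

def comma_count(text):
    """The textual example: total number of commas in a passage."""
    return text.count(",")

def mark_features(observed_marks, vocabulary):
    """Turn a list of Mark labels recorded for one book into a fixed-length count vector."""
    counts = Counter(observed_marks)
    # Counter returns 0 for labels that never occur, so absent Marks become zeros.
    return [counts[mark] for mark in vocabulary]

# Hypothetical Mark vocabulary and one hypothetical book's recorded Marks.
vocabulary = ["smudge", "broken_type", "offset_last_page"]
book = ["smudge", "broken_type", "smudge"]

vector = mark_features(book, vocabulary)
```

Once every book is reduced to a vector like this, the same methods used on word lengths or function-word counts can in principle be pointed at it.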