This is a problem that machine learning can tackle. Feel free to contact me by PM for technical help.
To make sure I understand your problem:
We have many copies of the Big Book. Each copy is a collection of many sheets. Each sheet was produced by a single tool, but each tool produces many sheets. Each shop contains many tools, but each tool is owned by only one shop.
Each sheet has information in the form of marks. Sheets created by the same tool at similar times have similar marks. It may be the case that the marks monotonically increase until the tool is repaired.
Right now, we have enough to take a database of marks on sheets, figure out how many tools we think there were and how likely it is that each sheet came from each potential tool, and cluster tools into likely shops. (Note that a ‘tool’ here is probably only one repair cycle of an actual tool, if they are able to repair it all the way to freshness.)
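To make that concrete, here is a minimal sketch of the unsupervised core, assuming each sheet's marks have already been boiled down to a fixed-length numeric feature vector (that extraction step is the real work, and is hand-waved here). A Gaussian mixture with BIC-based model selection is just one plausible choice, not the only one:

```python
# Minimal sketch, not a working pipeline. Assumes `sheet_features` is an
# (n_sheets, n_features) array summarizing the marks on each sheet.
import numpy as np
from sklearn.mixture import GaussianMixture

def estimate_tools(sheet_features: np.ndarray, max_tools: int = 50):
    """Pick the number of candidate tools by BIC and return per-sheet
    membership probabilities under the best-scoring mixture."""
    best_model, best_bic = None, np.inf
    for k in range(1, max_tools + 1):
        gmm = GaussianMixture(n_components=k, covariance_type="diag",
                              random_state=0).fit(sheet_features)
        bic = gmm.bic(sheet_features)
        if bic < best_bic:
            best_model, best_bic = gmm, bic
    # responsibilities[i, j] = P(sheet i came from candidate tool j)
    return best_model, best_model.predict_proba(sheet_features)
```

Grouping tools into shops could then be a second, coarser clustering over the fitted component means.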
We can either do this unsupervised, and then compare to whatever other information we can find (if we have a subcollection of sheets with known origins, we can see how well the estimated probabilities did), or we can try to include that information for supervised learning.
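The "compare to known origins" check can be as light as an agreement score over the labeled subset. Another sketch; `known_idx` and `known_labels` are hypothetical arrays holding the row indices and true origins of the sheets we can actually pin down:

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

def score_against_known(responsibilities, known_idx, known_labels):
    """Compare hard cluster assignments (argmax of the per-sheet
    probabilities from the sketch above) to the known origins.
    Adjusted Rand Index: 1.0 = perfect agreement, ~0.0 = chance."""
    predicted = np.argmax(responsibilities[known_idx], axis=1)
    return adjusted_rand_score(known_labels, predicted)
```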
That’s a hell of a summary, thanks!
I’m glad you mentioned the repair cycle of tools. There are some tools that are regularly repaired (let’s just call them “Big Tools”) and some that aren’t (“Little Tools”). Both are expensive to acquire and expensive to repair, but it seems the Print Shops chose to repair Big Tools because those were subject to breakage that significantly reduced performance.
I should add another twist since you mentioned sheets of known origins: assume that we can only decisively assign origins to single-sheet publications. Two problems stem from this assumption: first, not all relevant Marks are left on such sheets; second, very few single-sheet publications survive. Collations of more than one sheet are subject to all of the problems of the Big Book.
I’m most interested in the distinction between unsupervised and supervised learning. And I will very likely PM you to learn more about machine learning. Again, thanks for your help!
EDIT: I just noticed a mistake in your summary. Each sheet is produced by a set of tools, not a single tool. Each mark is produced by a single tool.
Okay. Are the classes of marks distinct by tool type (that is, if I see a mark on a sheet, do I know whether it came from tool type X or tool type Y), or do we need to try to discover what sorts of marks the various tools can leave?
Fortunately, we know which tool types leave which marks. We also have a very strong understanding of the ways in which tools break and leave marks.
Thanks again for entertaining this line of inquiry.
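That helps a lot. Since each mark comes from a single tool and the mark-to-tool-type mapping is known, the natural unit to cluster is the mark rather than the sheet: one clustering per tool type, after which each sheet inherits the set of inferred tools behind its marks, matching the correction that a sheet is produced by a set of tools. A rough sketch, where the Mark record and its numeric features are hypothetical stand-ins:

```python
# Sketch under the corrected model: each mark was left by exactly one
# tool, and the tool *type* behind a mark is already known.
from collections import defaultdict
from dataclasses import dataclass

import numpy as np
from sklearn.mixture import GaussianMixture

@dataclass
class Mark:
    sheet_id: str
    tool_type: str        # known a priori, per the discussion above
    features: np.ndarray  # hypothetical numeric description of the mark

def infer_tools_per_sheet(marks: list[Mark], n_tools_guess: int = 20):
    """Cluster marks within each tool type; marks in one cluster were
    plausibly left by the same physical tool (one repair cycle of it)."""
    by_type = defaultdict(list)
    for m in marks:
        by_type[m.tool_type].append(m)
    sheet_tools: dict[str, set] = {}
    for tool_type, group in by_type.items():
        X = np.stack([m.features for m in group])
        k = min(n_tools_guess, len(group))  # no more clusters than marks
        labels = GaussianMixture(n_components=k, covariance_type="diag",
                                 random_state=0).fit(X).predict(X)
        for m, label in zip(group, labels):
            # cluster ids are only unique within a tool type, so key on both
            sheet_tools.setdefault(m.sheet_id, set()).add((tool_type, int(label)))
    return sheet_tools  # sheet_id -> set of inferred tools that touched it
```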
Good point!
Also yay combining multiple fields of knowledge and expertise! applause
Seriously though, the world does need more of it, and I felt the need to explicitly reward and encourage this.
Thanks! I feel explicitly encouraged.