I’m looking for some concept which I am sure has been talked about before in stats but I’m not sure of the technical term for it.
Let's say you have a function you are trying to guess, with a certain range and domain. How would you talk about the amount of data you would need to likely recover the actual function from noisy data? My current thoughts are that the larger the cardinality of the domain, the more data you would need (in a simple relationship), and that the type of noise would determine how much the size of the range affects the amount of data you would need.
How would you talk about the amount of data you would need to likely recover the actual function from noisy data?
First, I would specify what set my ‘function’ is in. Are there 2 possibilities? 10? A million? If there are x possibilities, log2(x) tells me how many bits of information I need. Then I would treat the data as coming to me through a noisy channel. How noisy? I assume you already know how noisy. Now I can plug the noise level into Shannon’s noisy-channel coding theorem, and that tells me how many noisy bits I need to get my log2(x) bits.
(This all seems like very basic information theory, which makes me wonder if something makes your problem harder than this.)
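To make that concrete, here's a minimal sketch of the calculation, assuming the noise behaves like a binary symmetric channel with a known flip probability (the channel model and the helper names are just illustrative choices):

```python
import math

def binary_entropy(p: float) -> float:
    """Entropy (in bits) of a coin that comes up heads with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def noisy_bits_needed(num_hypotheses: int, flip_prob: float) -> float:
    """Rough lower bound on how many noisy binary observations you need to
    single out one hypothesis, treating each observation as one use of a
    binary symmetric channel that flips bits with probability flip_prob."""
    bits_needed = math.log2(num_hypotheses)     # the log2(x) bits to identify the function
    capacity = 1.0 - binary_entropy(flip_prob)  # information per noisy observation (channel capacity)
    return bits_needed / capacity

# e.g. 4 candidate functions, observations flipped 10% of the time:
print(noisy_bits_needed(4, 0.1))  # about 3.8 noisy bits to get the 2 clean bits
```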
By data I meant training data (in a machine learning context), not information.
And it wasn’t really the math I was after (it’s quite simple), just whether it had been discussed before.
My thoughts on the math: if the cardinality of the input is X and the output is Y, then the size of the space of functions you are exploring is bounded by Y^X. E.g. there are 4 possible functions from 1 binary bit to another (set to 0, set to 1, invert, keep the same). I’ve come across this in simple category theory.
However, in order to fully specify which function it is (assuming no noise), you need a minimum of 2 pieces of training data (where training data means input-output pairs). If you have the training pair (0,0), you don’t know whether that means ‘keep the same’ or ‘set to 0’. Fairly obviously, you need as many unique samples of training data as the cardinality of the domain, and more when you have noise.
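Here's a small sketch of that elimination argument for the one-bit case (the helper names are just illustrative):

```python
from itertools import product

def all_functions(domain, codomain):
    """Enumerate every function from domain to codomain as a dict.
    There are len(codomain) ** len(domain) of them."""
    for outputs in product(codomain, repeat=len(domain)):
        yield dict(zip(domain, outputs))

def consistent(functions, training_pairs):
    """Keep only the functions that agree with every (input, output) pair."""
    return [f for f in functions if all(f[x] == y for x, y in training_pairs)]

funcs = list(all_functions([0, 1], [0, 1]))
print(len(funcs))                           # 4 = 2 ** 2 possible functions
print(consistent(funcs, [(0, 0)]))          # 2 remain: 'set to 0' and 'keep the same'
print(consistent(funcs, [(0, 0), (1, 1)]))  # 1 remains: 'keep the same'
```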
This is a less efficient way of getting information about a function than just getting its Turing number or similar directly.
So I’m really wondering if there are guidelines for people designing machine learning systems that I am missing. For example, if you know you can only get 5000 training examples, you know that a system which tries to learn from the entire space of functions over a domain much larger than 5000 is not going to be very accurate, unless you have put a lot of information into the prior/hypothesis space.
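As a rough back-of-the-envelope version of that point, assuming noiseless, distinct examples and a uniform prior over all functions (the numbers and the helper name are just illustrative):

```python
import math

def leftover_uncertainty_bits(domain_size: int, codomain_size: int,
                              num_examples: int) -> float:
    """Bits of uncertainty left about a function after seeing num_examples
    distinct, noiseless input-output pairs, under a uniform prior over all
    codomain_size ** domain_size functions."""
    unseen_inputs = max(domain_size - num_examples, 0)
    return unseen_inputs * math.log2(codomain_size)

# 5000 examples against a binary-valued function on a domain of a million points:
print(leftover_uncertainty_bits(10**6, 2, 5000))  # 995000.0 bits still unspecified
```

All of that leftover uncertainty sits on the inputs you never saw, which is why the prior/hypothesis space has to carry the load.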
The closest thing I can think of is identifiability, although that’s more about whether it’s possible to identify a function given an arbitrarily large amount of data.
Hmm, not quite what I was looking for, but interesting nonetheless.
Thanks.