How would you talk about the amount of data you would need to likely get the actual functions with noisy data?
First, I would specify what set my ‘function’ is in. Are there 2 possibilities? 10? A million? If there are x possibilities, log2(x) tells me how many bits of information I need. Then I would treat the data as coming to me through a noisy channel. How noisy? I assume you already know how noisy. Now I can plug the noise level into Shannon’s noisy-channel coding theorem, and that tells me how many noisy bits I need to get my log2(x) bits.
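That back-of-the-envelope calculation can be sketched in a few lines. This is a rough sketch under an assumption of mine, namely that each noisy observation behaves like one use of a binary symmetric channel with a known flip probability; the function and parameter names are hypothetical, not from the discussion above.

```python
import math

def noisy_samples_needed(num_hypotheses: int, flip_prob: float) -> float:
    """Rough lower bound on noisy binary observations needed to single out
    one hypothesis from num_hypotheses, via the channel-coding theorem.
    Assumes 0 < flip_prob < 0.5 (a binary symmetric channel)."""
    bits_needed = math.log2(num_hypotheses)  # the log2(x) target
    # Binary entropy of the flip probability:
    h = -flip_prob * math.log2(flip_prob) - (1 - flip_prob) * math.log2(1 - flip_prob)
    capacity = 1.0 - h  # bits recoverable per noisy observation
    return bits_needed / capacity

# A million hypotheses seen through a 10%-flip channel:
# log2(1e6) ~ 19.93 bits at ~0.53 bits per observation, so roughly 38 observations.
print(noisy_samples_needed(1_000_000, 0.1))
```

With no noise (capacity 1) this collapses back to plain log2(x); as the flip probability approaches 0.5 the capacity goes to zero and the required sample count blows up, as you would expect.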
(This all seems like very layman information theory, which makes me wonder if something makes your problem harder than this.)
By data I meant training data (in a machine learning context), not information.
And it wasn’t really the math I was after (it is quite simple), just whether it had been discussed before.
My thoughts on the math: if the cardinality of the input is X and the output is Y, then the space of functions you are exploring has size Y^X. E.g. there are 4 possible functions from one binary bit to another (set to 0, set to 1, invert, keep the same). I’ve come across this in simple category theory.
However, in order to fully specify which function it is (assuming no noise), you need a minimum of 2 pieces of training data (where training data is input–output pairs). If you have the training pair (0, 0), you don’t know whether that means keep the same or set to 0. Fairly obviously, you need as many unique samples of training data as the cardinality of the domain, and more when you have noise.
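The one-bit example can be made concrete by enumerating all four functions and filtering them against the observed pairs (a minimal sketch; the function names and representation are my own):

```python
# Each function {0,1} -> {0,1} is represented as the tuple (f(0), f(1)).
# There are 2^2 = 4 such functions: codomain size ** domain size.
functions = {
    "set_to_0": (0, 0),
    "set_to_1": (1, 1),
    "invert":   (1, 0),
    "keep":     (0, 1),
}

def consistent(training_pairs):
    """Names of the functions consistent with every (input, output) pair seen."""
    return [name for name, f in functions.items()
            if all(f[x] == y for x, y in training_pairs)]

# One sample leaves ambiguity; covering the whole domain resolves it.
print(consistent([(0, 0)]))           # ['set_to_0', 'keep']
print(consistent([(0, 0), (1, 1)]))   # ['keep']
```

The single pair (0, 0) is consistent with both “set to 0” and “keep the same”, which is exactly the ambiguity described above; adding the second domain point pins the function down.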
This is a less efficient way of getting information about functions than just getting the Turing number or similar.
So I’m really wondering if there are guidelines for people designing machine learning systems that I am missing. If you know you can only get 5000 training examples, then you know that a system which tries to learn from the entire space of functions on a domain much larger than 5000 is not going to be very accurate, unless you have put a lot of information into the prior/hypothesis space.
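As a rough rule of thumb, the noiseless counting argument above can be written out directly: the bits needed to specify a function over the whole function space, divided by the best-case bits per example, reduces to the domain size (a sketch under my own assumption that each example’s output is maximally informative; the function name is hypothetical):

```python
import math

def needed_examples(domain_size: int, codomain_size: int) -> float:
    """Noiseless best-case example count: bits to pick one function out of
    codomain_size ** domain_size, divided by bits per (input, output) pair."""
    bits_to_specify = domain_size * math.log2(codomain_size)
    bits_per_example = math.log2(codomain_size)  # one output, fully observed
    return bits_to_specify / bits_per_example    # simplifies to domain_size

# A domain of 2^20 points needs ~1,048,576 noiseless examples,
# far beyond a 5000-example budget, without a restrictive prior.
print(needed_examples(1 << 20, 2))
```

The bits cancel, so the bound is just the cardinality of the domain, matching the “as many unique samples as the domain” observation above; noise only pushes the number higher.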