whpearson comments on Open Thread June 2010, Part 3

whpearson 25 Jun 2010 14:20 UTC
1 point
By data I meant training data (in a machine learning context), not information.

And it wasn’t really the math I was after, it is quite simple, just whether it had been discussed before.

My thoughts on the math. If the cardinality of the input is X and output is Y then the bound for the space of functions you are exploring is by X^Y. E.G. there are 4 possible functions from 1 binary bit to another. (Set to 0, Set to 1, Invert, Keep the Same). I’ve come across this in simple category theory.

However in order to fully specify which function it is (assuming no noise). You need a minimum of 2 pieces of training data (where training data is input ouput pairs). If you have the training pair (0,0) you don’t know if that means keep the same or set to 0. Fairly obviously you need as many unique samples of training data as the cardinality of the domain. You need more when you have noise.

This is a less efficient way of getting information about functions, than just getting the Turing number or similar.

So I’m really wondering if there are guidelines for people designing machine learning systems, that I am missing. So if you know you can only get 5000 training examples, you know you that a system which tries to learn from the entire space of functions with the domain much larger that 5000 is not going to be very accurate unless you have put a lot of information into the prior/hypothesis space.