[A text with some decent discussion on the topic](http://www.inference.phy.cam.ac.uk/mackay/itila/book.html). At least one group with a shot at winning a major speech recognition benchmark competition uses information-theoretic ideas in developing its recognizer. Another development has been the use of error-correcting codes to assist in multi-class classification problems ([google "error correcting codes machine learning"](http://www.google.com/search?sourceid=chrome&ie=UTF-8&q=error+correcting+codes+machine+learning)); arguably this is the clearest example of a compression-inspired idea having a big impact in machine learning. I don't know how many people think about these problems in terms of information theory (since I don't have much access to their thoughts), but I do know at least two very competent researchers who, although they never bring it up outright in their papers, have an information-theoretic, compression-oriented way of posing and thinking about problems.
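To make the error-correcting-codes idea concrete, here is a minimal sketch using scikit-learn's `OutputCodeClassifier`: each class is assigned a binary codeword, one binary classifier is trained per codeword bit, and a test point is decoded to the class with the nearest codeword. The dataset and parameter choices below are illustrative assumptions of mine, not anything from the papers:

```python
# Minimal sketch: error-correcting output codes for multi-class classification.
# Dataset and hyperparameters are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OutputCodeClassifier

# Toy 5-class problem.
X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           n_classes=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each class gets a random binary codeword of length code_size * n_classes;
# one binary classifier is trained per bit, and prediction decodes to the
# class whose codeword is closest to the predicted bit pattern.
ecoc = OutputCodeClassifier(LogisticRegression(max_iter=1000),
                            code_size=2.0, random_state=0)
ecoc.fit(X_train, y_train)
print("test accuracy:", ecoc.score(X_test, y_test))
```

The redundancy in the codewords is exactly the error-correction: even if a few of the per-bit classifiers are wrong on a given point, decoding to the nearest codeword can still recover the right class.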
I often try to think of how humans process speech in terms of information theory (an approach inspired by a couple of great thinkers in the area), and so I think it is useful for understanding and probing questions of sensory perception.
There’s also a whole literature on “sparse coding” (another compression-oriented idea, originally developed by biologists but since ported over by computer vision and a few speech researchers) whose promise in machine learning may not have been realized yet, though I have seen at least a couple of somewhat impressive applications of related techniques.
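For concreteness, here is a minimal sketch of the sparse-coding step itself: given a fixed dictionary D, find a sparse coefficient vector a minimizing ½‖x − Da‖² + λ‖a‖₁ via iterative soft-thresholding (ISTA). The random dictionary below is an illustrative stand-in; in the literature the dictionary is itself learned from data:

```python
# Minimal sketch of sparse coding via ISTA (iterative soft-thresholding).
# Solves min_a 0.5*||x - D a||^2 + lam*||a||_1 for a fixed dictionary D.
# The random dictionary here is an illustrative stand-in for a learned one.
import numpy as np

def ista_sparse_code(D, x, lam=0.1, n_iters=200):
    # Step size from the largest singular value of D (Lipschitz constant
    # of the gradient of the reconstruction term).
    L = np.linalg.norm(D, 2) ** 2
    a = np.zeros(D.shape[1])
    for _ in range(n_iters):
        # Gradient step on the reconstruction term ...
        g = a - (D.T @ (D @ a - x)) / L
        # ... followed by soft-thresholding, which zeroes small coefficients.
        a = np.sign(g) * np.maximum(np.abs(g) - lam / L, 0.0)
    return a

rng = np.random.default_rng(0)
D = rng.normal(size=(64, 256))                      # 64-dim signals, 256 atoms
D /= np.linalg.norm(D, axis=0)                      # unit-norm atoms
x = D[:, rng.choice(256, 5)] @ rng.normal(size=5)   # signal built from 5 atoms
a = ista_sparse_code(D, x)
print("nonzero coefficients:", np.count_nonzero(np.abs(a) > 1e-6))
print("reconstruction error:", np.linalg.norm(x - D @ a))
```

The compression reading is direct: the signal x is summarized by a handful of nonzero coefficients over a shared dictionary, rather than by all of its raw dimensions.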
This isn’t precisely what Daniel_Burfoot was talking about, but it’s a related idea based on “sparse coding”, and it has recently obtained good results in classification:
http://www.di.ens.fr/~fbach/icml2010a.pdf
Here the “theories” are hierarchical dictionaries (a discrete hierarchy index set plus a set of vectors) which perform a compression by creating reconstructions of the data. Although they weren’t developed with this in mind, support vector machines do this as well: one finds a small number of “support vectors” that essentially compress the information about the decision boundary in a classification problem (support vector machines are one of the very few things from machine learning since neural networks to have had a significant, successful impact elsewhere).
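A quick illustration of that compression view, again with scikit-learn (the dataset and kernel settings are my own assumptions for the sketch): after training, the decision function is fully determined by the support vectors, typically a small fraction of the training set:

```python
# Sketch: an SVM's decision boundary is determined entirely by its support
# vectors, so the training set is "compressed" down to that subset.
# Dataset and kernel settings are illustrative assumptions.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=1000, centers=2, cluster_std=1.5, random_state=0)
clf = SVC(kernel="rbf", C=1.0).fit(X, y)

# Everything needed to evaluate clf.decision_function lives in these arrays;
# the remaining training points can be discarded.
print("training points :", len(X))
print("support vectors :", clf.support_vectors_.shape[0])
```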
The hierarchical dictionaries learned do contain a “theory” of the visual world in a sense, although an important point is that they do so in a way that is sensitive to the application at hand. Daniel_Burfoot leaves out much about how people actually go about implementing this line of thought.