Really interesting post. To me, approaching information with mathematics seems like a black box—and in this post, it feels like magic.
I’m a little confused by the concept of cost: I understand that it takes more data to represent more complex systems, which grows exponentially faster than the number of bits. But doesn’t the more complex model still strictly fit the data better? Or is it just going for a different goal than accuracy? I feel like I’m missing the entire point of the end.
I am not sure whether my take on this is correct, so I’d be thankful if someone corrects me if I am wrong:
I think that if the goal was only ‘predicting’ this bit-sequence after knowing the sequence itself, one could just state probability 1 for the known sequence.
In the OP, by contrast, we regard the bit-sequence as stemming from some sequence-generator, of which only this part of the output is known. Here we have only limited data, so the cost of singling out a highly complex model from model-space has to be weighed against that model’s fit to the bit-sequence.
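To make that trade-off concrete, here is a rough sketch with toy numbers of my own (nothing here is taken from the OP). It scores each model by the bits needed to state the model plus the bits needed to encode the data under it. The 16-bit sequence, the 4-bit price for the biased coin's parameter, and the names `data_bits` and `total_bits` are all assumptions made for illustration.

```python
import math

def data_bits(bits, p_one):
    """Bits needed to encode `bits` under an i.i.d. model with P(1) = p_one."""
    return -sum(math.log2(p_one if b else 1.0 - p_one) for b in bits)

def total_bits(model_bits, bits, p_one):
    """Two-part cost: bits to state the model plus bits to encode the data."""
    return model_bits + data_bits(bits, p_one)

# Made-up observations: 13 ones and 3 zeros.
observed = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1]

# Fair coin: no parameter to state, exactly 1 bit per observation.
fair = total_bits(0, observed, 0.5)

# Biased coin: assume the parameter costs 4 bits to state (p = 13/16).
biased = total_bits(4, observed, 13 / 16)

# "Memorizer": the model is the sequence itself, so stating it costs
# len(observed) bits, after which the data has probability 1 (0 bits).
memorizer = len(observed) + 0

print(f"fair coin:   {fair:.2f} bits")
print(f"biased coin: {biased:.2f} bits")
print(f"memorizer:   {memorizer:.2f} bits")
```

With these numbers the memorizer fits the data perfectly but is no cheaper overall than the fair coin, because all of its cost has moved into describing the model. The biased coin comes out ahead: its extra 4 bits of model description are more than repaid by the better fit.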