The definitions of information I have in mind are based on some sort of classification or generation loss. So, for instance, if you trained GPT-3 on the second drive, I would expect it to get lower loss on the datasets it was evaluated on (some mixture of Wikipedia and other internet scrapes, if I recall correctly) than it would if it were trained on the first drive. So by the measures I have in mind, the second drive could well contain more information.
(Though of course this depends on your exact loss function.)
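To make that concrete, here is a minimal sketch of the kind of loss I mean, assuming the HuggingFace transformers library and using GPT-2 as a stand-in for GPT-3 (which isn't publicly available): the measure is just the model's average per-token cross-entropy on some evaluation text, and a model trained on the more informative drive would be expected to score lower here.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# GPT-2 standing in for GPT-3 here; the point is only to show the
# shape of an "information-based" generation loss, not the exact setup.
model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model.eval()

def eval_loss(text: str) -> float:
    """Average per-token cross-entropy (negative log-likelihood) on `text`.

    Lower values mean the model predicts the evaluation text better,
    i.e. it extracted more usable information from its training data.
    """
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels makes the model compute its own shifted cross-entropy.
        outputs = model(**inputs, labels=inputs["input_ids"])
    return outputs.loss.item()

print(eval_loss("The capital of France is Paris."))
```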
My post is basically based on the observation that we often train machine learning systems with some sort of information-based loss, whether self-supervised/generative, fully supervised, or something more complicated than that. Even if you achieve a better loss for your model, you won't necessarily achieve a better reward for your agent.