datadataeverywhere comments on Frugality and working from finite data

datadataeverywhere 5 Sep 2010 21:32 UTC
3 points
You seem very confident in both those points. Can you justify that? I’m familiar with both generic and specialized compression algorithms (I have implemented and reverse-engineered both kinds), and I don’t personally see a way to ensure that (accurate) cosmological models substantially outperform generic compression, at least without loss. On the other hand, I have little experience with astronomy, so please correct me where I’m making inaccurate assumptions.

I’m imagining this database to be structured so that it holds rows along the lines of (datetime, position, luminosity by spectral component). Since I don’t have a background in astronomy, maybe that’s a complete misunderstanding. However, I see this as holding an enormous number of events, each of which consists of a small amount of information, and most of which are either unrelated to other events in the database or related so trivially that very simple rules would predict them better than an attempt to model all of the physical processes occurring in the stars [or whatever] that were the source of the event.
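To make the "very simple rules" point concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the synthetic luminosity series, the 16-bit quantization, and the helper names (`make_series`, `delta_encode`, and so on) are hypothetical, not real survey data or a real catalogue format. It compares generic compression (zlib) of the raw readings with zlib applied after the trivial rule "the next reading equals the previous one", which involves no physical modelling at all.

```python
import random
import struct
import zlib


def make_series(n=10000, seed=0):
    """Hypothetical luminosity readings for one star: a slow random drift,
    quantized to unsigned 16-bit integers. Purely synthetic data."""
    rng = random.Random(seed)
    level = 30000
    readings = []
    for _ in range(n):
        level += rng.randint(-3, 3)
        level = max(0, min(65535, level))
        readings.append(level)
    return readings


def pack_u16(values):
    """Serialize integers as little-endian unsigned 16-bit values."""
    return struct.pack(f"<{len(values)}H", *values)


def delta_encode(values):
    """Trivial predictor: guess that each reading equals the previous one
    and store only the prediction error, modulo 2**16."""
    prev = 0
    residuals = []
    for v in values:
        residuals.append((v - prev) % 65536)
        prev = v
    return residuals


def delta_decode(residuals):
    """Invert delta_encode exactly, so the scheme stays lossless."""
    prev = 0
    values = []
    for r in residuals:
        prev = (prev + r) % 65536
        values.append(prev)
    return values


series = make_series()
raw_size = len(zlib.compress(pack_u16(series), 9))
delta_size = len(zlib.compress(pack_u16(delta_encode(series)), 9))
print("zlib on raw readings:   ", raw_size, "bytes")
print("zlib on delta residuals:", delta_size, "bytes")
```

On a slowly drifting series like this one, the residuals take only a handful of distinct values, so I would expect the second number to be the smaller of the two. That says nothing about real astronomical data; it only illustrates the kind of cheap baseline a physical model would have to beat.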

Part of the reason I feel this way is that we can gather so little information. The luminosity of a star varies, and we understand at least something about what can make it vary, but I am currently under the impression that actually understanding a distant star’s internal processes is so far beyond what we can infer from the little light we receive that most of the variance is expected but not predictable. We don’t even understand our own Sun that well!

There is also the problem of weighting. Even if I assume that an accurate cosmological model would compress well, a model that accurately predicts stellar life cycles but wholly misunderstands the acceleration of the expansion of the universe would do much better than a model that accurately captured all of that but was, to even a small degree, less well fitted to observed stellar life cycles (even if it is more accurate and less overfitted). Some of the most interesting questions we are investigating right now concern the rarest events. If we have a row in the database for each observable time period, we start with an absolutely enormous number of rows for each observable star, but once-in-a-lifetime events are what really intrigue and confound us. With so little data on those events, compressing them is simply not worth the compressor’s time, relative to compressing the much better understood phenomena.
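The weighting argument is just arithmetic on total code length, so a small worked example may help. All numbers below are made up for illustration: the row counts and the bits-per-row figures for the two hypothetical models are assumptions, not measurements.

```python
# Hypothetical row counts: routine observations vastly outnumber rare events.
ROWS = {"common": 1_000_000_000, "rare": 1_000}

# Hypothetical average cost, in bits per row, for two imaginary models.
# Model A fits routine stellar behaviour slightly better but is poor on rare events.
# Model B handles rare events far better but is slightly worse on routine rows.
BITS_PER_ROW = {
    "A": {"common": 10.0, "rare": 40.0},
    "B": {"common": 10.1, "rare": 20.0},
}


def total_bits(model):
    """Total compressed size: sum over row classes of count * bits per row."""
    costs = BITS_PER_ROW[model]
    return sum(ROWS[kind] * costs[kind] for kind in ROWS)


total_a = total_bits("A")
total_b = total_bits("B")
print("model A total bits:", total_a)
print("model B total bits:", total_b)
print("A beats B by", total_b - total_a, "bits")
```

With these assumed figures, halving the cost of every rare row saves 20,000 bits, while a 1% penalty on the common rows costs 100,000,000 bits, so the model that misunderstands the rare events still wins on compressed size by a wide margin.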