So every theory has to cover all of cosmology? Most astrophysicists study really specific things, and make advances in those.
In this case, the researcher would take the current standard compressor/model and make a modification to a single module or component of the software, and then show that the modification leads to improved codelengths.
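Concretely, the comparison protocol could look something like the sketch below. This is only an illustration: the `model.prob(x)` interface is hypothetical, and codelength is measured as total negative log-probability, the usual MDL / arithmetic-coding convention.

```python
import math

# Hypothetical interface: `model.prob(x)` returns the probability the model
# assigns to observation x (illustrative names, not a real library).
def codelength_bits(model, observations):
    """Total codelength in bits: -sum(log2 p(x)) under the model's
    predictive distribution (the standard MDL / arithmetic-coding bound)."""
    return -sum(math.log2(model.prob(x)) for x in observations)

def improves(baseline, modified, observations, module_description_bits=0.0):
    """Accept the modified module only if it shortens the total codelength,
    after charging for any extra bits needed to describe the change itself."""
    return (codelength_bits(modified, observations) + module_description_bits
            < codelength_bits(baseline, observations))
```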
No physical theory will outperform 7zip (or whatever), because to get all the digits right it will need either the raw digits themselves or correction factors of nearly that size anyway.
You’re getting at an important subtlety: if the observations include many digits of precision, it will be impossible to achieve good compression in absolute terms. But the absolute rate is irrelevant; the point is to compare theories. Maybe theory A can only achieve 10% compression, but if the previous champion only gets 9%, then theory A should be preferred. And a specialized compressor based on astrophysical theory will still outperform 7zip on a database of cosmological observations, though perhaps by only a small amount in absolute terms.
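To make that concrete, here is a toy sketch of why a theory that predicts the observations can beat a generic byte-level compressor: encode each observation as the model's prediction plus a residual, and charge bits only for the residuals. The quantization step `quantum` is an arbitrary illustrative choice, and the entropy estimate stands in for a real entropy coder.

```python
import numpy as np

def residual_bits(predictions, observations, quantum=1e-3):
    """Rough codelength for prediction residuals: quantize the errors to a
    fixed precision, then charge log2(1/frequency) bits per symbol as an
    entropy-coding estimate of how many bits the residuals would take."""
    residuals = np.round((np.asarray(observations) - np.asarray(predictions)) / quantum)
    _, counts = np.unique(residuals, return_counts=True)
    probs = counts / counts.sum()
    return float(-(counts * np.log2(probs)).sum())
```

A theory that predicts well leaves small, low-entropy residuals, so it wins the comparison even if the absolute compression ratio stays modest.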
You seem very confident in both of those points. Can you justify that? I’m familiar with both generic and specialized compression algorithms (I’ve implemented and reverse-engineered both), and I don’t personally see a way to ensure that (accurate) cosmological models substantially outperform generic compression, at least not losslessly. On the other hand, I have little experience with astronomy, so please correct me where I’m making inaccurate assumptions.
I’m imagining this database to be structured so that it holds rows along the lines of (datetime, position, luminosity by spectral component). Since I don’t have a background in astronomy, maybe that’s a complete misunderstanding. However, I see this as holding an enormous number of events, each of which carries a small amount of information, and most of which are either unrelated to the other events in the database or related so trivially that very simple rules would predict them better than trying to model all of the physical processes occurring in the stars [or whatever] that were the source of each event.
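To be explicit about what I’m picturing, a row might look something like this. The field names are my guesses for illustration, not any real catalogue format.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict

@dataclass
class Observation:
    """One hypothetical row of the observation database."""
    timestamp: datetime             # when the measurement was taken
    ra_deg: float                   # sky position: right ascension, degrees
    dec_deg: float                  # sky position: declination, degrees
    flux_by_band: Dict[str, float]  # luminosity/flux per spectral band
```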
Part of the reason I feel this way is that we can gather so little information. The luminosity of a star varies, and we understand at least some of what can make it vary, but my current impression is that actually understanding a distant star’s internal processes is so far beyond what we can infer from the little light we receive that most of the variance is expected but not predictable. We don’t even understand our own Sun that well!
There is also the problem of weighting. If I assume that an accurate cosmological model would compress well, then a model that accurately predicts stellar life cycles but wholly misunderstands the acceleration of the expansion of the universe would still do much better than a model that captured all of that correctly but was, even to a small degree, less well fitted to the observed stellar life cycles (even if the latter is more accurate and less overfitted). Some of the most interesting questions we are investigating right now concern the rarest events. If we have a row in the database for each observable time period, you start with an absolutely enormous number of rows for each observable star, but the once-in-a-lifetime events are what really intrigue and confound us; with so little data behind them, compressing them is simply not worth the compressor’s time relative to compressing the much better understood phenomena.
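Spelled out with entirely made-up numbers, the worry is just this arithmetic:

```python
# Entirely made-up numbers, only to show how the totals weigh against each other.
routine_rows    = 10_000_000   # ordinary per-period rows for well-understood stars
rare_event_rows = 50           # once-in-a-lifetime events

gain_routine_bits = 0.1        # bits saved per routine row by slightly better stellar modelling
gain_rare_bits    = 1000.0     # bits saved per rare row by radically better physics

print(routine_rows * gain_routine_bits)    # 1000000.0 bits saved on the routine data
print(rare_event_rows * gain_rare_bits)    # 50000.0 bits saved on the rare events
```

On numbers like these the rare events barely register in the total, which is the sense in which compressing them is “not worth the compressor’s time.”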