It might be tempting to think you could use multivariate statistics like factor analysis to distill useful information out of the garbage by identifying axes which give you an unusually large amount of information about the system. In my experience, that doesn’t work well, and if you think about it for a bit, it becomes clear why: if the data has a 50,000:1 ratio of garbage : blessed information, then finding an axis which explains 10 variables’ worth of information still leaves you with a 5,000:1 ratio of garbage : blessed. The distillation you get with such techniques is simply not strong enough.[1][2]
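To make that arithmetic concrete, here is a back-of-envelope sketch using the illustrative numbers from the paragraph above (the ratio and the 10-variables-per-axis figure are just the made-up values from the text, not measurements):

```python
# Illustrative arithmetic only, using the made-up ratio from the text:
# one extracted axis that explains ~10 variables' worth of garbage
# shrinks the garbage side by 10x, which barely dents the ratio.
garbage_to_blessed = 50_000          # 50,000 : 1 garbage : blessed
variables_per_axis = 10              # how much one factor summarizes
remaining = garbage_to_blessed / variables_per_axis
print(f"{remaining:.0f} : 1")        # still a ~5,000 : 1 ratio
```

The point being that a multiplicative 10x reduction is negligible against a ratio that starts four or five orders of magnitude deep.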
That doesn’t seem like the reason. In many contexts, a 10x saving is awesome, and definitely a ‘blessed’ improvement if you can kill 90% of the noise in anything you have to work with. But you don’t want to do that with logs. You can’t distill information in advance of a bug (or anomaly, or attack) because a bug by definition is going to be breaking all of the past behavior & invariants governing normal behavior that any distillation was based on. If it didn’t, it would usually be fixed already. (“We don’t need to record variable X in the log, which would be wasteful accursed clutter, because X cannot change.” NARRATOR: “X changed.”) The logs are for the exceptions, which are precisely the information that any non-end-to-end lossy compression (factor analysis or otherwise) will ‘correctly’ throw out as residuals to ignore in favor of the ‘signal’. Which is why the best debugging systems, like time-travel debugging or the shiny new Antithesis, work hard to de facto save everything.
I’d say “in many contexts” in practice refers to when you are already working with relatively blessed information. It’s just that while most domains are overwhelmingly filled with garbage information (e.g. if you put up a camera at a random position on the earth, what it records will be ~useless), the fact that they are so filled with garbage means that we don’t naturally think of them as being “real domains”.
Basically, I don’t mean that blessed information is some obscure thing that you wouldn’t expect to encounter; I mean that people try to work with as much blessed information as possible. Logs were sort of a special case in being unusually garbage-heavy.
You can’t distill information in advance of a bug (or anomaly, or attack) because a bug by definition is going to be breaking all of the past behavior & invariants governing normal behavior that any distillation was based on.
Depends. If the system is very buggy, there are gonna be lots of bugs to distill from. Which brings us to the second part...
The logs are for the exceptions, which are precisely the information that any non-end-to-end lossy compression (factor analysis or otherwise) will ‘correctly’ throw out as residuals to ignore in favor of the ‘signal’.
Even if lossy compression threw out the exceptions we were interested in as being noise, that would actually still be useful as a form of outlier detection. One could just zoom in on the biggest residuals and check what was going on there.
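That residual trick can be sketched with plain PCA. Everything below (the synthetic low-rank "normal" data, the choice of 2 components, the 5 planted anomalies) is a hypothetical illustration of the idea, not anything from the thread: fit a low-rank model to mostly-regular data, then rank rows by reconstruction error and look at the biggest residuals first.

```python
# Hypothetical sketch: outlier detection via PCA reconstruction residuals.
import numpy as np

rng = np.random.default_rng(0)

# 1000 "normal" records lying near a 2-D subspace of 10-D space...
basis = rng.normal(size=(2, 10))
normal = rng.normal(size=(1000, 2)) @ basis + 0.1 * rng.normal(size=(1000, 10))

# ...plus 5 anomalous records that break the usual low-rank structure.
anomalies = 3.0 * rng.normal(size=(5, 10))
data = np.vstack([normal, anomalies])

# PCA via SVD; keep the top 2 components as the "signal".
centered = data - data.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
components = vt[:2]

# Residual = the part of each row the low-rank model throws away.
reconstructed = centered @ components.T @ components
residuals = np.linalg.norm(centered - reconstructed, axis=1)

# The rows with the largest residuals are the ones to zoom in on;
# here the anomalies were appended at indices 1000..1004.
top5 = np.argsort(residuals)[-5:]
print(sorted(int(i) for i in top5))
```

In this toy setup the five planted anomalies dominate the residuals, which is exactly the "zoom in on the biggest residuals" workflow, though as the next comment notes, on real logs those residuals are mostly benign exceptions rather than bugs.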
Issue is, the logs end up containing ~all the exceptions, including exceptional user behavior, exceptional user setups, and exceptionally error-throwing non-buggy code, but the logs are only useful for bugs/attacks/etc., since the former behaviors are fine and should be supported.