Vitor comments on Open Thread March 21 - March 27, 2016

Vitor 22 Mar 2016 16:18 UTC
19 points
Your problem is called a clustering problem. First of all, you need to decide how you measure your error (information loss, as you call it). Typical error norms are l1 (sum of absolute errors), l2 (sum of squared errors, which penalizes larger errors more) and l-infinity (maximum error).
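To make the three norms concrete, here is a minimal sketch. It assumes each category is summarized by a single representative value and that every data point is charged the distance to its nearest representative; the function name `quantization_errors` and that setup are illustrative assumptions, not something from the original question.

```python
def quantization_errors(values, representatives):
    """Summarize the error of replacing each value by its nearest representative.

    Hypothetical helper: `representatives` holds one number per category.
    """
    # Absolute error of each value against the closest representative.
    errors = [min(abs(v - r) for r in representatives) for v in values]
    return {
        "l1": sum(errors),                # sum of absolute errors
        "l2": sum(e * e for e in errors), # sum of squared errors
        "linf": max(errors),              # largest single error
    }


if __name__ == "__main__":
    print(quantization_errors([1, 2, 10], [1.5, 10]))
```

Which norm you pick changes which partition comes out as best: l-infinity only cares about the worst-placed point, while l1 and l2 trade off errors across all points.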

Once you select a norm, there always exists a partition that minimizes your error, and there are a bunch of heuristic algorithms for finding a good one, e.g. k-means clustering. Luckily, since your data is one-dimensional and you have very few categories, you can just brute force it: for 4 categories you need to place 3 boundaries correctly, and naively trying all possible positions means checking only on the order of n^3 placements.
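A rough sketch of the brute-force idea, assuming the l2 error with each category represented by its mean (the names `sse` and `best_partition` are made up for illustration). It sorts the data, tries every way of placing the k - 1 boundaries between neighbouring points, and keeps the cheapest split. This naive version recomputes each category's cost from scratch, so it does extra work per placement; prefix sums would avoid that.

```python
from itertools import combinations


def sse(group):
    """Sum of squared deviations from the group's mean (l2 error)."""
    mean = sum(group) / len(group)
    return sum((x - mean) ** 2 for x in group)


def best_partition(values, k=4):
    """Try every placement of k - 1 boundaries in the sorted data.

    Returns (cost, groups) for the split with the smallest total l2 error.
    """
    data = sorted(values)
    n = len(data)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of values")

    best_cost, best_groups = float("inf"), None
    # Each cut index c means "a boundary sits just before data[c]".
    for cuts in combinations(range(1, n), k - 1):
        edges = (0, *cuts, n)
        groups = [data[a:b] for a, b in zip(edges, edges[1:])]
        cost = sum(sse(g) for g in groups)
        if cost < best_cost:
            best_cost, best_groups = cost, groups
    return best_cost, best_groups


if __name__ == "__main__":
    cost, groups = best_partition([1, 2, 10, 11, 20, 21, 30, 31], k=4)
    print(cost, groups)
```

To use a different norm, swap `sse` for another per-category cost (for example, absolute deviations from the median for l1) and, for l-infinity, combine the categories with `max` instead of `sum`.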

Hope this helps.
- Stefan_Schubert 22 Mar 2016 20:45 UTC
  5 points
  Thanks a lot! Yes, super-useful.
- gjm 22 Mar 2016 16:41 UTC
  5 points
  A possibly relevant paper for anyone wanting to do this in one dimension on a dataset large enough that they care about efficiency.