It was widely verified that if you took a set of single-digit numbers and calculated their standard deviation, and then took the same numbers and added a million to each of them, the standard deviation was often different, when it should be exactly the same.
The numerically-unstable stdev is a classic. I’ve run into it in ML stuff. It happens when you try to be clever and compute the stdev in one pass through the data without computing the mean beforehand… Unless you use the really cool black-magic trick that lets you do this in a more stable way.
In an ML context this happens if you want the running stdev of a stream of data you can’t afford to store in memory (so you can’t find the mean and then go back and start from the beginning). Excel has no such excuse.
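The "black-magic trick" being alluded to is presumably Welford's online algorithm, which keeps a running mean and a running sum of squared deviations so the catastrophic cancellation of the naive one-pass formula never happens. A minimal sketch in Python (the function name and example data are mine, not from the thread):

```python
import math

def running_stdev(stream):
    """One-pass population stdev via Welford's online update."""
    n = 0
    mean = 0.0
    m2 = 0.0                          # running sum of squared deviations from the mean
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)      # note: uses the *updated* mean
    return math.sqrt(m2 / n) if n else float("nan")

# Shifting every value by a constant should not change the stdev,
# and with Welford's update it (almost) doesn't:
data = [3, 1, 4, 1, 5, 9, 2, 6]
print(running_stdev(data))                            # population stdev of the digits
print(running_stdev([x + 1_000_000 for x in data]))   # essentially the same value
```

The point of the update is that it only ever squares deviations from the current mean, so the large offset cancels out before anything gets squared.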
As a software developer, I completely expect this; it's just an everyday consequence of limited-precision math. You'll get the same kind of anomalies in almost any programming language, and other kinds of calculations can produce far worse inaccuracies than this (it seems odd to single out standard deviations, which are relatively well-behaved, but maybe Excel used to have a worse algorithm). John Gustafson proposed alternative number formats, first unums and later posits, to mitigate the problem (preferably in hardware), but it's not practical to eliminate it entirely.
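For anyone who hasn't bumped into this before, here is the everyday limited-precision behaviour being described, in plain Python; nothing Excel-specific about it:

```python
# Doubles carry roughly 15-16 significant decimal digits, so adding a huge
# offset silently discards the low-order digits of a small number:
x = 0.1
print((x + 1e16) - 1e16)   # prints 0.0, not 0.1 -- the 0.1 was rounded away

# Even simple decimals are only approximated in binary floating point:
print(0.1 + 0.2 == 0.3)    # False: 0.1 + 0.2 is 0.30000000000000004
```

The same effect is what bites a naive "sum of squares minus square of sum" stdev once the values are shifted far from zero.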
This bug does not exist in Excel for Mac 2011.
Virtue points for checking! For the sake of thoroughness: how many values did you use? A comment from my FB wall suggests this comes from a hack for handling very large data sets.
Update: I just tried it with 100,000 values. Still no bug.
Update 2: I’ve now tried it with 1,000,000 values. Everything is still correct.
I tried a set of 104,857,600 values in Excel 2010 for Windows and the standard deviations diverge at the 9th decimal:
Count 1: 104,857,600
Count 2: 104,857,600
Sum 1: 5,242,888,545
Sum 2: 104,862,842,888,545
Average 1: 50.00008149
Average 2: 1,000,050.00008149
Stdev.p 1: 29.15276094096330000
Stdev.p 2: 29.15276094160490000
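For comparison outside Excel, here is a rough way to reproduce the same kind of divergence with NumPy, contrasting the naive one-pass formula against a two-pass calculation. The helper names are mine, the data is random rather than the set used above, and it's scaled down to 10 million values so it runs quickly; the exact decimal where the naive results start to disagree will vary.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000_000                        # scaled down from the 104,857,600 used above
x = rng.uniform(0.0, 100.0, size=n)   # values roughly in the same range as the test
y = x + 1_000_000.0                   # the same values shifted by a million

def naive_pstdev(a):
    """One-pass formula sqrt(E[a^2] - E[a]^2): fast but numerically fragile."""
    return np.sqrt(np.mean(a * a) - np.mean(a) ** 2)

def two_pass_pstdev(a):
    """Subtract the mean first, then average the squared deviations."""
    d = a - a.mean()
    return np.sqrt(np.mean(d * d))

print("naive,    unshifted:", naive_pstdev(x))
print("naive,    shifted:  ", naive_pstdev(y))      # drifts in the trailing decimals
print("two-pass, unshifted:", two_pass_pstdev(x))
print("two-pass, shifted:  ", two_pass_pstdev(y))   # agrees to many more digits
```

The two-pass version holds up under the shift because the offset is removed before squaring, which is presumably what the more recent Excel versions do as well.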
I used 100 values. How many is “very large”? 1,000? 1,000,000? More?