Steven Byrnes comments on SGD’s Bias

Steven Byrnes 19 May 2021 12:19 UTC
LW: 18 AF: 13
I think the “drift from high-noise to low-noise” thing is more subtle than you’re making it out to be… Or at least, I remain to be convinced. Like, has anyone else made this claim, or is there experimental evidence?
In the particle diffusion case, you point out correctly that if there’s a gradient in D caused by a temperature gradient, it causes a concentration gradient. But I believe that if there’s a gradient in D caused by something other than a temperature gradient, then it doesn’t cause a concentration gradient. Like, take a room with a big pile of rocks in it. Intuitively you can say “when air molecules get deep into the pore spaces of the rock pile, they’re stuck and can’t easily get out, therefore I expect higher air concentration in the pore spaces of the rock pile than the open part of the room”, but that would be wrong: the concentration is the same (as required by thermodynamics when the temperature is constant).
Another example, which is fun albeit more abstract: The classical diode-resistor generator circuit (see Gunn, Sokolov; email me if you want the PDFs). If you connect a resistor at nonzero temperature to a diode at absolute zero, the diode rectifies the resistor’s Johnson noise and you get a net current in the diode’s forward direction. Conversely, if you connect a resistor at absolute zero to a diode at nonzero temperature, you get a net current in the diode’s reverse direction! And (obviously), if the temperatures of the diode and resistor are the same, there’s no net current. The dynamics look like a 1D drift-diffusion thing, where the voltage is bouncing around by Johnson noise, while getting pulled back towards zero. But, by the Johnson noise equation, the voltage bounces around a lot more when it has one sign than the opposite sign (assuming the diode IV curve has an infinitely sharp bend at V=0), because the total circuit resistance is different. And yet, the voltage does not systematically go to the less-bouncy side; instead it averages to zero. (Note: the frequency spectrum of the fluctuations changes too depending on the sign of the voltage.) I actually did a time-domain simulation of the diode-resistor generator once, a couple jobs ago (I was writing a paper kinda related to it), but was never 100% satisfied that I understood why temperature-related ∇D’s can do things that constant-temperature ∇D’s can’t.
Very speculatively, I wonder if it matters whether you evaluate the gradient at the source of a random step, vs at the destination??
- johnswentworth 19 May 2021 15:22 UTC
  LW: 5 AF: 3
  I’m still wrapping my head around this myself, so this comment is quite useful.
  Here’s a different way to set up the model, where the phenomenon is more obvious.
  Rather than Brownian motion in a continuous space, think about a random walk in a discrete space. For simplicity, let’s assume it’s a 1D random walk (aka birth-death process) with no explicit bias (i.e. when the system leaves state $k$, it’s equally likely to transition to $k + 1$ or $k - 1$). The rate $λ_{k}$ at which the system leaves state $k$ serves a role analogous to the diffusion coefficient (with the analogy becoming precise in the continuum limit, I believe). Then the steady-state probabilities of state $k$ and state $k - 1$ satisfy
  $p_{k} λ_{k} = p_{k - 1} λ_{k - 1}$
  … i.e. the flux from values $k$-and-above to values-below-$k$ is equal to the flux in the opposite direction. (Side note: we need some boundary conditions in order for the steady-state probabilities to exist in this model.) So, if $λ_{k} > λ_{k - 1}$, then $p_{k} < p_{k - 1}$: the system spends more time in lower-diffusion states (locally). Similarly, if the system’s state is initially uniformly-distributed, then we see an initial flux from higher-diffusion to lower-diffusion states (again, locally).
  Going back to the continuous case: this suggests that your source vs destination intuition is on the right track. If we set up the discrete version of the pile-of-rocks model, air molecules won’t go into the rock pile any faster than they come out, whereas hot air molecules will move into a cold region faster than cold molecules move out.
  I haven’t looked at the math for the diode-resistor system, but if the voltage averages to 0, doesn’t that mean that it does spend more time on the lower-noise side? Because presumably it’s typically further from zero on the higher-noise side. (More generally, I don’t think a diffusion gradient means that a system drifts one way on average, just that it drifts one way with greater-than-even probability? Similar to how a bettor maximizing expected value with repeated independent bets ends up losing all their money with probability 1, but the expectation goes to infinity.)
  Also, one simple way to see that the “drift” interpretation of the diffusion-induced drift term in the post is correct: set the initial distribution to uniform, and see what fluxes are induced. In that case, only the two drift terms are nonzero, and they both behave like we expect drift terms to behave—i.e. probability increases/decreases where the divergence of the drift terms is positive/negative.
  - Steven Byrnes 19 May 2021 19:22 UTC
    LW: 4 AF: 3
    That makes sense. Now it’s coming back to me: you zoom your microscope into one tiny nm^3 cube of air. In a right-to-left temperature gradient you’ll see systematically faster air molecules moving rightward and slower molecules moving leftward, because they’re carrying the temperature from their last collision. Whereas in uniform temperature, there’s “detailed balance” (just as many molecules going along a path vs going along the time-reversed version of that same path, and with the same speed distribution).
    Thinking about the diode-resistor thing more, I suspect it would be a waste-of-time nerd-sniping rabbit hole, especially because of the time-domain aspects (the fluctuations are faster on one side vs the other I think) which don’t have any analogue in SGD. Sorry to have brought it up.
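For anyone who wants to poke at the discrete birth-death model from the thread numerically, here is a minimal sketch, not a definitive implementation. It integrates the master equation from a uniform start under two hypothetical choices of hop rate: one set by the state being left (the $λ_{k}$ of the balance equation $p_{k} λ_{k} = p_{k - 1} λ_{k - 1}$), and one set by the edge, identical in both directions, which is only one assumed way to discretize the constant-temperature rock pile. The rate values in `LAM`, the reflecting boundaries, the step size, and all function names are illustrative assumptions, not anything taken from the post.

```python
# Illustrative leaving rates (hypothetical numbers): low "diffusion" on the
# left, high on the right.
LAM = [1.0, 1.0, 2.0, 4.0, 4.0, 4.0]


def relax(hop_rate, n_states, dt=0.01, n_steps=50_000):
    """Euler-integrate the master equation from a uniform start.

    hop_rate(i, j) is the rate of jumping from state i to neighbour j.
    Jumps past either end are simply disallowed (reflecting boundaries,
    an assumption made here so that a steady state exists).
    Returns (final probabilities, net rightward fluxes at t = 0).
    """
    p = [1.0 / n_states] * n_states
    initial_flux = None
    for _ in range(n_steps):
        # Net rightward probability flux across each edge (i, i + 1).
        flux = [p[i] * hop_rate(i, i + 1) - p[i + 1] * hop_rate(i + 1, i)
                for i in range(n_states - 1)]
        if initial_flux is None:
            initial_flux = flux
        for i, f in enumerate(flux):
            p[i] -= f * dt
            p[i + 1] += f * dt
    return p, initial_flux


def source_rate(i, j):
    # Rate set by the state being left: lambda_i split evenly left/right.
    return LAM[i] / 2


def edge_rate(i, j):
    # Rate set by the edge itself, so it is the same in both directions
    # (a hypothetical stand-in for the constant-temperature rock pile).
    return min(LAM[i], LAM[j]) / 2


p_source, flux_source = relax(source_rate, len(LAM))
p_edge, flux_edge = relax(edge_rate, len(LAM))

print("source-set rates, p_k * lambda_k:",
      [round(p * lam, 4) for p, lam in zip(p_source, LAM)])
print("source-set rates, initial flux:", [round(f, 4) for f in flux_source])
print("edge-set rates, p_k:", [round(p, 4) for p in p_edge])
```

Under these assumptions, the source-set rates should settle to a constant $p_{k} λ_{k}$, with the initial flux pointing toward the low-$λ$ states, while the edge-set rates leave the uniform distribution untouched. That matches the source vs destination distinction discussed above: what matters is whether the hop rate depends on where the step starts, or is symmetric between the two ends of the step.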