johnswentworth comments on SGD’s Bias

johnswentworth May 19, 2021, 3:22 PM
LW: 5 AF: 3
I’m still wrapping my head around this myself, so this comment is quite useful.
Here’s a different way to set up the model, where the phenomenon is more obvious.
Rather than Brownian motion in a continuous space, think about a random walk in a discrete space. For simplicity, let’s assume it’s a 1D random walk (aka birth-death process) with no explicit bias (i.e. when the system leaves state $k$, it’s equally likely to transition to $k + 1$ or $k - 1$). The rate $\lambda_{k}$ at which the system leaves state $k$ serves a role analogous to the diffusion coefficient (with the analogy becoming precise in the continuum limit, I believe). Then the steady-state probabilities of state $k$ and state $k - 1$ satisfy
$p_{k} \lambda_{k} = p_{k - 1} \lambda_{k - 1}$
… i.e. the flux from values-$k$-and-above to values-below-$k$ is equal to the flux in the opposite direction. (Side note: we need some boundary conditions in order for the steady-state probabilities to exist in this model.) So, if $\lambda_{k} > \lambda_{k - 1}$, then $p_{k} < p_{k - 1}$: the system spends more time in lower-diffusion states (locally). Similarly, if the system’s state is initially uniformly distributed, then we see an initial flux from higher-diffusion to lower-diffusion states (again, locally).
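Here is a minimal numerical sketch of the discrete model, in case it helps to see the numbers. The four-state chain, the specific rates, and the reflecting boundaries (a jump off either end is simply rejected) are all assumptions made for illustration; the function names are hypothetical.

```python
def steady_state(rates):
    """Steady state of the unbiased walk: p_k * lambda_k is constant,
    so p_k is proportional to 1 / lambda_k (assumes reflecting boundaries)."""
    weights = [1.0 / lam for lam in rates]
    total = sum(weights)
    return [w / total for w in weights]


def downward_flux(p, rates):
    """Net probability flux from state k to state k-1, for k = 1..n-1.

    Each state is left at rate lambda_k, and half of those jumps go each way,
    so the net flux across the edge is (p_k*lambda_k - p_{k-1}*lambda_{k-1}) / 2.
    """
    return [(p[k] * rates[k] - p[k - 1] * rates[k - 1]) / 2.0
            for k in range(1, len(rates))]


# Illustrative rates: "diffusion" increases with k.
rates = [1.0, 2.0, 4.0, 8.0]
p_steady = steady_state(rates)
uniform = [1.0 / len(rates)] * len(rates)

print("steady state:         ", p_steady)
print("flux at steady state: ", downward_flux(p_steady, rates))
print("flux from uniform:    ", downward_flux(uniform, rates))
```

With rates increasing in $k$, the steady-state probabilities decrease in $k$, the net flux across every edge vanishes at steady state, and a uniform start gives a positive flux toward the lower-rate states across every edge.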
Going back to the continuous case: this suggests that your source vs destination intuition is on the right track. If we set up the discrete version of the pile-of-rocks model, air molecules won’t go into the rock pile any faster than they come out, whereas hot air molecules will move into a cold region faster than cold molecules move out.
I haven’t looked at the math for the diode-resistor system, but if the voltage averages to 0, doesn’t that mean that it does spend more time on the lower-noise side? Because presumably it’s typically further from zero on the higher-noise side. (More generally, I don’t think a diffusion gradient means that a system drifts one way on average, just that it drifts one way with greater-than-even probability? Similar to how a bettor maximizing expected value with repeated independent bets ends up losing all their money with probability 1, but the expectation goes to infinity.)
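To make the bettor analogy concrete, here is a small sketch under assumed numbers: a double-or-nothing bet with win probability 0.6, where the expected-value maximizer stakes everything each round. The function name and parameters are hypothetical, chosen only for illustration.

```python
def all_in_bettor(p_win, rounds, payout=2.0):
    """Bettor starts with wealth 1 and stakes everything every round.

    A win (probability p_win) multiplies wealth by `payout`;
    a single loss leaves wealth at 0 for good.
    Returns (probability of still having money, expected wealth).
    """
    survival = p_win ** rounds               # must win every single round
    expected = (p_win * payout) ** rounds    # per-round expectation compounds
    return survival, expected


for n in (1, 10, 50):
    survival, expected = all_in_bettor(0.6, n)
    print(f"rounds={n:3d}  P(not broke)={survival:.3e}  E[wealth]={expected:.3e}")
```

The survival probability shrinks geometrically toward zero while the expectation grows geometrically, so the mean is carried entirely by a vanishingly unlikely winning streak.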
Also, one simple way to see that the “drift” interpretation of the diffusion-induced drift term in the post is correct: set the initial distribution to uniform, and see what fluxes are induced. In that case, only the two drift terms are nonzero, and they both behave like we expect drift terms to behave—i.e. probability increases/decreases where the divergence of the drift terms is positive/negative.
- Steven Byrnes May 19, 2021, 7:22 PM
  LW: 4 AF: 3
  That makes sense. Now it’s coming back to me: you zoom your microscope into one tiny nm³ cube of air. In a right-to-left temperature gradient you’ll see systematically faster air molecules moving rightward and slower molecules moving leftward, because they’re carrying the temperature from their last collision. Whereas in uniform temperature, there’s “detailed balance” (just as many molecules going along a path vs going along the time-reversed version of that same path, and with the same speed distribution).
  Thinking about the diode-resistor thing more, I suspect it would be a waste-of-time nerd-sniping rabbit hole, especially because of the time-domain aspects (the fluctuations are faster on one side vs the other I think) which don’t have any analogue in SGD. Sorry to have brought it up.