The random jittering reminds me of the random movements of the stock market: as new information trickles in, the estimate of the optimal point jitters around noisily rather than following a smooth trajectory. If the value being estimated is Utility(action A) - Utility(action B), then we would expect the agent to flip back and forth between the two actions whenever the estimate is near zero, much like a random walk repeatedly crossing the axis.
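To make the random-walk picture concrete, here is a minimal sketch of one way it could play out. Everything in it is an assumption made for illustration: the agent sees one noisy Gaussian observation of Utility(A) - Utility(B) per step, keeps a running mean as its estimate, and picks A whenever that estimate is positive. The names `simulate_choices` and `count_switches` are hypothetical.

```python
import random

def simulate_choices(n_steps, true_diff=0.0, noise=1.0, seed=0):
    """Return the action chosen at each step as noisy evidence arrives.

    Assumed model: each step yields one Gaussian observation of
    Utility(A) - Utility(B); the agent's estimate is the running mean.
    """
    rng = random.Random(seed)
    total = 0.0
    choices = []
    for t in range(1, n_steps + 1):
        total += rng.gauss(true_diff, noise)
        estimate = total / t  # running estimate of Utility(A) - Utility(B)
        choices.append("A" if estimate > 0 else "B")
    return choices

def count_switches(choices):
    """Count how many times the chosen action changes between steps."""
    return sum(1 for prev, cur in zip(choices, choices[1:]) if prev != cur)

# True difference of zero versus a clear gap in favor of A.
near_zero = count_switches(simulate_choices(10_000, true_diff=0.0))
clear_gap = count_switches(simulate_choices(10_000, true_diff=1.0))
print("switches when the true difference is zero:", near_zero)
print("switches when A is clearly better:", clear_gap)
```

The sign of the running mean is the sign of the cumulative sum, so when the true difference is zero the chosen action follows the sign of a driftless random walk, which is exactly the axis-crossing behavior described above. With a real gap between the utilities, the sum drifts away from zero and the switching should die out early.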