It’s not a mathematical argument, but this is where I first came across such an analogy between the training of neural networks and evolution, and a potential interpretation of what it means in terms of sample-(in)efficiency.
I thought about Agency Q4 (counterargument to Pearl) recently, but couldn’t come up with anything convincing. Does anyone have a strong view/argument here?
I like the idea a lot.
However, I really need simple systems in my work routine. Things like “hitting a stopwatch, dividing by three, and carrying over previous rest time” already feel like a lot. Even though it’s just a few seconds, I prefer systems that take as little energy as possible to maintain.
What I came up with is a simple shell script: just start it at the beginning of work, and hit any key whenever I switch from work to rest or vice versa. It automatically keeps track of my break times.
I don’t have Linux at home, but what I tried online ( https://www.onlinegdb.com/online_bash_shell ) is the following. (I am terrible at shell scripting, so this is definitely not optimal, but I want to try something like this in the coming weeks. One might want an additional warning or alarm sound when the accumulated rest time drops below zero, but for me just “keeping track” is enough, I think.)
convertsecs() {
  # Format a number of seconds as HH:MM:SS
  ((h=${1}/3600))
  ((m=(${1}%3600)/60))
  ((s=${1}%60))
  printf "%02d:%02d:%02d\n" $h $m $s
}

function flex_pomo() {
  current=0   # length of the current work/rest interval
  resttime=0  # accumulated rest time (earned while working, spent while resting)
  total=0     # total work time
  while true; do
    # Work phase: keep counting until any key is pressed
    until read -s -n 1 -t 0.01; do
      sleep 3
      current=$(( $current + 3 ))
      resttime=$(( $resttime + 1 ))   # earn 1 second of rest per 3 seconds of work
      total=$(( $total + 3 ))
      printf "\rCurrently working: Current interval: $(convertsecs $current), accumulated rest: $(convertsecs $resttime), total worktime: $(convertsecs $total) "
    done
    printf "\nSwitching to break\n"
    current=0
    # Rest phase: keep counting until any key is pressed
    until read -s -n 1 -t 0.01; do
      sleep 3
      current=$(( $current + 3 ))
      resttime=$(( $resttime - 3 ))   # spend the rest budget
      printf "\rCurrently resting: Current interval: $(convertsecs $current), accumulated rest: $(convertsecs $resttime), total worktime: $(convertsecs $total) "
    done
    printf "\nSwitching to work\n"
    current=0
  done
}

flex_pomo
Thanks, I finally got it. What I just now fully understood is that the final inequality holds with high probability (i.e., as you say, is the data), while the learning bound or loss reduction is given for .
Thanks, I was wondering what people referred to when mentioning PAC-Bayes bounds. I am still a bit confused. Could you explain how and depend on (if they do) and how to interpret the final inequality in this light? Particularly I am wondering because the bound seems to be best when . Minor comment: I think ?
The main thing that caught my attention was that the random variables are often assumed to be independent. I am not sure whether this is already included, but if one wants to allow adding, multiplying, taking mixtures, etc., of random variables that are not independent, one way to do it is via copulas. For sampling-based methods, working with copulas is a way of incorporating a moderate variety of possible dependence structures at little additional computational cost.
The basic idea is to take the dependence structure of some tractable multivariate random variable (e.g., one we can sample from quickly, like a multivariate Gaussian) and transfer it to the individual one-dimensional distributions one would like to add, multiply, etc.
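To make this a bit more concrete, here is a minimal sketch of the Gaussian-copula version in Python with numpy/scipy (the marginals, correlation, and sample size are just placeholders I made up): sample from a correlated Gaussian, turn each coordinate into a uniform via the Gaussian CDF, and then push the uniforms through the inverse CDFs of the marginals you actually want.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
corr = np.array([[1.0, 0.7],
                 [0.7, 1.0]])     # dependence structure borrowed from a 2d Gaussian

# 1. Sample from the tractable multivariate distribution (here: Gaussian)
z = rng.multivariate_normal(mean=np.zeros(2), cov=corr, size=10_000)

# 2. Push each coordinate through its own CDF to get dependent uniforms
u = stats.norm.cdf(z)

# 3. Push the uniforms through the inverse CDFs of the marginals we care about
xs = stats.expon(scale=2.0).ppf(u[:, 0])   # e.g. an exponential marginal
ys = stats.gamma(a=3.0).ppf(u[:, 1])       # e.g. a gamma marginal

# xs and ys have the prescribed marginals but are dependent, so sample-wise
# combinations like xs + ys or xs * ys reflect that dependence
print(np.corrcoef(xs, ys)[0, 1])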
Another intuition I have often found useful: KL divergence behaves more like the square of a metric than like a metric.
The clearest indicator of this is that KL divergence satisfies a kind of Pythagorean theorem, established in a paper by Csiszár (1975), see https://www.jstor.org/stable/2959270#metadata_info_tab_contents . The intuition is exactly the same as in the Euclidean case: if we project a point A onto a convex set S (say the projection is B), and C is another point in S, then the angle of the triangle ABC at B is at least 90 degrees, which by the law of cosines gives |A−C|^2 >= |A−B|^2 + |B−C|^2. The same holds if we project with respect to KL divergence: we end up with D_KL(C||A) >= D_KL(B||A) + D_KL(C||B).
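To make the Pythagorean statement concrete, here is a small numerical check (a toy example of my own in Python with numpy/scipy; the alphabet, distributions, and mean constraint are made up). I project A onto the linear family S = {Q : E_Q[x] = m}, for which the I-projection is an exponential tilting of A and the Pythagorean relation in fact holds with equality:

import numpy as np
from scipy.optimize import brentq

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

x = np.arange(5.0)                          # finite alphabet {0,1,2,3,4}
A = np.array([0.4, 0.3, 0.15, 0.1, 0.05])   # the point we project
m = 2.5                                     # S = {Q : E_Q[x] = m}, a convex set

def tilt(lam):
    # Exponential tilting A(x) * exp(lam * x), renormalized
    w = A * np.exp(lam * x)
    return w / w.sum()

# Choose the tilting parameter so the projected distribution has mean m
lam = brentq(lambda l: tilt(l) @ x - m, -20, 20)
B = tilt(lam)                               # I-projection of A onto S

C = np.array([0.1, 0.15, 0.2, 0.25, 0.3])   # another element of S (mean exactly 2.5)

print(kl(C, A), kl(C, B) + kl(B, A))        # agree up to numerical error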
This square-like behavior also has implications if you think about things like sample efficiency: instead of the usual 1/sqrt(n) rate, convergence rates measured in KL divergence often behave like 1/n.
This is also reflected in the relation between KL divergence and other distances between probability measures, like total variation or Wasserstein distance. The most prominent example in this regard is Pinsker’s inequality, which states that the total variation distance between two measures is bounded by a constant times the square root of the KL divergence between them.
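To see the inequality in action, here is a tiny numerical sanity check (my own toy numbers in Python with numpy; nothing here is specific to any particular setting), using the convention TV(P,Q) <= sqrt(KL(P||Q)/2):

import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def tv(p, q):
    return 0.5 * float(np.abs(p - q).sum())

rng = np.random.default_rng(1)
for _ in range(5):
    p, q = rng.dirichlet(np.ones(6)), rng.dirichlet(np.ones(6))
    print(f"TV = {tv(p, q):.3f}  <=  sqrt(KL/2) = {np.sqrt(kl(p, q) / 2):.3f}")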