Great explanation! I was linked here by someone after wondering why linear regression is asymmetric. While a quick Google search or ChatGPT could tell me that the two methods minimize different things, the advantages of your post are:

- The pictures
- The explanation of why minimizing different things gets you slopes differing in this specific way (that is, far outliers are punished heavily)
- A connection to PCA that is nice and simply explained
Thanks!
I think you could’ve done better with integration by parts.
In physics, integration by parts is usually applied to a definite integral in which you can neglect the boundary (uv) term. Integration by parts then reads: "the integral of u dv equals the integral of -v du"; that is, you can trade which factor you differentiate in a product, as long as the product uv vanishes (or is negligible) on the boundary.
Common examples arise when you integrate over some big volume, since most physical quantities are very small far away from the stuff.
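Spelled out in symbols (a standard statement of the identity, not something from the post), the one-dimensional version looks like:

```latex
% Full integration by parts over [a, b]:
\int_a^b u \, dv = \bigl[uv\bigr]_a^b - \int_a^b v \, du

% If uv vanishes (or is negligible) at the boundary, the bracket drops out:
\int_a^b u \, dv \approx -\int_a^b v \, du
```

In higher dimensions the bracket becomes a surface integral over the boundary of the volume, which is why fields that decay at infinity make it negligible.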
I also think the intuition behind Bayes' rule is worth including, as it's usually interpreted here on LW: it provides the updating rule posterior odds = prior odds * likelihood ratio, and thereby also formalizes how good a piece of evidence is. As for the derivation from P(A|B), defined as P(A and B)/P(B): I think this is best described by saying that P(A|B) is the probability of A once you know B, so you take the mass of the worlds where both A and B are true and compare it to your new total mass, which is the mass of the worlds where B is true. The former is really just the "mass of A and B", so you are done.
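Written out (standard notation; here ¬A means "not A"), the odds form is:

```latex
% Posterior odds = prior odds * likelihood ratio:
\frac{P(A \mid B)}{P(\neg A \mid B)}
  = \frac{P(A)}{P(\neg A)} \cdot \frac{P(B \mid A)}{P(B \mid \neg A)}
```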
Now, P(A and B) = P(B)P(A|B), which I think of as "first take the probability that B is true, then, given that we are in this set of worlds, take the probability that A is true". Essentially, this translates from locating sets to probabilities.
From here, Bayes' theorem is the simple fact that "A and B" = "B and A". So P(B)P(A|B) = P(A and B) = P(A)P(B|A). If you draw a square divided into 4 rectangles, where the first row is P(A), the second row is P(-A), the first column is P(B), and the second is P(-B), and each rectangle represents a possibility like P(A and -B), then this equation just expresses the rectangle P(A and B) as (rectangle as a fraction of its row) * row = (rectangle as a fraction of its column) * column. Divide by P(B) (that is, the column) to get Bayes' law.
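The chain of equalities, written out:

```latex
% The symmetry of "and" gives two factorizations of the same joint probability:
P(B)\,P(A \mid B) = P(A \text{ and } B) = P(A)\,P(B \mid A)

% Dividing both sides by P(B) gives Bayes' law:
P(A \mid B) = \frac{P(A)\,P(B \mid A)}{P(B)}
```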
For the sine rule, I think it also helps to show that the fraction a/sin(A) (a side over the sine of its opposite angle) is the diameter of the circumcircle. Wikipedia has good pictures.
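In symbols, with the usual convention that sides a, b, c are opposite angles A, B, C and R is the circumradius:

```latex
% Law of sines with the circumcircle made explicit:
\frac{a}{\sin A} = \frac{b}{\sin B} = \frac{c}{\sin C} = 2R
```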
For an extra math fact that totally doesn't need to be in the post: it is interesting that for spherical triangles, the law of sines just needs to be modified so that you take the sine of the side lengths as well. In fact, you can do something similar in hyperbolic space (using sinh), and there is a Taylor-series form, involving the curvature, of a generalized sine that makes the law of sines still true in any constant-curvature space (you can find this on the same Wikipedia page).
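For reference, these are the standard statements from that Wikipedia page (same convention as above, with sides a, b, c opposite angles A, B, C):

```latex
% Spherical law of sines (sides measured as arc angles on the unit sphere):
\frac{\sin a}{\sin A} = \frac{\sin b}{\sin B} = \frac{\sin c}{\sin C}

% Hyperbolic law of sines:
\frac{\sinh a}{\sin A} = \frac{\sinh b}{\sin B} = \frac{\sinh c}{\sin C}

% Generalized sine for curvature K (recovers sin for K = 1, x for K = 0, sinh for K = -1),
% for which sin_K(a)/sin A = sin_K(b)/sin B = sin_K(c)/sin C holds in constant curvature K:
\sin_K(x) = x - \frac{K x^3}{3!} + \frac{K^2 x^5}{5!} - \cdots
```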