I’m confused by the setup. Let’s consider the simplest case: fitting points in the plane, y as a function of x. If I have three datapoints and I fit a quadratic to them, I have a dimension 0 space of minimizers of the loss function: the unique parabola through those three points (assume they’re not on top of each other). Since I have three parameters in a quadratic, I assume that this means the effective degrees of freedom of the model is 3 according to this post. If I instead fit a quartic, I now have a dimension 2 space of minimizers and 5 parameters, so I think you’re saying degrees of freedom is still 3. And so the DoF would be 3 for all degrees of polynomial models above linear. But I certainly think that we expect quadratic models to generalize better than 19th degree polynomials when fit to just three points.
I think the objection to this example is that the relevant function to minimize is not loss on the training data but something else? The loss it would have on ‘real data’? That seems to make more sense of the post to me, but if that were the case, then I think any minimizer of that function would be equally good at generalizing by definition. Another candidate would be the parameter-function map you describe, which seems to be the relevant map whose singularities we are studying, but it’s not well-defined to ask for minima (or level sets) of that at all. So I don’t think that’s right either.
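To make the counting concrete, here’s a small numerical check (my own sketch, not from the post): for three distinct training inputs, the zero-training-loss parameter sets of a degree-d polynomial form an affine space of dimension (d+1) − 3, so “parameters minus dimension of minimizers” comes out to 3 for every degree ≥ 2.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0])  # three distinct training inputs

for degree in (2, 3, 4):
    # Design matrix of the polynomial model: columns x^degree, ..., x, 1.
    V = np.vander(x, degree + 1)
    n_params = degree + 1
    # Parameter vectors with zero training loss form an affine space
    # whose dimension is the nullity of V.
    dim_minimizers = n_params - np.linalg.matrix_rank(V)
    print(degree, n_params, dim_minimizers, n_params - dim_minimizers)
```

For degrees 2, 3, 4 this prints minimizer-set dimensions 0, 1, 2, with the last column (the proposed DoF count) equal to 3 every time.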
Thanks for the clarification! In fact, that opinion wasn’t even one of the ones I had considered you might have.
I simultaneously would have answered ‘no,’ would expect most people in my social circles to answer no, think it is clear that this being a near-universal is a very bad sign, and also that 25.6% is terrifying. It’s something like ‘there is a right amount of the thing this is a proxy for, and that very much is not it.’
At the risk of being too honest, I find passages written like this horribly confusing and never know what you mean when you write like this. What is the “this” that is near-universal: answering “no,” like you and your friends, or answering “yes,” like most of the survey respondents? Is 25.6% terrifying because you think it is high or because you think it is low? And what thing do you think “this” is a proxy for?
For me, the survey question itself seems bad because it’s very close to two radically different ideas:
- I base my self-worth on my parents’ judgement of me.
- My parents are kind, intelligent people whose judgement is generally of very high quality. Since they are also biased towards positive views of me, if they judged me poorly then I would take that as serious evidence that I am not living up to my own aspirations.
The first sounds unhealthy. The second sounds healthy—at least assuming that one’s parents are in fact kind, intelligent, and generally positively disposed to their children by default. I’m not confident which of the two a “yes” respondent is agreeing to or a “no” is disagreeing with.
Thanks. I think I’ve been tripped up by this terminology more than once now.
Not sure that I understand your claim here about optimization. An optimizer is presumably given some choice of possible initial states to choose from to achieve its goal (otherwise it cannot interact at all). In which case, the set of accessible states will depend upon the chosen initial state, and so the optimizer can influence long-term behavior and choose whatever best matches its desires.
Why would CZ tweet out that he was starting to sell his FTT? Surely that would only decrease the amount he could recover on his sales?
I agree, I was just responding to your penultimate sentence: “In fact, if you could know without labeling generated data, why would you generate something that you can tell is bad in the first place?”
Personally, I think it’s kind of exciting to be part of what might be the last breath of purely human writing. Also, depressing.
Surely the problem is that someone else is generating it—or more accurately lots of other people generating it in huge quantities.
I work in a related field and found this a helpful overview that filled in some gaps of my knowledge that I probably should have known already and I’m looking forward to the follow ups. I do think that this would likely be a very hard read for a layman who wasn’t already pretty familiar with genetics and you might consider making an even more basic version of this. Lots of jargon is dropped without explanation, for example.
Your graph shows an ~40% risk compared to the normal day in that age group. Using their risk ratio, you would need about 25 times the child pedestrian activity to achieve that risk reduction. That could be the case, but I’m not certain. I’m not even that confident that you’d get the >10x needed to ensure a decrease in risk. Kids tend to go to hot spots for trick-or-treating, so the really busy streets that get >25x and spring to mind easily might be hiding the (relatively) depleted streets elsewhere that account for a larger fraction of typical walking. Hence I think your presentation is optimistic: it’s right to push back on the raw numbers, but I don’t think it’s clear that Halloween is substantially safer than other nights per pedestrian-hour, as you claim.
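The arithmetic behind the 25x and >10x figures, as a quick sketch (the 10x death ratio and 40% target are the numbers quoted in this thread, not something I measured):

```python
# Per-pedestrian risk ratio = (death ratio) / (activity ratio), so the
# activity multiple needed to hit a given per-pedestrian risk is:
death_ratio = 10.0   # Halloween vs. control-day deaths in the age group
target_risk = 0.4    # ~40% per-pedestrian risk claimed by the graph

activity_for_target = death_ratio / target_risk  # activity multiple for 40% risk
break_even_activity = death_ratio / 1.0          # below this, Halloween is riskier

print(activity_for_target, break_even_activity)  # 25.0 10.0
```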
I also read the denominator problem differently. I took your argument to claim that the 5x number is a lower bound for “trick-or-treating streets compared to the same streets on a typical night,” and for that, it’s definitely true. But then you had to gloss over the fact that we’re comparing entire days (and non-trick-or-treating streets), and it’s much less clear that 5x holds for all-of-Halloween compared to all-of-another-day. Therefore, their analysis justified using your 5x number while I think your analysis was stretching the truth.
While I appreciate the analysis, I also recently saw this article circulating: https://jamanetwork.com/journals/jamapediatrics/article-abstract/2711459
It compares just 6pm–midnight on Halloween versus the corresponding time one week earlier and one week later. They estimate a 10x increase in deaths in children aged 4–8—see Figure 1. This doesn’t look like subgroup fishing, since the increase in the 9–12 group is also quite large (6x). By your 5x correction factor, Halloween would still be more dangerous than other days for kids. I still think it could be true that Halloween is less dangerous, since this hasn’t measured pedestrian activity, and trick-or-treating really might be a greater-than-10x increase in 4–8 year olds out on the street. But this definitely makes it look less good to me than your presentation did.
Gene drives (i.e., genes that force their own propagation) do arise in nature. There are “LINE” genes that apparently make up over 20% of our genome: they encode RNA that encodes a protein that takes its own RNA and copies it back into your DNA at random locations, thereby propagating itself even more than our engineered gene drives do. With it taking up that much of our genome, I could imagine something like that killing off a species, though I’m failing to find a specific example. These are examples of selfish genes, so that might be where to read more.
It only causes female sterility, so the males keep passing it on. It reaches the whole population because the gene encodes a protein that affects the DNA and ensures its inheritance, rather than the usual fifty-fifty. If a modified and an unmodified organism mate, then their offspring have one copy of the modified DNA and one copy of the unmodified. They would have only a fifty-fifty chance of passing the modified copy on. But if the gene has the effect of breaking the other (unmodified) copy, then the organism’s natural DNA repair mechanisms will copy from the other chromosome to repair the damage. That copies the modified gene over! Now the offspring has only the modified DNA and will pass it on with 100% chance. So will its offspring, forever, until there are no fertile females.
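A toy sketch of how fast that inheritance distortion spreads (my own simplification: random mating, homing always converts a carrier’s offspring into a full carrier, and I ignore the sterility cost, so this only illustrates the non-Mendelian inheritance):

```python
# If homing always succeeds, an offspring lacks the drive only when
# *both* parents lack it, since a single inherited copy gets "repaired"
# into two copies. Under random mating, carrier frequency c updates as
# c' = 1 - (1 - c)^2 each generation.
carrier_freq = 0.01  # start with 1% of individuals carrying the drive
history = [carrier_freq]
for _ in range(10):
    carrier_freq = 1 - (1 - carrier_freq) ** 2
    history.append(round(carrier_freq, 4))
print(history)  # climbs from 1% to near-fixation within ~10 generations
```

Compare with an ordinary fifty-fifty allele, whose frequency wouldn’t change at all under this neutral model.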
That looks right mathematically but seems absurd. Maybe steady state isn’t the right situation to think about this in? It’s weird that the strategy of “never reproduce” would be just as good as the usual, since not reproducing means not dying. Or we need to model the chance that the bamboo dies due to illness/fire/animals prior to getting a chance to reproduce?
Very interesting. Seems like the growth rate equations are off. Since the trees die off after giving off their seeds, population is just (mp)^2 after two generations. In steady state, mp will always have to be about 1, which puts a somewhat high bar on s to make it worth it (can you really double seed production by waiting twice as long?).
And where do the bamboo store all these seed producing resources for so long?
Similarly, CVS currently sells homeopathic ‘medicine’, marketed exactly as if it were real medicine. https://www.cvs.com/shop/content/homeopathic-remedies If you didn’t know what “homeopathic” meant, could you tell that these were fake medicines? They say things like “The ONLY clinically proven cold shortening nasal swab” on them. I think some of these might also contain real medicine at normal doses, but I frankly can’t tell for certain. So why do we expect CVS to become more benevolent when all the regulations are gone than it currently is?
Beautiful! That’s also a nice demonstration of B=C2.
I think your left diagram is correct but the one for C2 is off somewhat. In both, we’re conditioning on the statement that “you have an ace of spades”, so we’re exclusively looking in that top circle. Both C1 and C2 have the same exact grey shaded area. But in C2, some of the green shaded region inside that circle is also missing: the cases where you have an ace of spades but I happened to tell you about one of the other aces instead. So C2 is a subset of C1 (condition on being told you have the ace of spades) where only a randomly selected subset of the winning hands are chosen (1/2 of the ones with two aces, 1/3 of the ones with three, etc).
But that correction doesn’t really change much since your diagram is just the combination of four disjoint diagrams, one for each of the suits. So the ratio of grey to green is right, but I find it harder to compare to C1.
Either way, my main point was that C2 might have been driving our intuition that C=B, and in fact, C2=B, so our intuition isn’t doing too badly.
Say you’re dealt 13 out of 52 standard playing cards. Call the chance of getting two aces A. Now imagine a second round, in which I tell you that I know that you already have at least one ace in your hand. The chance of holding two aces in this scenario is B. Lastly, I tell you that the ace I know you’re holding happens to be the ace of spades. The chance of holding two aces is now C. Can you sort A, B and C?
Part of the reason this is so counter-intuitive is that this setup is actually ambiguous and the answer depends on how you interpret it. “The ace I know you’re holding” is worded poorly—I may be holding more than one ace! Consider the following options:
C1: I check whether you hold the ace of spades specifically and tell you either that you have it or that you do not have it.
C2: I check whether you hold at least one ace, and if so, I tell you the suit of a randomly chosen one of those aces, otherwise I tell you that you have no aces.
The probability of getting at least two aces, conditional on being told you have the ace of spades, is different in these two scenarios! In fact, B < C1 and B = C2. I get p(C1) = 56%, p(C2) = 36.9% = p(B). Maybe the intuition here is a little clearer, since we can see that winning hands that contain an ace of spades are all reported by C1 but some are not reported by C2, while all losing hands that contain an ace of spades are reported by both C1 and C2 (since there’s only one ace for C2 to choose from). So C2 “enriches” for losing states when conditioning on being told that we have an ace of spades.
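Those numbers can be checked exactly with a short computation (my own sketch; it also shows why C2 collapses to B: reporting a uniformly random ace makes the “told spades” probability 1/4 regardless of how many aces you hold):

```python
from math import comb

total = comb(52, 13)
# P(exactly k aces in a 13-card hand)
p_k = [comb(4, k) * comb(48, 13 - k) / total for k in range(5)]

# B: P(at least two aces | at least one ace)
b = sum(p_k[2:]) / sum(p_k[1:])

# C1: P(at least two aces | the ace of spades is in the hand).
# Fix the ace of spades; the other 12 cards come from the remaining 51.
c1 = 1 - comb(48, 12) / comb(51, 12)  # 1 - P(no further aces)

# C2: condition on "a uniformly chosen ace from your hand is the spade".
# P(reported ace is the spade | k aces) = (k/4) * (1/k) = 1/4 for all k >= 1,
# so the conditioning weight is constant and C2 equals B exactly.
c2 = sum(p / 4 for p in p_k[2:]) / sum(p / 4 for p in p_k[1:])

print(round(b, 4), round(c1, 4), round(c2, 4))
```

This gives B = C2 ≈ 37% and C1 ≈ 56%, matching the figures above.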
This is somewhat like the “Ignorant Monty” variant of the Monty Hall problem where Monty chooses a door (other than the contestant’s door) at random, potentially revealing either a goat or a car. Should you switch when he reveals a goat? If you haven’t seen this before, solve it yourself first—I found it as unintuitive as the original Monty Hall problem.
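If you want to check your answer for Ignorant Monty, here is a quick Monte Carlo sketch (my own; trial count and seed are arbitrary):

```python
import random

random.seed(0)
stay = switch = shown_goat = 0
for _ in range(200_000):
    car = random.randrange(3)
    pick = random.randrange(3)
    # Ignorant Monty opens one of the two other doors uniformly at random.
    monty = random.choice([d for d in range(3) if d != pick])
    if monty == car:
        continue  # he revealed the car; the puzzle conditions on a goat
    shown_goat += 1
    stay += (pick == car)
    switch += (3 - pick - monty == car)  # the remaining unopened door

print(stay / shown_goat, switch / shown_goat)
```

Both conditional win rates come out around 1/2: unlike the original problem, an accidental goat reveal carries no information, so switching neither helps nor hurts.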
Thanks for trying to walk me through this more, though I’m not sure this clears up my confusion. An even more similar model to the one in the video (a pendulum) would be the model y = (a+b)x^2 + cx + d, which has four parameters a, b, c, d, but of course you don’t really need both a and b. My point is that, as far as the loss function is concerned, the situation for a fourth degree polynomial’s redundancy is identical to the situation for this new model. Yet we clearly have two different types of redundancy going on:
Type A: like the fourth degree polynomial’s redundancy which impairs generalizability since it is merely an artifact of the limited training data, and
Type B: like the new model’s redundancy which does not impair generalizability compared to some non-redundant version of it since it is a redundancy in all outputs
Moreover, my intuition is that a highly over-parametrized neural net has much more Type A redundancy than Type B. Is this intuition wrong? That seems like almost the definition of “over-parametrized”: a model with a lot of Type A redundancy. But maybe I am instead wrong to be looking at the loss function in the first place?
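The Type A / Type B distinction can be made concrete numerically (my own construction; the cubic family and the test point x = 3 are choices I made for illustration):

```python
import numpy as np

x_train = np.array([0.0, 1.0, 2.0])
y_train = x_train**2  # three training points lying on y = x^2

# Type B: y = (a+b)x^2 + cx + d. Moving along the flat direction
# a + b = const changes the parameters but changes no prediction anywhere.
def model_b(params, x):
    a, b, c, d = params
    return (a + b) * x**2 + c * x + d

x_test = np.linspace(-3.0, 3.0, 13)
assert np.allclose(model_b((1.0, 0.0, 0.0, 0.0), x_test),
                   model_b((0.5, 0.5, 0.0, 0.0), x_test))

# Type A: the cubics x^2 + t*x(x-1)(x-2) all have zero training loss,
# since the cubic factor vanishes at every training input, yet they
# disagree away from the training set.
def cubic(t, x):
    return x**2 + t * x * (x - 1) * (x - 2)

assert np.allclose(cubic(5.0, x_train), y_train)  # perfect on train
print(cubic(0.0, 3.0), cubic(5.0, 3.0))  # 9.0 vs 39.0 at x = 3
```

Both models have a one-dimensional set of zero-loss parameters, but only the Type A direction changes off-training-set predictions.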