dynomight
Do you happen to have any recommended pointers for research on health impacts of processed food? It’s pretty easy to turn up a few recent meta-reviews, which seem like a decent place to start, but I’d be interested if there were any other sources, particularly influential individual experiments, etc. (It seems like there are a whole lot of observational studies, but many fewer RCTs, for reasons that I guess are pretty understandable.) It seems like some important work here might never use the word “processing”.
If I hadn’t heard back from them, would you want me to tell you? Or would that be too sad?
Seed oils are usually solvent-extracted, which makes me wonder: how thoroughly are they scrubbed of solvent, what stuff from the solvent ends up absorbed into the oil (itself an effective solvent for various things), and so on.
I looked into this briefly at least for canola oil. There, the typical solvent is hexane. And some hexane does indeed appear to make it into the canola oil that we eat. But hexane apparently has very low toxicity, and—more importantly—the hexane that we get from all food sources apparently makes up less than 2% of our total hexane intake! https://www.hsph.harvard.edu/nutritionsource/2015/04/13/ask-the-expert-concerns-about-canola-oil/ Mostly we get hexane from gasoline fumes, so if hexane is a problem, it’s very hard to see how to pin the blame on canola oil.
It’s a regression. Just like they extrapolate backwards to (1882+50=1932) using data from 1959, they extrapolate forwards at the end. (This is discussed in the “timelines” section.) This is definitely a valid reason to treat it with suspicion, but nothing’s “wrong” exactly.
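If it helps to picture the mechanics, here’s a minimal sketch with made-up numbers (not the actual data from the post): fit a line on the observed range, then evaluate it outside that range in both directions.

import numpy as np

# Made-up observations, purely for illustration
years = np.array([1959, 1970, 1980, 1990, 2000, 2010])
values = np.array([1.0, 1.4, 1.9, 2.3, 2.9, 3.4])

slope, intercept = np.polyfit(years, values, 1)  # ordinary least-squares line

def predict(year):
    return slope * year + intercept

print(predict(1932))  # backwards extrapolation, before any data
print(predict(2040))  # forwards extrapolation, past the end of the data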
Many thanks! All fixed (except one that I prefer the old way).
As the original author of underrated reasons to be thankful (here), I guess I can confirm that tearing apart the sun for raw materials was not an intended implication.
I think matplotlib has way too many ways to do everything for any list to be comprehensive! But I think you could do almost everything with some variant of these:
ax.spines['top'].set_visible(False)                 # hide a spine ('left' / 'right' / 'bottom' work too)
ax.set_xticks([0, 50, 100], ['0%', '50%', '100%'])  # set tick locations and labels together
ax.tick_params(axis='x', bottom=False, top=False)   # hide x tick marks (for y, use axis='y' with left/right)
ax.set_ylim([0, 0.30])                              # fix both y limits
ax.set_ylim([0, ax.get_ylim()[1]])                  # pin the lower limit at 0, keep the automatic upper limit
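And in case it helps to see those calls in context, here’s a minimal self-contained sketch (with made-up data) that you could run as-is:

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 50, 100], [0.05, 0.12, 0.28])  # made-up data

ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.set_xticks([0, 50, 100], ['0%', '50%', '100%'])
ax.tick_params(axis='x', bottom=False, top=False)
ax.set_ylim([0, ax.get_ylim()[1]])

plt.show()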
Good point regarding year tick marks! I was thinking that labeling 0°C would make the most sense when freezing is really important. Say, if you were plotting historical data on temperatures and you were interested in trying to estimate the last frost date in spring or something. Then, 10°C would mean “twice as much margin” as 5°C.
One way you could measure which one is “best” would be to measure how long it takes people to answer certain questions. E.g. “For what fraction of the 1997-2010 period did Japan spend more on healthcare per capita than the UK?” or “What’s the average ratio of healthcare spending in Sweden vs. Greece between 2000 and 2010?” (I think there is an academic literature on these kinds of experiments, though I don’t have any references on hand.)
In this case, I think Tufte goes overboard in saying you shouldn’t use color. But if the second plot had color, I’d venture it would win most such contests, if only because the y-axis is bigger and it’s easier to match the lines with the labels. But even if I don’t agree with everything Tufte says, I still find him useful because he suggests different options and different ways to think about things.
Thanks, someone once gave me the advice that after you write something, you should go back to the beginning and delete as many paragraphs as you can without making everything incomprehensible. After hearing this, I noticed that most people tend to write like this:
Intro
Context
Overview
Other various throat clearing
Blah blah blah
Finally an actual example, an example, praise god
Which is pretty easy to correct once you see it!
Hey, you might be right! I’ll take this as useful feedback that the argument wasn’t fully convincing. Don’t mean to pull a motte-and-bailey, but I suppose if I had to, I’d retreat to an argument like, “if making a plot, consider using these rules as one option for how to pick axes.” In any case, if you have any examples where you think following this advice leads to bad choices, I’d be interested to hear them.
I think you’re basically right: Correlation is just one way of measuring dependence between variables. Being correlated is a sufficient but not necessary condition for dependence. We talk about correlation so much because:
We don’t have a particularly convenient general scalar measure of how related two variables are. You might think about using something like mutual information, but for that you need the densities, not just datasets (see the sketch below).
We’re still living in the shadow of the times when computers weren’t so powerful. We got used to doing all sorts of stuff based on linearity decades ago because we didn’t have any other options, and those methods became “conventional” even though we might have better options now.
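To make the “sufficient but not necessary” point concrete, here’s a tiny sketch: take X symmetric around zero and Y = X². Then Y is completely determined by X, but the correlation between them is essentially zero, because correlation only picks up linear relationships. (Mutual information would flag the dependence, but estimating it from a finite sample means binning or density estimation, which is exactly the inconvenience above.)

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)    # symmetric around zero
y = x ** 2                      # completely determined by x

print(np.corrcoef(x, y)[0, 1])  # ~0: correlation misses the purely nonlinear dependence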
Thanks, you’ve 100% convinced me. (Convincing someone that something they (a) know to be true and (b) don’t think is surprising actually is surprising is a rare feat. Well done!)
Chat or instruction-finetuned models have poor prediction calibration, whereas base models (in some cases) have perfect calibration.
Tell me if I understand the idea correctly: log-loss on next-token prediction leads to good calibration for single-token predictions, which manifests as well-calibrated percentage predictions? But then RLHF is some crazy loss totally removed from calibration that destroys all that?
If I get that right, it seems quite intuitive. Do you have any citations, though?
Sadly, no—we had no way to verify that.
I guess one way you might try to confirm/refute the idea of data leakage would be to look at the decomposition of Brier scores: GPT-4 is much better calibrated for politics than for science, but only very slightly better at politics in terms of refinement/resolution. Intuitively, I’d expect data leakage to manifest as better refinement/resolution rather than better calibration.
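In case the decomposition isn’t familiar: the standard (Murphy) decomposition of the Brier score splits it into reliability (the calibration term), resolution, and uncertainty. Here’s a rough sketch of the binned version, with an arbitrary choice of bins of my own:

import numpy as np

def brier_decomposition(probs, outcomes, n_bins=10):
    # Murphy decomposition: Brier = reliability - resolution + uncertainty,
    # where "reliability" is the calibration term and (uncertainty - resolution)
    # is sometimes called "refinement". Binned, so only approximate.
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    base_rate = outcomes.mean()
    bins = np.clip((probs * n_bins).astype(int), 0, n_bins - 1)

    reliability = 0.0
    resolution = 0.0
    for b in range(n_bins):
        mask = bins == b
        if not mask.any():
            continue
        weight = mask.mean()              # fraction of forecasts in this bin
        avg_prob = probs[mask].mean()     # average stated probability
        hit_rate = outcomes[mask].mean()  # observed frequency of the event
        reliability += weight * (avg_prob - hit_rate) ** 2
        resolution += weight * (hit_rate - base_rate) ** 2

    uncertainty = base_rate * (1.0 - base_rate)
    return reliability, resolution, uncertainty

# Sanity check: reliability - resolution + uncertainty should roughly equal
# the raw Brier score, np.mean((probs - outcomes) ** 2).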
That would definitely be better, although it would mean reading/scoring 1056 different responses, unless I can automate the scoring process. (Would LLMs object to doing that?)
Thank you, I will fix this! (Our Russian speaker agrees and claims they noticed this but figured it didn’t matter 🤔) I re-ran the experiments with the result that GPT-4 shifted from a score of +2 to a score of −1.
Well, no. But I guess I found these things notable:
Alignment remains surprisingly brittle and random. Weird little tricks remain useful.
The tricks that work for some models often seem to confuse others.
Cobbling together weird little tricks seems to help (Hindi ranger step-by-step).
At the same time, the best “trick” is a somewhat plausible story (duck-store).
PaLM 2 is the most fun, Pi is the least fun.
You’ve convinced me! I don’t want to defend the claim you quoted, so I’ll modify “arguably” into something much weaker.
I would dissuade no one from writing drunk, and I’m confident that you too can say that people are penguins! But I’m sorry to report that personally I don’t do it by drinking, but rather by writing a much longer version with all those kinds of clarifications included and then obsessively editing it down.