dynomight
Sadly, no—we had no way to verify that.
I guess one way you might try to confirm/refute the idea of data leakage would be to look at the decomposition of Brier scores: GPT-4 is much better calibrated on politics than on science, but only very slightly better on politics in terms of refinement/resolution. Intuitively, I’d expect data leakage to manifest as better refinement/resolution rather than better calibration.
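(If it helps make that concrete, here’s a rough sketch of the decomposition I mean, in Python. The `brier_decomposition` helper, the 10-bin grouping, and the variable names are all just my own illustration, not anything from the post.)

```python
import numpy as np

def brier_decomposition(forecasts, outcomes, n_bins=10):
    """Two-component decomposition of the Brier score:
    Brier ~= calibration (reliability) + refinement.
    Exact when all forecasts within a bin are identical; approximate otherwise.
    forecasts: predicted probabilities in [0, 1]; outcomes: 0/1 results."""
    forecasts = np.asarray(forecasts, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    n = len(forecasts)
    bins = np.minimum((forecasts * n_bins).astype(int), n_bins - 1)

    calibration = 0.0
    refinement = 0.0
    for k in range(n_bins):
        mask = bins == k
        n_k = mask.sum()
        if n_k == 0:
            continue
        f_k = forecasts[mask].mean()   # mean forecast in this bin
        o_k = outcomes[mask].mean()    # observed frequency in this bin
        calibration += n_k * (f_k - o_k) ** 2
        refinement += n_k * o_k * (1.0 - o_k)
    return calibration / n, refinement / n

# For comparison, the raw Brier score is just:
# np.mean((forecasts - outcomes) ** 2)
```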
Are language models good at making predictions?
That would definitely be better, although it would mean reading/scoring 1056 different responses, unless I can automate the scoring process. (Would LLMs object to doing that?)
Thank you, I will fix this! (Our Russian speaker agrees and claims they noticed this but figured it didn’t matter 🤔) I re-ran the experiments with the result that GPT-4 shifted from a score of +2 to a score of −1.
Well, no. But I guess I found these things notable:
Alignment remains surprisingly brittle and random. Weird little tricks remain useful.
The tricks that work for some models often seem to confuse others.
Cobbling together weird little tricks seems to help (Hindi ranger step-by-step).
At the same time, the best “trick” is a somewhat plausible story (duck-store).
PaLM 2 is the most fun, Pi is the least fun.
Can I take ducks home from the park?
You’ve convinced me! I don’t want to defend the claim you quoted, so I’ll modify “arguably” into something much weaker.
I don’t think I have any argument that it’s unlikely aliens are screwing with us—I just feel it is, personally.
I definitely don’t assume our sensors are good enough to detect aliens. I’m specifically arguing we aren’t detecting alien aircraft, not that alien aircraft aren’t here. That sounds like a silly distinction, but I’d genuinely give much higher probability to “there are totally undetected alien aircraft on earth” than to “we are detecting glimpses of alien aircraft on earth.”
Regarding your last point, I totally agree those things wouldn’t explain the weird claims we get from intelligence-connected people. (Except indirectly: e.g., rumors spread more easily when people think something is possible for other reasons.) I think that our full set of observations is hard to explain without aliens! That is, I think P[everything | no aliens] is low. I just think P[everything | aliens] is even lower.
I know that the mainstream view on LessWrong is that we aren’t observing alien aircraft, so I doubt many here will disagree with the conclusion. But I wonder if people here agree with this particular argument for that conclusion. Basically, I claim that:
P[aliens] is fairly high, but
P[all observations | aliens] is much lower than P[all observations | no aliens], simply because it’s too strange that all the observations in every category of observation (videos, reports, etc.) never cross the “conclusive” line.
As a side note: I personally feel that P[observations | no aliens] is actually pretty low, i.e. the observations we have are truly quite odd / unexpected / hard-to-explain-prosaically. But it’s not as low as P[observations | aliens]. This doesn’t matter to the central argument (you just need to accept that the ratio P[observations | aliens] / P[observations | no aliens] is small) but I’m interested if people agree with that.
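(To make the shape of the argument explicit, here’s Bayes’ rule in odds form as a toy calculation. The specific numbers are made up purely for illustration; they aren’t estimates I’m attributing to anyone.)

```python
def posterior_odds(prior_odds, likelihood_ratio):
    """Bayes' rule in odds form:
    posterior odds = prior odds * P[observations | aliens] / P[observations | no aliens]."""
    return prior_odds * likelihood_ratio

# Illustrative numbers only (my assumptions for the example):
prior = 0.2 / 0.8   # P[aliens] fairly high: 20%
ratio = 0.01        # observations 100x more likely without aliens than with
post = posterior_odds(prior, ratio)
print(post / (1 + post))   # posterior P[aliens | observations] ~ 0.0025
```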
I still think it’s very unlikely we’re observing alien aircraft
I get very little value from proofs in math textbooks, and consider them usually unnecessary (unless they teach a new proof method).
I think the problem is that proofs are typically optimized for “give the most convincing possible evidence that the claim is really true to a skeptical reader who wants to check every possible weak point”. That’s not what most readers (especially new readers) want on a first pass, which is more like “give maximum possible insight into why this claim is true to a reader who is happy to trust the author on details that don’t add extra intuition”. At a glance, the Infinite Napkin seems to be optimizing much more for the latter.
If you’re worried about computational complexity, that’s OK. It’s not something that I mentioned because (surprisingly enough...) it isn’t something that any of the doctors discussed. If you like, let’s call that a “valid cost”, just like the medical risks and financial/time costs of doing tests. The central issue is whether it’s valid to worry about information causing harmful downstream medical decisions.
Why it’s bad to kill Grandma
Creative nonfiction training exercises
I might not have described the original debate very clearly. My claim was that if Monty chose “leftmost non-car door” you still get the car 2⁄3 of the time by always switching and 1⁄3 by never switching. Your conditional probabilities look correct to me. The only thing you might be “missing” is that (A) occurs 2⁄3 of the time and (B) occurs only 1⁄3 of the time. So if you always switch, your chance of getting the car is still (chance of A) × (prob of car given A) + (chance of B) × (prob of car given B) = (2/3)(1/2) + (1/3)(1) = 2/3.
One difference (outside the bounds of the original debate) is that if Monty behaves this way there are other strategies that also give you the car 2⁄3 of the time. For example, you could switch only in scenario B and not in scenario A. There doesn’t appear to be any way to exploit Monty’s behavior and do better than 2⁄3 though.
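(In case anyone wants to check this, here’s a throwaway simulation of the “leftmost non-car door” variant. It’s just my own sketch, not anything from the original debate.)

```python
import random

def trial(always_switch):
    """One round where Monty opens the LEFTMOST door that is neither
    the contestant's pick nor the car (the variant discussed above)."""
    car = random.randrange(3)
    pick = random.randrange(3)
    opened = next(d for d in range(3) if d != pick and d != car)
    if always_switch:
        pick = next(d for d in range(3) if d != pick and d != opened)
    return pick == car

n = 100_000
print(sum(trial(True) for _ in range(n)) / n)   # ~2/3 when always switching
print(sum(trial(False) for _ in range(n)) / n)  # ~1/3 when never switching
```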
Just to be clear, when talking about how people behave in forums, I mean more “general purpose” places like Reddit. In particular, I was not thinking about Less Wrong where in my experience, people have always bent over backwards to be reasonable!
Observations about writing and commenting on the internet
I have two thoughts related to this:
First, there’s a dual problem: given a piece of writing that’s on the Pareto frontier, how do you make it easy for readers whose utility functions are aligned with the piece to find it?
Related to this, for many people and many pieces of writing, a large part of the utility they get comes from the comments. I think this leads to dynamics where a piece whose writing is less than optimal can get popular and then (via its comments) reach a point on the frontier that’s hard to beat.
Done!
Tell me if I understand the idea correctly: log-loss on next-token prediction leads to good calibration for single-token prediction, which manifests as good calibration on percentage predictions? But then RLHF is some crazy loss totally removed from calibration that destroys all that?
If I get that right, it seems quite intuitive. Do you have any citations, though?
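(The intuition I have in mind, as a toy sketch rather than anything about how real models or RLHF pipelines are actually trained: log-loss is a proper scoring rule, so the probability that minimizes it on a stream of 0/1 outcomes is the empirical frequency, i.e. a calibrated one.)

```python
import numpy as np

# Toy illustration (mine, not from the post): the probability that minimizes
# average log-loss on 0/1 outcomes is the empirical frequency of 1s.
rng = np.random.default_rng(0)
outcomes = rng.random(10_000) < 0.37          # true rate 37%

grid = np.linspace(0.01, 0.99, 99)            # candidate predicted probabilities
log_loss = [-np.mean(outcomes * np.log(p) + (~outcomes) * np.log(1 - p))
            for p in grid]
print(round(grid[int(np.argmin(log_loss))], 2))   # ~0.37, matching the true rate
```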