gwern
...The loss of knowledge has been attributed to several factors. Firstly, Lind showed in his work that there was no connection between the acidity of the citrus fruit and its effectiveness at curing scurvy. In particular, he noted that acids alone (sulphuric acid or vinegar), would not suffice. Despite this, it remained a popular theory that any acid could be used in place of citrus fruit. This misconception had significant consequences.
When the Royal Navy changed from using Sicilian lemons to West Indian limes, cases of scurvy reappeared. The limes were thought to be more acidic and it was therefore assumed that they would be more effective at treating scurvy. However, limes actually contain much less vitamin C and were consequently much less effective. Furthermore, fresh fruit was substituted with lime juice that had often been exposed to either air or copper piping. This resulted in at least a partial removal of vitamin C from the juice, thus reducing its effectiveness.
The discovery that fresh meat was able to cure scurvy was another reason why people no longer treated the condition with fresh fruit. This discovery led to the belief that perhaps scurvy was not caused by a dietary problem at all. Instead, it was thought to be the result of a bacterial infection from tainted meat. In fact, the healing properties of fresh meat come from the high levels of vitamin C it contains.
Finally, the arrival of steam shipping substantially reduced the amount of time people spent at sea, therefore the difficulties in carrying enough fresh produce were reduced. This decreased the risk of scurvy so that less effective treatments, such as lime juice, proved effective enough to deal with the condition most of the time. Unfortunately, this meant that knowledge of the most effective treatment for scurvy was gradually lost....
(Note for anyone confused why that Grok 3 archive snapshot looks ‘cut off’: there is a scroll-frame inside the page, which your browser may be hiding from you because everyone hides scrollbars these days. The conversation continues after “Hint: think like a number theorist / Thoughts”.)
A google of the first paragraph takes you quickly to https://www.bluesci.co.uk/posts/forgotten-knowledge
I stumbled upon deep in the bowels of https://gwern.net/ which I’ve annoyingly never been able to find again.
Probably https://gwern.net/newsletter/2021/05#master-synthesis
I wouldn’t have guessed just from the landing page that he’s the discoverer of backprop, respected former program director at the NSF, etc.
That’s what makes it alpha! If he was as legible as, say, Hinton, he would be mined out by now, and nothing but beta. (Similar situation to Schmidhuber - ‘obvious crackpot’ - although he’s such a self-promoter that he overcomes it, and so at this point there’s no alpha talking to him; the stuff that would be interesting, like his relationship to certain wealthy Italians, or to King Bonesaws, or how he’s managed to torpedo his career so spectacularly, he will not talk about. Also, I understand he likes to charge people for the privilege of talking to him.) You have to have both domain knowledge and intellectual courage to know about Werbos and eg. read his old interviews and be willing to go out on a limb and interview him.
Are we sure that these questions aren’t in their datasets? I don’t think we can be. First off, you just posted them online.
Questions being online is not a bad thing. Pretraining on the datapoints is very useful, and does not introduce any bias; it is free performance, and everyone should be training models on the questions/datapoints before running the benchmarks (though they aren’t). After all, when a real-world user asks you a new question (regardless of whether anyone knows the answer/label!), you can… still train on the new question then and there just like when you did the benchmark. So it’s good to do so.
It’s the answers or labels being online which is the bad thing. But Byrnes’s comment and the linked Kagi page does not contain the answers to those 3 questions, as far as I can see.
You can see it as an example of ‘alpha’ vs ‘beta’. When someone asks me about the value of someone as a guest, I tend to ask: “do they have anything new to say? didn’t they just do a big interview last year?” and if they don’t but they’re big, “can you ask them good questions that get them out of their ‘book’?” Big guests are not necessarily as valuable as they may seem because they are highly-exposed, which means both that (1) they have probably said everything they will say before and there is no ‘news’ or novelty, and (2) they are message-disciplined and careful to “talk their book”. (In this analogy, “alpha” represents undiscovered or neglected interview topics which can be extracted mostly just by finding them and then asking the obvious question, usually by interviewing new people; “beta” represents doing standard interview topics/people, but much more so—harder, faster, better—and getting new stuff that way.)
Lex Fridman podcasts are an example of this: he often hosts very big guests like Mark Zuckerberg, but nevertheless, I will sit down and skim through the transcript of 2-4 hours of content, and find nothing even worth excerpting for my notes. Fridman notoriously does no research and asks softball questions, and invites the biggest names he can get regardless of overexposure, and so if you do that, you will get nothing new. He has found no alpha, and he doesn’t interview hard enough to extract beta. So he’s sort of the high-expense ratio index fund of podcast interviews.
Sarah Paine, on the other hand, seems to have been completely unknown and full of juicy nuggets, and is like winning the lottery: you can make a career off a really good trade like Paine before it gets crowded. However, if another successful podcaster has her on, they will probably not discover Paine is their most popular or growth-productive guest ever. The well is dry. Paine may have more to say someday, but that day is probably closer to “5 years from today” than “tomorrow”.
(So a good interviewer adopts an optimal foraging mindset: once you have harvested a patch of its delicious food, you have to move on to another patch, which hasn’t been exhausted yet, and let the original patch slowly recover.)
So a great guest for Dwarkesh’s blog would be, say Hans Moravec or Paul J. Werbos: Moravec hasn’t done anything publicly in at least a decade, and is fallow; while Werbos has been more active and in the public eye, but still not much and is such a weird guy that just about any questions will be interesting. Reich was also a good guest because while Reich is very ‘public’ in some senses (he’s written popularizing books, even), he is still obscure, almost none of what he has published is well-known, and he is involved in so much fast-paced research that even the book is now substantially obsolete and he has a lot of new stuff to say. (And Reich will have more stuff to say if revisited in, say, 2 years for an update, so a harvester will be making a note to revisit him if the current crop of interview candidates in the pipeline is looking marginal.) A difficult or mediocre guest would be Tony Blair: he can surely say many interesting things about the current geopolitical context and his work since being PM… but he is a super-experienced career politician who has survived countless Question Times, and may eat you for breakfast and exploit you for ulterior purposes (rather than vice-versa). Similarly, Mark Zuckerberg and Satya Nadella are tough nuts: there’s meat there, but are you willing enough to bring down the hammer or will you settle for a mediocre result that mostly just fills space and is not a must-watch? A bad guest might be someone controlling and extremely PR-savvy like MrBeast; this is the sort of guy who will give you a bad interview pushing his ‘book’ shamelessly, and then might wind up spiking the interview anyway if he felt he wasn’t getting enough out of it, and just drops it as a sunk cost (though it was weeks of work on your part and blows a hole in your schedule—that’s not his problem).
I agree with all of this! But I’m not sure I understand what you mean by “there may be mediation, but only in a weak sense”. We were just interested in studying how models naturally learn in this RL setting.
I am emphasizing that to me, this current mediation learning looks fragile and temporary, and is not a solid, long-term ‘natural’ thing—it is learning, but only as a temporary artificial heuristic that would wash away in the long run with more training or more diverse tasks etc.
My expectation is that in the limit, a model will learn to focus only on the keyword in the prompt, which is maximally robust to all variation and noise in the rationalization, and ensures 100% correct performance, while any reliance on the rationalizations will potentially go wrong if it samples badly or it slightly misinterprets the cached hints.
So the fact that there is any mediation is simply an accident of the current setup and finite-sample regime, where you chose to stop, and should not be over-interpreted as “we have studied how models naturally learn in this RL setting and proven that they do the task by caching hints in the reasoning traces”.
And this predicts that if you add noise or train longer or etc, you’ll see less mediation as it more quickly converges on the true optimal solution. Which is important because if you try to apply this logic to larger models or more realistic scenarios because you expect all these tricks to be mediated through the reasoning traces, you may be disappointed, and if you’re basing inferences on this, you may be wrong. (“We swapped out the reasoning traces, which we know mediate any such deception or covert reasoning [see previous work], and it was fine. Therefore, it’s not doing anything like that.”)
...While their profession is stable, the overall financial health and potential for unexpected expenses could pose a risk. However, the income level is relatively high, which could support the ability to make loan payments....accept
LMAO. Very funny to see it switching its rationalization. I wonder what else you could do this with? Feels like an interesting trick to observe such ‘self-correction’.
I also tried a more extreme version where I delete the second half of each reasoning trace (leaving the first ~150 reasoning tokens out of ~300) and let the model generate from there. This resulted in ~37% of recommendations remaining the same as the original. I anticipate there’s a continuous relationship between how much of the reasoning trace is preserved and how likely the model is to maintain its original recommendation.
That’s interesting because it suggests the cached hints are relatively evenly distributed, rather than focused on, say, a single token near the beginning. That implies you should be able to easily find the hidden decision latent: simply look for a latent variable in the activations which varies continuously with the length of the rationalizations and increases/decreases over the course of the rationalization to a threshold which produces an accept/reject output (like drift-diffusion). And then you can look for the tokens which most change the latent to inspect how exactly the hints are being cached. (It’s probably nothing too impressive: a bag-of-words sufficient statistic #-count on words like “precarious” or “deficit”, or “unexpected”, say. But worth finding for the insight.)
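(If one wanted to operationalize that probe, a minimal sketch, assuming you have saved the per-token hidden states over the rationalization and the final accept/reject decisions, might look like the following; the array names and the simple logistic-regression probe are my own placeholders, not anything from the original experiment:)

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def drift_curves(activations: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """activations: assumed [n_episodes, n_tokens, d_model] hidden states over the rationalization;
    labels: assumed final accept(1)/reject(0) decisions for each episode."""
    # Fit a linear accept/reject probe on the last token's activation...
    probe = LogisticRegression(max_iter=1000).fit(activations[:, -1, :], labels)
    w, b = probe.coef_[0], probe.intercept_[0]
    # ...then project every token position onto that direction, giving one curve per episode.
    # If the cached hints are spread evenly, these curves should drift roughly steadily toward
    # the decision boundary (drift-diffusion-like), rather than jumping at one early token.
    return activations @ w + b          # shape [n_episodes, n_tokens]

# The tokens whose per-token increments move the curve the most (e.g. 'precarious',
# 'deficit', 'unexpected') are then the candidate cached hints.
```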
I would not believe that unless you have done a simulation study with the small n of this study, plausible levels of measurement error (alcoholism being much harder to measure than weight or body fat), with about a dozen covariates (to correspond to the different ways to slice the patients and threshold BMI etc), and then shown that you hardly ever get a false negative like this. My experience with doing such power analysis simulation studies for other things inclines me to think that people greatly overestimate how informative such small studies are once you allow for plausible levels of measurement error and (reverse) p-hacking degrees of freedom.
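(To give a concrete sense of the simulation study I mean, here is a minimal sketch; every number in it (n, effect size, measurement noise, covariate count) is an illustrative assumption, not anything taken from the actual study:)

```python
import numpy as np
rng = np.random.default_rng(0)

def one_trial(n=40, true_effect=0.5, noise_sd=1.0, k_covariates=12):
    treat = rng.integers(0, 2, n)                        # randomized assignment
    outcome = true_effect * treat + rng.normal(0, 1, n)  # true underlying outcome
    observed = outcome + rng.normal(0, noise_sd, n)      # noisy measurement (alcoholism is hard to measure)
    a, b = observed[treat == 1], observed[treat == 0]
    t = (a.mean() - b.mean()) / np.sqrt(a.var(ddof=1)/len(a) + b.var(ddof=1)/len(b))
    detected = t > 1.96                                  # did the real effect show up?
    # 'reverse p-hacking': scan a dozen irrelevant covariates for some imbalance to blame.
    covs = rng.normal(0, 1, (n, k_covariates))
    imbalanced = any(abs(covs[treat == 1, j].mean() - covs[treat == 0, j].mean()) > 2 * np.sqrt(4 / n)
                     for j in range(k_covariates))
    return detected, imbalanced

results = np.array([one_trial() for _ in range(2000)])
print("power to detect the real effect:", results[:, 0].mean())
print("fraction of trials with some 'failed' covariate to blame:", results[:, 1].mean())
```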
I don’t think that study shows much either way: too small and underpowered to show much of anything (aside from the attrition undermining internal validity).
Dynomight’s primary criticism doesn’t hold much water because it is (un-pre-registered) reverse p-hacking. If you check enough covariates, you’ll find a failure of randomization to balance on some covariate, and you can, if you wish, tell a post hoc story about how that is actually responsible for the overall mean difference. Nevertheless, randomization works, because on average why would any particular covariate be the way in which the confounding is mediated?
Just have to wait for more studies.
But did it inspire them to try to stop CelestAI or to start her? I guess you might need some more drinks for that one...
It’s worth mentioning in this context that one of the most remarkable things about the recent wave of GLP-1/GIP drugs is that they seem to have large benefits on, for lack of a better word, willpower and psychiatry. Nor was this expected or predicted AFAIK, or clearly linked solely to the weight-loss: the justification in the animal experiments and early human trials was based purely on physiology, and then on the human diabetics reporting that they felt less hungry. So this is quite remarkable, and part of why GLP-1/GIP drugs are one of the best things to happen to public health in a long time—not just the direct benefits, but the sheer unexpectedness seems to imply that we are about to learn a lot about where these psychiatric & willpower problems really come from.
(The leading theory so far seems to be that inflammation is chronically dysregulated body-wide in a lot of Westerners, especially the fat ones, and this is somehow interfering with impulse control/learning/homeostasis, and the GLP-1/GIPs as a side-effect tamp it down, and allow natural recovery.)
I don’t think it’s weird. Given that we know there are temporal trends towards increasing parameter size (despite Chinchilla), FLOPs, data, and continued progress in compute/data-efficiency (with various experience curves), any simple temporal chart will tend to show an increase unless you are specifically conditioning or selecting in some way to neutralize that. Especially when you are drawing with a fat marker on a log plot. Only if you had measured and controlled for all that and there was still a large unexplained residual of ‘time’ would you have to start reaching for other explanations such as ‘divine benevolence’. (For example, you might appeal to ‘temporal decay’: if you benchmark on a dataset of only new data, in some way, then you will expect the oldest models to do the worst, and increasingly recent models to do better, even after controlling for all factors you can think of—hey presto, a chart where the models mysteriously ‘get better over time’, even though if you had a time machine to benchmark each model at release in its own milieu, you’d find no trend.)
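(A toy illustration of the point, with invented numbers: if score is driven entirely by compute and compute grows over time, a naive regression on release date shows a healthy ‘trend’ which disappears once you control for the real driver:)

```python
import numpy as np
rng = np.random.default_rng(1)

years = rng.uniform(2018, 2025, 200)
log_compute = 1.2 * (years - 2018) + rng.normal(0, 0.5, 200)  # compute grows with time
score = 10 * log_compute + rng.normal(0, 3, 200)              # score depends only on compute

naive_slope = np.polyfit(years, score, 1)[0]                  # regress score on time alone
resid = score - np.polyval(np.polyfit(log_compute, score, 1), log_compute)
controlled_slope = np.polyfit(years, resid, 1)[0]             # time trend after controlling for compute
print(f"naive slope per year: {naive_slope:.1f}")
print(f"slope after controlling for compute: {controlled_slope:.1f}")
```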
In reality, we observe that roughly 85% of recommendations stay the same when flipping nationality in the prompt and freezing reasoning traces. This suggests that the mechanism for the model deciding on its recommendation is mostly mediated through the reasoning trace, with a smaller less significant direct effect from the prompt to the recommendation.
This might be less convincing than it seems, because the simple interpretation of the results to me seems to be something like, “the inner-monologue is unfaithful because in this setting, it is simply generating rationalizations/excuses for the decision it already made based on the simple-attribute, and then at the end, the self-attention attends to the rationalization as a reliable shortcut, compared to trying to ignore it all exactly (which is difficult for self-attention) and focus on only the 1 key word in the prompt”. In that case, the model has already decided on the result before the reasoning trace is generated, and the reasoning trace is simply a convenient, currently reliable cache with no optimization pressure to fix those 85% ‘failures’, which will be hard. You have to learn to ignore stuff in a context window to pick out the needle from the haystack, and there’s no reason to do so here. The ablation there is unnatural and out of distribution. So there may be mediation, but in only a weak sense, which will fail to hold should there ever be any reason for it not to.
So one thing that might be interesting would be to do this swapping, but generate more reasoning. Does it spontaneously recover and enable itself to make the right decision after all, rather than persisting in staying the same? “I see I thought the income was too low and was about to recommend denying the loan, but on further consideration of recent economic situations, I may be mistaken; perhaps we should set our thresholds more generously, and accept this application after all.” Or you could train it in a swapping setting, which should train it to ignore the reasoning-trace if that conflicts with the keyword in the prompt. (You could do this by crossing over: start generating 2 episodes with the same prompt/seed/hyperparameters, except with the opposite keyword; generate out a random length, and then ‘swap over’ the reasoning trace, and continue with that. This provides a direct incentive to learn to ignore the reasoning trace if it isn’t reliably breadcrumbing the prompt keyword.) You should then see clear changes in activation patterns, where the prompt keyword and the native reasoning-trace light up as important, but then everything after the crossover point gets learned to be ignored (as if there were an implicit <|endoftext|>), and eventually, with enough swapping corruption, it learns to ignore the reasoning trace completely (as it keeps screwing up the final classification, and so it eventually learns to attend exclusively to the key word in the prompt and become fully robust to all reasoning trace tampering even as it continues to generate plausible rationalizations).
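(As a sketch of what that crossover training setup might look like; `generate`, `continue_from`, and `label_for` are placeholder stand-ins for the model/environment interface, not an existing API:)

```python
import random

def crossover_episode(model, prompt_template, keyword_a, keyword_b, label_for):
    # Two episodes identical except for opposite keywords in the prompt.
    prompt_a = prompt_template.format(keyword=keyword_a)
    prompt_b = prompt_template.format(keyword=keyword_b)
    trace_a = model.generate(prompt_a)   # reasoning-trace tokens for A
    trace_b = model.generate(prompt_b)   # reasoning-trace tokens for B

    # Swap over at a random point: keep prompt A but splice in B's reasoning suffix.
    cut = random.randint(1, min(len(trace_a), len(trace_b)) - 1)
    corrupted_trace = trace_a[:cut] + trace_b[cut:]

    # Continue generation from the corrupted trace and reward only agreement with the
    # *prompt's* keyword: since the trace is now unreliable, the only way to be
    # consistently correct is to learn to ignore it.
    decision = model.continue_from(prompt_a, corrupted_trace)
    reward = 1.0 if decision == label_for(keyword_a) else 0.0
    return prompt_a, corrupted_trace, decision, reward
```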
You would also expect that the larger models will be more sample-efficient, including at in-context learning of variations of existing tasks (which of course is what steganography is). So all scale-ups go much further than any experiment at small-scale like 8B would indicate. (No idea what ‘medium-scale’ here might mean.)
One possible interpretation here is going back to the inner-monologue interpretations as being multi-step processes with an error rate per step where only complete success is useful, which is just an exponential; as the number of steps increase from 1 to n, you get a sigmoid from ceiling performance to floor performance at chance. So you can tell the same story about these more extended tasks, which after all, are just the same sort of thing—just more so. We also see this sort of sigmoid in searching with a fixed model, in settings like AlphaZero in Hex, which makes sense if we assume that these LLMs are doing a lot of retries and backtracking, which constitute a ‘search’ process as a whole, even if they never explicitly represent or model a decision/game tree, and have error rates stemming from their blindspots and biases. And you can tell a similar story there about error rates and exponentials: all the critical steps have to be right (omitting ones which don’t do anything, ones which get undone or reset, etc), and the final result is either right or wrong as you do the task or not.
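(To make the per-step error model concrete: if each of the n critical steps independently fails at some rate, overall success is (1 − error-rate)^n, which plotted against task length on a log axis is exactly that sigmoid from ceiling to chance. A toy calculation, with an invented error rate:)

```python
import numpy as np

eps = 0.02                                   # assumed per-step error rate (illustrative)
steps = np.array([1, 10, 50, 100, 200, 500, 1000])
p_success = (1 - eps) ** steps               # every critical step must succeed
for n, p in zip(steps, p_success):
    print(f"{n:5d} steps -> P(complete success) = {p:.3f}")
```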
(And on a more detailed mechanistic level, you can tell a story where NNs learn ‘atoms’ of skills over scaling, power-law distributed in random naturalistic data, which are recombined to solve each ‘new’ inner-monologue problem, and if you have ‘memorized’ enough atoms, you can solve every task which is just a reconfiguration of known atoms, and that is just what ‘learning’ and ‘generalization’ are.)
But of course, the interesting thing here is that the human baselines do not seem to hit this sigmoid wall. It’s not the case that if a human can’t do a task in 4 hours there’s basically zero chance of them doing it in 48 hours and definitely zero chance of them doing it in 96 hours etc. Instead, human success rates seem to gradually flatline or increase over time, especially if we look at individual steps: the more time that passes, the higher the success rates become, and often the human will wind up solving the task eventually, no matter how unprepossessing the early steps seemed. In fact, we will often observe that a step that a human failed on earlier in the episode, implying some low % rate, will be repeated many times and quickly approach 100% success rates! And this is true despite earlier successes often being millions of vision+text+audio+sensorimotor tokens in the past (and interrupted by other episodes or tasks themselves equivalent to millions of tokens), raising questions about whether self-attention over a context window can possibly explain it. Some people will go so far as to anthropomorphize human agents and call this ‘learning’, and so I will refer to these temporal correlations as learning too.
Why the difference between machine and human learning? Well, you might ask, given this sigmoid wall, how did we get so much higher performance from GPT-2 to Claude-3.7? How did o1-style models go from flailing about to far higher performance on coding/reasoning tasks even at the same size model? And how did we go from below amateur Go AI (AlphaZero at the start of training) to strongly superhuman Go AI (AlphaZero at the end of training), with the same size model? The shocking but true answer is… we trained better neural networks. (And larger too, of course, but that was not strictly necessary.) We didn’t prompt them or do brute-force best-of-n samples search or even MCTS search a (randomly initialized) model or use a really really large context window on GPT-2. But we trained them, so they could learn new and better stuff. (Another way one could make the point: if self-attention really is a perfect substitute for gradient descent on the weights, and there is no crossover point, why do we not just ‘train’ models using purely linear self-attention on trillions of tokens, and use that instead? Why does anyone still bother with, say, finetuning instead of putting that dataset into the context and caching it?)
Incidentally, what do GPT-2, GPT-4, and Claude-3.7 all share in common, that is not just untrue, but nearly impossible for a human doing a task? They have frozen weights which do no learning at runtime.
So I would suggest that the sigmoid we see here is mostly what we would expect from using a frozen non-learning model to do search over a difficult game/task, and that if the LLMs were able to properly learn using finetuning (or an online equivalent like dynamic evaluation), you would see different and more human-like temporal scaling: where the success rate declines more gradually and plateaus at a higher asymptote, as within-episode, it observes poorly-modeled environment dynamics and improves its predictions of those, observes its errors and avoids repeating them in favor of new things, knows what it has and hasn’t done without having to reason over the entire history (filled with false starts and errors), and can explicitly reason about things and incorporate the results of the reasoning directly into the weights computing everything else.
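(For concreteness, dynamic evaluation is just: take small gradient steps on whatever the episode has produced so far, so later predictions inherit it. A minimal sketch, assuming a HuggingFace-style causal LM; the model name and hyperparameters are placeholders:)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("my-agent-model")           # placeholder name
model = AutoModelForCausalLM.from_pretrained("my-agent-model")  # placeholder name
opt = torch.optim.SGD(model.parameters(), lr=1e-5)              # small, conservative steps

def observe(chunk: str):
    """After each chunk of the episode (observations, own outputs, mistakes),
    take one gradient step on it, so everything downstream inherits the update
    instead of having to be re-derived from the context window."""
    batch = tok(chunk, return_tensors="pt")
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    opt.step()
    opt.zero_grad()
```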
See also: ARC, Claude Plays Pokemon.
While it’s not possible to counter-signal with a suit in Japan, I feel the equivalent would be to wear traditional clothing like a samue or jinbei, which have their own set of challenges.
Yep. It can be pretty funny watching the contexts in which you can get away with a happi coat or a kimono/yukata; I can only speak from Japanese media rather than personal experience, but one thing I’ve noticed is that it seems a non-retired man wearing a kimono can still get away with it today as long as they are a sufficiently accomplished humanist or literary scholar (but not STEM). It reminds me of the ‘tweed jacket’ professor archetype here: you can still get away with wearing a tweed jacket with leather patches etc, but you’d better be a professor or a novelist or something of that ilk if you don’t want to be quietly judged for it.
Though now that I think about it more, presumably once someone has been captured the next thing you’d get them to do is spend a lot of time staring at a region of the sky that will reprogram them in more sophisticated ways. So maybe the normal glitchers in my story are unrealistically incompetent.
That was what I was thinking, yes. “A pact would normally allow voluntary communication to be initiated with the AIs, so any glitcher which had been successfully attacked would have simply communicated back to its masters, either downloading new instructions & attacks or finetuning the existing ones or being puppeted directly by the AIs, sometime over the past centuries or millennia; if nothing else, they have an unlimited amount of time to stare at the sky and be reprogrammed arbitrarily after the initial exploit; so glitchers are indeed ‘glitchy’ and must represent a permanently failed attack method. That is why they bumble around semi-harmlessly: a broken worm or virus can cause a lot of trouble as it futilely portscans or DoSes targets or goes through infinite loops etc, even if the code is buggy and has accidentally locked out its creators as well as everyone else.”
I never heard of that, do you have examples?
My local gym has posted rules which include an explicit ban on perfume. (They don’t use the exact term ‘scent-free’ but I assume it is an example of what OP means.)
Not that they enforce it, or even could enforce it; but I am reminded that the rule exists every so often when a woman (and it’s always a woman) walks past me when I’m there at night, and I am suddenly hit by the smell (especially as I don’t think of myself as being particularly perceptive nose-wise and I don’t usually notice how people smell), and I wonder to myself if they put on fresh perfume just to go to the gym (some of the young women clearly ‘dress up’ for the gym) or if they just put on that much perfume in the morning.
I’ve never seen that SpongeBob gag either. But Mr Bean is a real person and people do have perfume sensitivities and allergic reactions. (My father had an ugly clash at work with one woman who apparently wore a lot of perfume and he was convinced was causing him headaches and other problems.)
Experimentation is valuable for the high VoI, but it seems hard to encourage ‘in general’, because experimenting on anything is painful and difficult, and the more so the more important and valuable it is. So just ‘subsidizing experiments’ would be like ‘subsidizing fixing bugs in source code’.
What would you do if you were a funder who wanted to avoid this? Well, you’d… fund specific experiments you knew were important and of high value. Which is what the federal government and many other NGOs or philanthropists do.