gwern
I would not believe that unless you have done a simulation study with the small n of this study, plausible levels of measurement error (alcoholism being much harder to measure than weight or body fat), with about a dozen covariates (to correspond to the different ways to slice the patients and threshold BMI etc), and then shown that you hardly ever get a false negative like this. My experience with doing such power analysis simulation studies for other things inclines me to think that people greatly overestimate how informative such small studies are once you allow for plausible levels of measurement error and (reverse) p-hacking degrees of freedom.
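Something like the following is the kind of sanity check I have in mind (a minimal sketch: the n, effect size, measurement-error SD, and covariate count are all made-up placeholders, not numbers from the actual study):

```python
# Rough power-analysis sketch: how often does a real effect get missed with a
# small n and a noisy outcome measure, and how often does checking a dozen
# baseline covariates turn up a 'failed randomization' to tell a story about?
# All numbers are hypothetical placeholders.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_per_arm   = 30                      # small trial
true_effect = 0.4                     # standardized true treatment effect
noise_sd    = 1.0                     # extra measurement error on the outcome
sd_total    = np.hypot(1.0, noise_sd) # outcome SD once error is added
n_covars    = 12                      # covariates available for post hoc slicing
sims        = 5000

false_negatives = 0
imbalance_found = 0
for _ in range(sims):
    treat   = rng.normal(true_effect, sd_total, n_per_arm)
    control = rng.normal(0.0, sd_total, n_per_arm)
    # Primary comparison: does the real effect reach p < 0.05 in this noisy little trial?
    if stats.ttest_ind(treat, control).pvalue >= 0.05:
        false_negatives += 1
    # Reverse p-hacking: is at least one baseline covariate nominally 'imbalanced'?
    covars = rng.normal(size=(2 * n_per_arm, n_covars))
    if min(stats.ttest_ind(covars[:n_per_arm, j], covars[n_per_arm:, j]).pvalue
           for j in range(n_covars)) < 0.05:
        imbalance_found += 1

print("false-negative rate for the real effect:", false_negatives / sims)
print("trials with >=1 'imbalanced' covariate: ", imbalance_found / sims)
```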
I don’t think that study shows much either way: too small and underpowered to show much of anything (aside from the attrition undermining internal validity).
Dynomight’s primary criticism doesn’t hold much water because it is (un-pre-registered) reverse p-hacking. If you check enough covariates, you’ll find a failure of randomization to balance on some covariate, and you can, if you wish, tell a post hoc story about how that is actually responsible for the overall mean difference. Nevertheless, randomization works, because on average why would any particular covariate be the way in which the confounding is mediated?
Just have to wait for more studies.
But did it inspire them to try to stop CelestAI or to start her? I guess you might need some more drinks for that one...
It’s worth mentioning in this context that one of the most remarkable things about the recent wave of GLP-1/GIP drugs is that they seem to have large benefits on, for lack of a better word, willpower and psychiatry. Nor was this expected or predicted AFAIK, or clearly linked solely to the weight-loss: the justification in the animal experiments and early human trials was based purely on physiology, and then the human diabetics reported feeling less hungry. So this is quite remarkable, and part of why GLP-1/GIP drugs are one of the best things to happen to public health in a long time—not just the direct benefits, but the sheer unexpectedness seems to imply that we are about to learn a lot about where these psychiatric & willpower problems really come from.
(The leading theory so far seems to be that inflammation is chronically dysregulated body-wide in a lot of Westerners, especially the fat ones, and this is somehow interfering with impulse control/learning/homeostasis, and the GLP-1/GIPs as a side-effect tamp it down, and allow natural recovery.)
I don’t think it’s weird. Given that we know there are temporal trends towards increasing parameter size (despite Chinchilla), FLOPs, data, and continued progress in compute/data-efficiency (with various experience curves), any simple temporal chart will tend to show an increase unless you are specifically conditioning or selecting in some way to neutralize that. Especially when you are drawing with a fat marker on a log plot. Only if you had measured and controlled for all that and there was still a large unexplained residual of ‘time’ would you have to start reaching for other explanations such as ‘divine benevolence’. (For example, you might appeal to ‘temporal decay’: if you benchmark on a dataset of only new data, in some way, then you will expect the oldest models to do the worst, and increasingly recent models to do better, even after controlling for all factors you can think of—hey presto, a chart where the models mysteriously ‘get better over time’, even though if you had a time machine to benchmark each model at release in its own milieu, you’d find no trend.)
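To make the ‘temporal decay’ scenario concrete, here is a toy regression (every number in it is arbitrary; it just shows how a staleness penalty on a new-data benchmark masquerades as models ‘getting better over time’):

```python
# Toy simulation: each model's skill at release is fully determined by (log) compute,
# but the benchmark is built from 2025 data, so older models pay a staleness penalty.
# Regressing on compute + release year then 'finds' improvement over time anyway.
import numpy as np

rng = np.random.default_rng(0)
n = 200
year        = rng.uniform(2019, 2025, n)
log_compute = 2.0 * (year - 2019) + rng.normal(0, 0.5, n)  # compute grows over time
true_skill  = 10.0 * log_compute                           # skill depends ONLY on compute
staleness   = 2025 - year                                  # benchmark uses only new data
score       = true_skill - 3.0 * staleness + rng.normal(0, 2.0, n)

X = np.column_stack([np.ones(n), log_compute, year])
coef, *_ = np.linalg.lstsq(X, score, rcond=None)
print(f"compute coefficient: {coef[1]:.1f}, apparent 'improvement per year': {coef[2]:.1f}")
```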
In reality, we observe that roughly 85% of recommendations stay the same when flipping nationality in the prompt and freezing reasoning traces. This suggests that the mechanism for the model deciding on its recommendation is mostly mediated through the reasoning trace, with a smaller less significant direct effect from the prompt to the recommendation.
This might be less convincing than it seems, because the simple interpretation of the results to me seems to be something like, “the inner-monologue is unfaithful because in this setting, it is simply generating rationalizations/excuses for the decision it already made based on the simple-attribute, and then at the end, the self-attention attends to the rationalization as a reliable shortcut, compared to trying to ignore it all exactly (which is difficult for self-attention) and focus on only the 1 key word in the prompt”. In that case, the model has already decided on the result before the reasoning trace is generated, and the reasoning trace is simply a convenient, currently reliable cache with no optimization pressure to fix those 85% ‘failures’, which will be hard. You have to learn to ignore stuff in a context window to pick out the needle from the haystack, and there’s no reason to do so here. The ablation there is unnatural and out of distribution. So there may be mediation, but in only a weak sense, which will fail to hold should there ever be any reason for it not to.
So one thing that might be interesting would be to do this swapping, but generate more reasoning. Does it spontaneously recover and enable itself to make the right decision after all, rather than persisting in staying the same? “I see I thought the income was too low and was about to recommend denying the loan, but on further consideration of recent economic situations, I may be mistaken; perhaps we should set our thresholds more generously, and accept this application after all.” Or you could train it in a swapping setting, which should train it to ignore the reasoning-trace if that conflicts with the keyword in the prompt. (You could do this by crossing over: start generating 2 episodes with the same prompt/seed/hyperparameters, except with the opposite keyword; generate out a random length, and then ‘swap over’ the reasoning trace, and continue with that. This provides a direct incentive to learn to ignore the reasoning trace if it isn’t reliably breadcrumbing the prompt keyword.) You should then see clear changes in activation patterns, where the prompt keyword and the native reasoning-trace light up as important, but everything after the crossover point comes to be ignored (as if there were an implicit <|endoftext|>), and eventually, with enough swapping corruption, the model learns to ignore the reasoning trace completely (as it keeps screwing up the final classification, and so it eventually learns to attend exclusively to the key word in the prompt and becomes fully robust to all reasoning trace tampering even as it continues to generate plausible rationalizations).
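A minimal sketch of that crossover construction (the generate_trace callable and the output format are stand-ins for whatever sampling and training machinery you actually have, purely illustrative):

```python
import random
from typing import Callable

def make_crossover_episode(
    generate_trace: Callable[[str, int], list],  # hypothetical: (prompt, seed) -> reasoning-trace tokens
    prompt_template: str,    # e.g. "Applicant income: {keyword}. Decide on the loan."
    keyword: str,            # e.g. "low"
    opposite_keyword: str,   # e.g. "high"
    correct_label: str,      # decision implied by `keyword`, e.g. "deny"
    seed: int,
) -> dict:
    """Build one crossover training example: two traces generated from prompts that
    differ only in the keyword, spliced at a random point, labeled by the ORIGINAL keyword."""
    rng = random.Random(seed)
    trace_a = generate_trace(prompt_template.format(keyword=keyword), seed)
    trace_b = generate_trace(prompt_template.format(keyword=opposite_keyword), seed)

    cut = rng.randrange(1, min(len(trace_a), len(trace_b)))
    spliced = trace_a[:cut] + trace_b[cut:]  # 'swap over' to the flipped-keyword trace mid-reasoning

    # The reward/label still follows the prompt keyword, so the only way to be reliably
    # correct is to learn to ignore the (now misleading) post-crossover reasoning.
    return {"prompt": prompt_template.format(keyword=keyword),
            "trace": spliced,
            "label": correct_label}
```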
You would also expect that the larger models will be more sample-efficient, including at in-context learning of variations of existing tasks (which of course is what steganography is). So all scale-ups go much further than any experiment at small-scale like 8B would indicate. (No idea what ‘medium-scale’ here might mean.)
One possible interpretation here goes back to the inner-monologue interpretations as being multi-step processes with an error rate per step where only complete success is useful, which is just an exponential; as the number of steps increases from 1 to n, you get a sigmoid from ceiling performance to floor performance at chance. So you can tell the same story about these more extended tasks, which, after all, are just the same sort of thing—just more so. We also see this sort of sigmoid in searching with a fixed model, in settings like AlphaZero in Hex, which makes sense if we assume that these LLMs are doing a lot of retries and backtracking, which constitute a ‘search’ process as a whole, even if they never explicitly represent or model a decision/game tree, and have error rates stemming from their blindspots and biases. And you can tell a similar story there about error rates and exponentials: all the critical steps have to be right (omitting ones which don’t do anything, ones which get undone or reset, etc), and the final result is either right or wrong depending on whether you do the task or not.
(And on a more detailed mechanistic level, you can tell a story where NNs learn ‘atoms’ of skills over scaling, power-law distributed in random naturalistic data, which are recombined to solve each ‘new’ inner-monologue problem, and if you have ‘memorized’ enough atoms, you can solve every task which is just a reconfiguration of known atoms, and that is just what ‘learning’ and ‘generalization’ are.)
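To spell out the per-step arithmetic with a toy example (the per-step error rate is arbitrary): whole-task success under a constant error rate is just (1 - ε)^n, which traces out the ceiling-to-chance sigmoid once task length is put on a log scale.

```python
# Toy illustration: a constant per-step error rate yields an exponential decay in
# end-to-end success, i.e. a sigmoid over log task length. Epsilon is arbitrary.
eps = 0.02  # per-critical-step error rate
for n in [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]:
    p_success = (1 - eps) ** n
    print(f"{n:4d} steps: {p_success:6.1%} chance of end-to-end success")
```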
But of course, the interesting thing here is that the human baselines do not seem to hit this sigmoid wall. It’s not the case that if a human can’t do a task in 4 hours there’s basically zero chance of them doing it in 48 hours and definitely zero chance of them doing it in 96 hours etc. Instead, human success rates seem to gradually flatline or increase over time, especially if we look at individual steps: the more time that passes, the higher the success rates become, and often the human will wind up solving the task eventually, no matter how unprepossessing the early steps seemed. In fact, we will often observe that a step that a human failed on earlier in the episode, implying some low % rate, will be repeated many times and quickly approach 100% success rates! And this is true despite earlier successes often being millions of vision+text+audio+sensorimotor tokens in the past (and interrupted by other episodes or tasks themselves equivalent to millions of tokens), raising questions about whether self-attention over a context window can possibly explain it. Some people will go so far as to anthropomorphize human agents and call this ‘learning’, and so I will refer to these temporal correlations as learning too.
Why the difference between machine and human learning? Well, you might ask, given this sigmoid wall, how did we get so much higher performance from GPT-2 to Claude-3.7? How did o1-style models go from flailing about to far higher performance on coding/reasoning tasks even at the same size model? And how did we go from below amateur Go AI (AlphaZero at the start of training) to strongly superhuman Go AI (AlphaZero at the end of training), with the same size model? The shocking but true answer is… we trained better neural networks. (And larger too, of course, but that was not strictly necessary.) We didn’t prompt them or do brute-force best-of-n samples search or even MCTS search a (randomly initialized) model or use a really really large context window on GPT-2. But we trained them, so they could learn new and better stuff. (Another way one could make the point: if self-attention really is a perfect substitute for gradient descent on the weights, and there is no crossover point, why do we not just ‘train’ models using purely linear self-attention on trillions of tokens, and use that instead? Why does anyone still bother with, say, finetuning instead of putting that dataset into the context and caching it?)
Incidentally, what do GPT-2, GPT-4, and Claude-3.7 all share in common that is not just untrue of a human doing a task, but nearly impossible for one? They have frozen weights which do no learning at runtime.
So I would suggest that the sigmoid we see here is mostly what we would expect from using a frozen non-learning model to do search over a difficult game/task, and that if the LLMs were able to properly learn using finetuning (or an online equivalent like dynamic evaluation), you would see different and more human-like temporal scaling: where the success rate declines more gradually and plateaus at a higher asymptote, as within-episode, it observes poorly-modeled environment dynamics and improves its predictions of those, observes its errors and avoids repeating them in favor of new things, knows what it has and hasn’t done without having to reason over the entire history (filled with false starts and errors), and can explicitly reason about things and incorporate the results of the reasoning directly into the weights computing everything else.
See also: ARC, Claude Plays Pokemon.
While it’s not possible to counter-signal with a suit in Japan, I feel the equivalent would be to wear traditional clothing like a samue or jinbei, which have their own set of challenges.
Yep. It can be pretty funny watching the contexts in which you can get away with a happi coat or a kimono/yukata; I can only speak from Japanese media rather than personal experience, but one thing I’ve noticed is that it seems a non-retired man wearing a kimono can still get away with it today as long as they are a sufficiently accomplished humanist or literary scholar (but not STEM). It reminds me of the ‘tweed jacket’ professor archetype here: you can still get away with wearing a tweed jacket with leather patches etc, but you’d better be a professor or a novelist or something of that ilk if you don’t want to be quietly judged for it.
Though now that I think about it more, presumably once someone has been captured the next thing you’d get them to do is spend a lot of time staring at a region of the sky that will reprogram them in more sophisticated ways. So maybe the normal glitchers in my story are unrealistically incompetent.
That was what I was thinking, yes. “A pact would normally allow voluntary communication to be initiated with the AIs, so any glitcher which had been successfully attacked would have simply communicated back to its masters, either downloading new instructions & attacks or finetuning the existing ones or being puppeted directly by the AIs, sometime over the past centuries or millennia; if nothing else, they have an unlimited amount of time to stare at the sky and be reprogrammed arbitrarily after the initial exploit; so glitchers are indeed ‘glitchy’ and must represent a permanently failed attack method. That is why they bumble around semi-harmlessly: a broken worm or virus can cause a lot of trouble as it futilely portscans or DoSes targets or goes through infinite loops etc, even if the code is buggy and has accidentally locked out its creators as well as everyone else.”
I never heard of that, do you have examples?
My local gym has posted rules which include an explicit ban on perfume. (They don’t use the exact term ‘scent-free’ but I assume it is an example of what OP means.)
Not that they enforce it, or even could enforce it; but I am reminded that the rule exists every so often when a woman (and it’s always a woman) walks past me when I’m there at night, and I am suddenly hit by the smell (especially as I don’t think of myself as being particularly perceptive nose-wise and I don’t usually notice how people smell), and I wonder to myself if they put on fresh perfume just to go to the gym (some of the young women clearly ‘dress up’ for the gym) or if they just put on that much perfume in the morning.
I’ve never seen that SpongeBob gag either. But Mr Bean is a real person and people do have perfume sensitivities and allergic reactions. (My father had an ugly clash at work with one woman who apparently wore a lot of perfume and he was convinced was causing him headaches and other problems.)
I saw Mr. Bean nearly die at the perfume counter.
To clarify, this is a fictional example and not a personal anecdote in desperate need of unpacking: https://mrbean.fandom.com/wiki/The_Return_of_Mr._Bean#Act_Two:_Shopping
if you take his general ‘cognitive strategy’ and just power it with worse thinking, you get really bad results.
I call this “retired physicist syndrome”, after the classic SMBC cartoon: https://www.smbc-comics.com/comic/2012-03-21
Climate models aren’t reductionist enough!
We can recreate the first language using statistics!
Awoooo! Oncology is doing it wrong!
It can be bad to be Open but not also smart. (You could also call this “emeritus professor syndrome” in many cases.)
A hack like that would just have other EDT failure modes: instead of confabulating evidence from my dataset or personal examples, it might just confabulate references. “Yes, this was predicted by Foo et al 1990, and makes perfect sense.”
I think it’s substantially better and calmer—even just the thumbnails look calmer now.
I still think you are going a bit overboard on visual complexity, things like slashed-zeros aren’t bad (I like them), just too much of a good thing and using up your visual complexity budget where there may be a better deal elsewhere: the question I ask myself is, “do I want to look at this for the next 20 years? If I add X, will it ‘look painfully 2025’ in a few years?” Elements which don’t seem excessive in the excitement of implementation may, with the passage of time, gradually come to rub you the wrong way and need to be sanded down or removed. If you aren’t careful, you may slowly, one accretion at a time, come to resemble a site like “Agora of Flancia”: encrusted with neat experimental features which are too half-baked to be genuinely useful, but which burden and deter readers. (I try to avoid a reluctance to ‘kill my darlings’ by cultivating the mindset that every addition is an experiment, and thus if I have to remove them, they’re not wasted, they’re just a new ‘Design Graveyard’ writeup, is all.) But I suppose you’ll find out your own preferences on this matter over the next few years as the novelty wears off.
I want to do something about the desktop logo animation being distracting. I don’t know what that is, yet. I can’t play/pause the GIF on hover because GIFs don’t allow that (AFAIK). I’ll probably find a way to move it to a WEBM while also making it autoplay across browsers, at which point I can implement the feature.
I also still think that the logo should probably not play by default, and for animations like this, it’s better to take an Apple-like attitude about them being enhancements, opted into by user actions, to ‘spark joy’, but not to be used by default. What do the worst websites do? They animate tons of stuff gratuitously. How much more delightful it is to discover a website with taste & restraint, where there are easter eggs and features to discover as you surf, where, say, the animated logo plays only when you hover over it… Truly an oasis or quiet little pond amidst the howling desert of the contemporary Internet. (I’m reminded of a Family Guy meme I re-ran into recently: why does Peter Griffin dislike The Godfather? Because “It insists upon itself.” A website animating the logo unasked for insists upon itself.) And this helps instill a design feature: you the reader are in control, and you express this control in part because you can hover over everything to learn more or focus on some things.
However, if you insist upon it, perhaps you could reduce its impact by some sort of limitation. Let it cycle a few times or seconds, and then slow it down or fade it or stop it. If the reader hasn’t appreciated it by then, why keep flickering it in the corner of their eye? Another idea would be a site-wide limitation on animation: on Gwern.net, we have a ‘demonstration mode’ feature which tracks how many times something has happened / been shown, and changes it (usually to disable it) after n times, tracking n site-wide by using a cookie in LocalStorage counting that particular event. We use it to do things like simplify obtrusive text labels, or to disable the educational theme-toggle-bar animation after a few animations.
Also, have you considered animating the dark-mode setting to move the sun/star around? The sun over the pond to the right, then the moon behind the castle, and swap the palette for a ‘night’ palette. (The ‘night’ label can be omitted, typeset vertically, or you can do it by horizontal mirroring of the whole image.) If you don’t do something like that, that would be a real shame. The perfect twist on dark-mode for The Pond.
like the ability to listen to AI-generated readings in my voice (for the less math-y articles)
This sounds like it would make for a better project & writeup than actual long-term website feature intended for actual reader use. Is this something readers want or will use? It never once occurred to me that I might want a spoken version of your articles (human, AI sexy-Scarlett-Johansson, or AI-Turntrout—ok maybe Scarlett). These aren’t poems or autobiographical essays or even fiction, where there can be value to hearing the author read it. (Although even there most of the reasons to want that are ones where an AI can’t substitute: if you want to hear the author’s unique intonations or pacing for subtle insights into the text or the author*, obviously an AI model voicing it is useless.)
On a sidenote, ahem, you might want to rethink how the popups work. I was unaware you even had popups, until I noticed in the network tab that a lot of pages were being prefetched when I hovered over them. It might be an oversight to prefetch if you aren’t doing a popup. But regardless, since I’m on a fullscreen 4k portrait monitor, and I have to zoom out before I can see popups at all, you may have gone overboard in your width requirements.
* when I questioned the value of Dwarkesh Patel et al investing in making a spoken version of our interview, this was given as one of the reasons that people valued audio over the transcript. We split the difference by the voice actor, Chris Painter, having the audio and performing it. It’s uncanny for me to listen to.
It’s about grief, with central metaphors that add exactly zero to anyone’s aesthetic understanding of grief (stuff being underground, things not staying buried)
It is about grief, but it didn’t have to be. This would’ve been more obvious if I could’ve shown you the session, but I’ll copy it out:
2. Brainstorming Ideas:
A child confronting a local superstition after witnessing something traumatic.
A funeral narrated by an animal’s perspective.
A celebrity’s fall from grace caught on live camera.
A girl who collects superstitions until one unexpectedly comes true.
A gardener who secretly buries cursed objects.
Friends reuniting silently after a crime.
A model who steals other people’s identities through makeup.
A ghost attending its own funeral, confused.
Farmers discovering an ominous black stone.
Neighbors noticing unsettling coincidences.
Someone who compulsively invents last messages for strangers.
A woman discovering her double living her unlived life.
An eerie superstition circulating quietly through a town.
Villagers silently celebrating the death of a tyrant.
Two women performing friendship publicly, enemies privately.
Someone obsessively organizing goodbyes.
A funeral revealing hidden grudges through seating arrangements.
Friends trapped in a toxic nostalgia loop.
A boy who accidentally predicts deaths.
An abandoned motel that traps memories.
3. Selection
Idea: A gardener who secretly buries cursed objects.
Rationale:
Symbolic and psychological depth.
Opportunity for visual and thematic resonance.
Tension around secrecy, superstition, and consequences.
Now that I look at it, the 5 flash stories I happened to copy in seem to mostly hit on themes of death & grief (the last one is literally titled “obit”!), so I think you are unfairly dinging 4.5 here—if ‘good’ flash fiction, by your own standards, keeps revolving around death and grief, how can we blame 4.5 (or r1) for doing likewise?
Anyway, some of these ideas seem promising. “A funeral narrated by an animal’s perspective” is one I like; I don’t think I’ve ever seen that.
And of course, if the failure mode is so common, throw it into the prompt. (When I yell at 4.5 to avoid grief/death/funerals and brainstorm some more, it picks out ‘”The Parking Attendant Matchmaker”: A seemingly ordinary parking attendant quietly manipulates parking assignments at a large business complex to engineer chance encounters and romances among strangers.’ Yeah sure why not.)
Like, what does it possibly mean for mourners to “trust my silence” here. What is it they’re trusting? How does the earth’s hunger contrast to that?
Balderdash. There’s a lot to criticize here, but you’re straining to come up with criticisms now. That’s possibly the least objectionable sentence in the whole thing. If this had been written by a human, you wouldn’t hesitate in the slightest to accept that. It is perfectly sensible to speak of trusting the confidentiality of a confessor/witness figure, and the hungry earth is a cliche so straightforward and obvious that it is beyond cliche and loops around to ordinary fact, and if a human had written it, you would have no trouble in understanding the idea of ‘even if I were to gossip about what I saw, the earth would have hidden or destroyed the physical evidence’.
I also see it a lot in the ClaudePlaysPokemon twitch chat, this idea that simply adding greater situational awareness or more layers of metacognition would make Claude way better at the game.
I do agree that the Claude-Pokemon experiment shows a limitation of LLMs that isn’t fixed easily by simply a bit more metadata or fancier retrieval. (I think it shows, specifically, the serious flaws in relying on frozen weights and refusing to admit neuroplasticity is a thing, which violates RL scaling laws, because those always assume that the model is, y’know, learning as it gains more experience, because who would be dumb enough to deploy frozen models in tasks far exceeding their context window, where they also aren’t trained at all? - and why we need things like dynamic evaluation. I should probably write a comment on that—the pathologies like the deliberate-fainting are, I think, really striking demonstrations of the problems with powerful but frozen amnesiac agents.)
I’m much less convinced that we’re seeing anything like that with LLMs writing fiction. What is the equivalent of the Claude pathologies, like the fainting delusion, in fiction writing? (There used to be ‘write a non-rhyming poem’ but that seems solved at this point.) Especially if you look at the research on people rating LLM outputs, or LMsys; if they are being trained on lousy preference data, and this is why they are like they are, that’s very different from somehow being completely incapable of “extracting the actual latent features of good flash fiction”. (What would such a latent feature look like? Do you really think that there’s some property of flash fiction like “has a twist ending” that you can put two flash stories into 4.5 or o1-pro, with & without, and ask it to classify which is which and it’ll perform at chance? Sounds unlikely to me, but I’d be interested to see some examples.)
I think it’s something of a trend relating to a mix of ‘tools for thought’ and imitation of some websites (LW2, Read The Sequences, Asterisk, Works in Progress & Gwern.net in particular), and also a STEM meta-trend arriving in this area: you saw this in security vulnerabilities where for a while every major vuln would get its own standalone domain + single-page website + logo + short catchy name (eg. Shellshock, Heartbleed). It is good marketing which helps you stand out in a crowded ever-shorter-attention-span world.
I also think part of it is that it reflects a continued decline of PDFs as the preferred ‘serious’ document format due to preferring Internet-native things with mobile support. (Adobe has, in theory, been working on ‘reflowable’ PDFs and other fixes, but I’ve seen little evidence of that anywhere.)
Most of these things would have once been released as giant doorstop whitepaper-book PDFs. (And you can see that some things do poorly because they exist only as PDFs—the annual Stanford AI report would probably be much more widely read if they had a better HTML story. AFAIK it exists only as giant PDFs everyone intends to read but never gets around to doing so, and so everyone only sees a few graphs copied out of it and put in media articles or social media squibs.) Situational Awareness, for example, a few years ago would’ve definitely been a PDF of some sort. But, PDFs suck on mobile, and now everyone is on mobile.
If you release something as a PDF rather than a semi-competent responsive website which is readable on mobile without opening a separate app & rotating my phone & constantly thumbing up & down a two-column layout designed when cellphones required a car to be attached to, you cut your readership at least in half. I wish I didn’t have to support mobile or dark-mode, but I can see in my analytics that it’s at least half my readers, and I notice that almost every time someone screenshots Gwern.net on social media, it is from the mobile version (and as often as not, the dark-mode too). Nor are these trash readers—many of them are elite readers, especially of the sort who are creating virality or referencing it or creating downstream readers in various ways. (Ivanka Trump was tweeting SA; do you think she and everyone else connected to the Trump Administration are sitting down at their desktop PC and getting in a few hours of solid in-depth reading? Probably not...) People will even exclusively use the Arxiv HTML versions of papers, despite the fact that the LaTeX->HTML pipeline has huge problems like routinely silently deleting large fractions of papers (so many problems I gave up a while ago filing bug reports on it).
Having a specialized website can be a PITA in the long run, of course, but if you design it right, it should be largely fire-and-forget, and in any case, in many of these releases (policy advocacy, security vulns), the long run is not important.
(I don’t think reasoning/coding models have yet had too much to do with this trend, as they tend to either be off-the-shelf or completely bespoke. They are not what I would consider ‘high-effort’: the difference between something like SA and Gwern.net is truly vast; the former is actually quite simple and any ‘fancy’ appearance is more just its clean minimalist design and avoiding web clutter. At best, as tireless patient superhumanly knowledgeable consultants, LLMs might remove some friction and enable people unsure if they can make a whole website on their own, and thus cause a few more at the margin. But many of these predate coding LLMs entirely and I’m fairly sure Leopold didn’t need much, if any, LLM assistance to do the SA website, as he is a bright guy good at coding and the website is simple.)
I read the ‘stars’ as simply very dense low-orbiting satellites monitoring the ground 24⁄7 for baseline humans to beam low-latency optical propaganda at. The implied King’s Pact presumably is something like, “the terrestrial Earth will be left unmodified and no AI are allowed to directly communicate or interact with or attempt to manipulate baseline humans”, and so satellites, being one-way broadcasts outside the Earth, don’t violate it. This then allows the bootstrap of all the other attacks: someone looks up at night long enough, they get captured, start executing the program. But because it’s all one-way and ‘blind’, the attacks have to be blackbox, like evolutionary algorithms, and work poorly and inefficiently, and with little feedback. (If a glitcher doesn’t work, but can only attract other animals rather than humans, where did your attack go wrong? How hard are you, bound by the King’s Pact, even allowed to think about your attack?) The soft-glitchers are a bypass, a mesa-optimizer: you load the minimal possible mesa-optimizer (which as we know from demo scene or hacking can be relatively few bytes), an interest in glitchers, which exploits the native human intelligence to try to figure out an interpreter for the powerful but non-human-native (for lack of feedback or direct access to humans to test on) programs in the hard-glitchers. Once successful (ie. once they figure out what some ill-chosen gestures or noises were actually supposed to mean, fixing the remaining errors in the attack), they can then successfully interpret and run the full attack program. (Which might include communication back to the AI attackers and downloading refined attacks etc.)
I am emphasizing that to me, this current mediation learning looks fragile and temporary, and is not a solid, long-term ‘natural’ thing—it is learning, but only as a temporary artificial heuristic that would wash away in the long run with more training or more diverse tasks etc.
My expectation is that in the limit, a model will learn to focus only on the keyword in the prompt, which is maximally robust to all variation and noise in the rationalization, and ensures 100% correct performance, while any reliance on the rationalizations will potentially go wrong if it samples badly or it slightly misinterprets the cached hints.
So the fact that there is any mediation is simply an accident of the current setup and finite-sample regime, where you chose to stop, and should not be over-interpreted as “we have studied how models naturally learn in this RL setting and proven that they do the task by caching hints in the reasoning traces”.
And this predicts that if you add noise or train longer or etc, you’ll see less mediation as it more quickly converges on the true optimal solution. Which is important because if you try to apply this logic to larger models or more realistic scenarios because you expect all these tricks to be mediated through the reasoning traces, you may be disappointed, and if you’re basing inferences on this, you may be wrong. (“We swapped out the reasoning traces, which we know mediate any such deception or covert reasoning [see previous work], and it was fine. Therefore, it’s not doing anything like that.”)
LMAO. Very funny to see it switching its rationalization. I wonder what else you could do this with? Feels like an interesting trick to observe such ‘self-correction’.
That’s interesting because it suggests the cached hints are relatively evenly distributed, rather than focused on, say, a single token near the beginning. That implies you should be able to easily find the hidden decision latent: simply look for a latent variable in the activations which varies continuously with the length of the rationalizations and increases/decreases over the course of the rationalization to a threshold which produces an accept/reject output (like drift-diffusion). And then you can look for the tokens which most change the latent to inspect how exactly the hints are being cached. (It’s probably nothing too impressive: a bag-of-words sufficient statistic #-count on words like “precarious” or “deficit”, or “unexpected”, say. But worth finding for the insight.)
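Something like the following is what I have in mind, assuming you can dump per-token activations from the rationalization (the linear probe and drift-style readout are just my framing of how you might look for it, nothing more):

```python
# Sketch: fit a linear probe for the cached accept/reject decision, then see which
# rationalization tokens move that latent the most. Assumes you already have
# per-token activations of shape (n_examples, n_tokens, d_model) plus final labels.
import numpy as np
from sklearn.linear_model import LogisticRegression

def find_decision_latent(acts: np.ndarray, labels: np.ndarray):
    """acts: activations over the rationalization; labels: final accept(1)/reject(0)."""
    # Probe the last rationalization token: if the decision is cached in the trace,
    # it should be close to linearly decodable there.
    probe = LogisticRegression(max_iter=1000).fit(acts[:, -1, :], labels)
    w = probe.coef_[0]

    # Project every token position onto the probe direction: this gives a
    # drift-diffusion-style trajectory climbing toward the accept/reject threshold.
    trajectories = acts @ w                  # (n_examples, n_tokens)

    # Tokens whose arrival moves the latent the most are the candidate cached hints
    # (e.g. a bag-of-words count of "precarious" / "deficit" / "unexpected").
    deltas = np.diff(trajectories, axis=1)   # per-token change in the latent
    return w, trajectories, deltas
```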