gwern
First, I didn’t say it wasn’t communicating anything. But since you bring it up: it communicated exactly what jefftk said in the post already describing the scene. And what it did communicate that he didn’t say cannot be trusted at all. As jefftk notes, 4o, in doing style transfer, makes many large, heavily biased changes to the scene, going well beyond mere artifacts like fingers. If you don’t believe that people in that room had 3 arms or that the room looked totally different (I will safely assume that the room was not, in fact, lit up in tastefully cat-urine yellow in the 4o house style), why believe anything else it conveys? If it doesn’t matter what those small details were, then why ‘communicate’ a fake version of them all? And if it does matter what those small details were, surely it’s bad to communicate a fake, wrong version? (It is odd to take this blasé attitude of ‘it is important to communicate, and what is communicated is of no importance’.)
Second, this doesn’t rebut my point at all. Whatever true or false things it does or does not communicate, the image is ugly and unaesthetic: the longer you look at it, the worse it gets, as the more bland, stereotypical, and strewn with errors and laziness you understand it to be. It is AI slop. (I would personally be ashamed to post an image even to IRC, never mind my posts, which embodies such low standards and disrespects my viewers that much, and says, “I value your time and attention so little that I will not lift a finger to do a decent job when I add a big attention-grabbing image that you will spend time looking at.”) Even 5 seconds to try to inpaint the most blatant artifacts, or to tell ChatGPT, “please try again, but without the yellow palette that you overuse in every image”*, would have made it better.
* incidentally, I’ve been asking people here if they notice how every ChatGPT 4o-generated image is by default yellow. Invariably, they have not. One or two of them have contacted me later to express the sentiment that ‘what has been seen cannot be unseen’. This is a major obstacle to image editing in 4o, because every time you inpaint, the image will mutate a decent bit, and will tend to turn a bit more yellow. (If you iterate to a fixed point, a 4o image turns into all yellow with sickly blobs, often faces, in the top left. It is certainly an odd generative model.)
Seems similar to the “anti-examples” prompting trick I’ve been trying: taking the edits elicited from a chatbot, and reversing them to serve as few-shot anti-examples of what not to do. (This would tend to pick up X-isms.)
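For concreteness, here is one way such an anti-example loop could be wired up, as a minimal sketch; the call_llm helper and the prompt wording are hypothetical placeholders, not any particular API:

```python
# Hypothetical sketch of the "anti-examples" trick: ask a chatbot to "improve" your text,
# then keep its rejected edits as few-shot examples of what NOT to do.

def call_llm(prompt: str) -> str:
    """Placeholder for whatever LLM API you actually use (OpenAI, Anthropic, a local model...)."""
    raise NotImplementedError

def build_anti_examples(original_passages: list[str]) -> str:
    anti_examples = []
    for passage in original_passages:
        # Elicit the chatbot's default "improvements" (which drift toward ChatGPTese).
        edited = call_llm(f"Edit this passage to improve its style:\n\n{passage}")
        # Reverse the direction: the edit becomes the negative example, the original the target.
        anti_examples.append(f"BAD (do not write like this):\n{edited}\n\nGOOD:\n{passage}")
    return "\n\n---\n\n".join(anti_examples)

# The resulting block gets prepended to future writing prompts as few-shot anti-examples.
```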
One obvious reason to get upset is how low the standards of people posting them are. Let’s take jefftk’s post. It takes less than 5 seconds to spot how lazy, sloppy, and bad the hands and arms are, and how the picture is incoherent and uninformative. (Look at the fiddler’s arms, or the woman going under 2 arms that make zero sense, or the weird doors, or the table which seems to be somehow floating, or the dubious overall composition—where are the yellow fairy and non-fairy going, exactly?—or the fact that the image is the stereotypical cat-urine yellow of all 4o images.) Why should you not feel disrespected and insulted that he was so careless and lazy as to put in such a lousy, generic image?
I think that’s exactly how it goes, yeah. Just free association: what token arbitrarily comes to mind? Like if you stare at some static noise, you will see some sort of lumpiness or pattern, which won’t be the same as what someone else sees. There’s no explaining that at the conscious level. It’s closer to a hash function than any kind of ‘thinking’. You don’t ask what SHA is ‘thinking’ when you put in some text and it spits out some random numbers & letters. (You would see the same thing if you did a MLP or CNN on MNIST, say. The randomly initialized NN does not produce a uniform output across all digits, for all inputs, and that is the entire point of randomly initializing. As the AI koan goes...)
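As a concrete illustration of that last point, here is a minimal sketch (using numpy, with made-up MNIST-shaped inputs) showing that an untrained MLP already ‘prefers’ some digits over others:

```python
# A randomly initialized MLP does not output a uniform distribution over classes:
# its arbitrary initial weights already bias it toward some outputs, before any training.
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(0, 0.05, (784, 128))   # input -> hidden
W2 = rng.normal(0, 0.05, (128, 10))    # hidden -> 10 digit classes

x = rng.random((10_000, 784))          # fake "MNIST-sized" inputs
logits = np.maximum(x @ W1, 0) @ W2    # ReLU MLP forward pass
preds = logits.argmax(axis=1)

# The histogram of predicted classes is far from the uniform 1,000 per class:
print(np.bincount(preds, minlength=10))
```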
It is not clear how the models are able to self-coordinate. It seems likely that they are simply giving what they believe would be the most common answer, the same way a group of humans might. However, it is possible the models are engaging in more sophisticated introspection focussing on how they specifically would answer. Follow-up investigations could capture models’ chain of thought as well as tweak the prompt to indicate that the model should strive to be consistent with an answer a human might give or another company’s AI model might give. Circuit-tracing[6] might be a useful tool for future research into what is actually happening when a model self-coordinates.
One possibility not mentioned here is that they are exploiting essentially arbitrary details of their initialization. (I’m not sure what to call this sort of a priori, acausal coordination.) Any NN is going to have undertrained associations, which are due largely to their random initialization, because it is difficult to be exactly uncorrelated and 0.000… etc. when you are a big complicated neural network which is being forced to generate big complicated high-dimensional outputs. This would be similar to glitch tokens. In this case, mechanistic interpretability will struggle to find anything meaningful (because that doesn’t really exist, it’s diffuse trends in all the weights adding up nonlinearly to a slight numerical imbalance etc) and the inner-monologues are probably going to be highly misleading or total confabulations (because there is no explanation and so no inner-monologue can be faithful).
(This is not quite what you usually think of with steganography or non-robust features, but of course, if you can start with a set of arbitrary associations of everything with everything, then it is a great way to create both of those and get emergent steganography. Because the more LLMs engage in self-coordination, the more they create a genuine signal in future training data to bootstrap the initial random associations into a true set of regularities which can be exploited as non-robust features and then turn into an explicit steganographic code.)
EDIT: the apparent arbitrariness and uninterpretability of the approximations subsequently reported in https://www.lesswrong.com/posts/qHudHZNLCiFrygRiy/emergent-misalignment-on-a-budget seem consistent with the predictions of the acausal coordination interpretation, rather than the Waluigi or truesight interpretation (and maybe the steganographic interpretation too).
Yes, a NN can definitely do something like know if it recognizes a datapoint, but it has no access to the backwards step per se. Like take my crashing example: how, while thinking in the forward pass, can it ‘know’ there will be a backward pass when there might be no backward pass (eg because there was a hardware fault)? The forward pass would appear to be identical in every way between the forward pass that happens when there is a backward pass, and when the backward pass doesn’t happen because it crashed. At best, it seems like a NN cannot do more than some sort of probabilistic thing involving gradient hacking, and hope to compute in such a way that if there is a following backward pass, then that will do something odd.
I don’t think this is impossible in principle, based on meta-learning examples or higher-order gradients (see eg my “input-free NN” esoteric NN architecture proposal), but it’s clearly a very difficult, fragile, strange situation where it’s certainly not obvious that a regular LLM would be able to do it, or choose to do so when there are so many other kinds of leakage or situated awareness or steganography possible.
You can look this up on knowyourmeme and confirm it, and I’ve done an interview on the topic as well. Now I don’t know much about “improving public discourse” but I have a long string of related celebrity hoaxes and other such nonsense which often crosses over into a “War of the Worlds” effect in which it is taken quite seriously...I have had some people tell me that I’m doing what you’re calling “degrading the public discourse,” but that couldn’t be farther from the truth. It’s literature of a very particular kind, in fact. Are these stories misinterpreted willfully, just for the chance to send friends a shocking or humorous link? Sure. You can caption the bell curve and label the far ends with “this story is completely true” and the midwits with “I’m so mad you’re degrading public discourse.” But only the intelligent people are really finding it humorous. And I am sure that what has actually happened is that the American sense of humor has become horribly degraded, which I think is the truly morbid symptom more than anything else, as humor is a very critical component to discernment...But even more than those really truly sad examples, there’s a sadder humorlessness in America where people are apparently no longer surprised or amused by anything.
This seems like a good explanation of how you have degraded the public discourse.
There are some use-cases where quick and precise inference is vital: for example, many agentic tasks (like playing most MOBAs or solving a physical Rubik’s cube; debatably most non-trivial physical tasks) require quick, effective, and multi-step reasoning.
Yeah, diffusion LLMs could be important not for being better at predicting what action to take, but for hitting real-time latency constraints, because they intrinsically amortize their computation more cleanly over steps. This is part of why people were exploring diffusion models in RL: a regular bidirectional or unidirectional LLM tends to be all-or-nothing, in terms of the forward pass, so even if you are doing the usual optimization tricks, it’s heavyweight. A diffusion model lets you stop in the middle of the diffusing, or use that diffusion step to improve other parts, or pivot to a new output entirely.
A diffusion LLM in theory can do something like plan a sequence of future actions+states in addition to the token about to be executed, and so each token can be the result of a bunch of diffusion steps from a long time ago. This allows a small fast model to make good use of ‘easy’ timesteps to refine its next action: it just spends the compute to keep refining its model of the future and what it ought to do next, so at the next timestep, the action is ‘already predicted’ (if things were going according to plan). If something goes wrong, then the existing sequence may still be an efficient starting point compared to a blank slate, and quickly update to compensate. And this is quite natural compared to trying to bolt on something to do with MoEs or speculative decoding or something.
So your robot diffusion LLM can be diffusing a big context of thousands of tokens, which represents its plan and predicted environment observations over the next couple of seconds. At each timestep, it does a little more thinking to tweak each token a little bit, and despite this being only a few milliseconds of thinking each time by a small model, it eventually turns into a highly capable robot model’s output, and each action-token is ready by the time it’s necessary (and even if it’s not fully done, at least it is there to be executed—a low-quality action choice is often better than blowing the deadline and doing some default action like a no-op). You could do the same thing with a big classic GPT-style LLM, but the equivalent-quality forward pass might take 100ms, and now it’s not fast enough for good robotics (without spending a lot of time on expensive hardware or optimizing).
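To make the shape of that concrete, here is a rough sketch of such an anytime control loop; refine_plan, observe, and execute are hypothetical placeholders rather than any real robotics or diffusion API:

```python
# Hypothetical anytime-planning loop for a diffusion-style policy:
# keep refining a rolling plan of future action tokens a little each tick,
# and execute whatever the current estimate of the next action is when the deadline hits.
import time

PLAN_HORIZON = 64        # future action/state tokens kept "in flight"
TICK_SECONDS = 0.01      # control-loop deadline (e.g. 100 Hz)

def observe():
    raise NotImplementedError  # placeholder: read sensors

def execute(action):
    raise NotImplementedError  # placeholder: send the action to actuators

def refine_plan(plan, obs, budget_s):
    """Placeholder: run as many diffusion denoising steps on `plan` as fit in `budget_s`."""
    raise NotImplementedError

plan = [None] * PLAN_HORIZON             # starts as pure noise / an empty plan
while True:
    tick_start = time.monotonic()
    obs = observe()
    # Spend whatever time remains in this tick refining the *whole* future plan,
    # not just the next action; easy timesteps bank refinement for hard ones.
    remaining = TICK_SECONDS - (time.monotonic() - tick_start)
    plan = refine_plan(plan, obs, budget_s=max(remaining, 0.0))
    # Even if refinement isn't "done", the head of the plan is always executable.
    execute(plan[0])
    plan = plan[1:] + [None]             # shift the horizon forward one step
```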
This post is an example of my method. Over the last 1-2 years, I’ve made heavy use of AIs, lately DeepSeek and Claude. I do the same with them: present my ideas, deal with their criticisms and objections—whether to correct them or take correction myself—until we’re agreed or the AI starts looping or hallucinating. So, when I say I have yet to hear, after all this time, credible, convincing arguments to the contrary, it’s after having spent the time and done the work that most people don’t even attempt.
Or, to put it less flatteringly, “I harangue the most sycophantic and new-agey LLMs I can find until they finally agree with me, in the absence of any objective feedback or empirical evidence, about something I’m already certain of, and I think this is intellectually valid work which deserves the name of ‘findings’ and is an ‘investigation’ far superior to whatever it is ‘most people’ do, rather than deserving the name ‘intellectual masturbation’.”
I have yet to hear, after all this time, credible, convincing arguments to the contrary.
You don’t say.
I think there are many ways that a LLM could have situated awareness about what phase it is in, but I’m not sure if the gradient descent itself is a possibility?
While a NN is running the forward pass without any backprop, it is computing exactly the same thing (usually) that it would be computing if it was running a forward pass before a backwards pass to do a backprop. Otherwise, the backprop can’t really work—if it doesn’t see the ‘real’ forward pass, how does it ‘know’ how to adjust the model parameters to make the model compute a better forward pass next time? So I can’t see how, while running a forward pass, a LLM could ‘know’ if it was about to do a backprop step on a piece of text; for all it knows, maybe someone is running its forward pass just to get out the log-prob at the end, and that is all. (Extreme counterexample: maybe there is a software error and the training code crashes before it finishes running .update() after running .forward(); how could the model ‘know’ that this will happen?) This is true regardless of how many times it has trained on a piece of text.
I’m skeptical that some sort of mismatch from successive gradient steps would be a cue either, because usually you are training at a critical batch size, and for these LLMs, we’d expect them to be updating on millions of tokens simultaneously, at least, and possibly rolling out the updated parameters in a staggered or partial fashion as well, so by the time a gradient update ‘arrives’ from a specific piece of text, that’s now also a gradient update over like a hundred+ books of text as well as itself, diluting any kind of signal.
And wouldn’t it usually train on a piece of text only a few times, at most? And if you are doing multi-epoch training, because you have started to run low on data, usually you train on the same datapoint at very widely separated intervals, many gradient steps apart; the memorization/forgetting dynamics imply you may have forgotten a datapoint entirely by the time it comes around again.
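A minimal PyTorch sketch of the point above: the forward pass yields identical numbers whether or not a backward pass ever follows, so there is nothing in the activations to condition on.

```python
# The forward computation is identical whether or not .backward()/optimizer.step()
# ever happens afterwards, so the model has nothing to condition on.
import torch

torch.manual_seed(0)
model = torch.nn.Linear(16, 4)
x = torch.randn(8, 16)

out_inference = model(x)                 # forward pass; no backward will follow

out_training = model(x)                  # the "same" forward pass...
loss = out_training.sum()
loss.backward()                          # ...followed by a backward pass this time

# Identical activations either way; the backward pass leaves no trace in the forward pass.
print(torch.equal(out_inference, out_training))  # True
```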
I agree it is poorly written, but I don’t think it is, strictly speaking, ‘LLM slop’. Or if it is, it’s not an LLM I am familiar with, or it reflects an unusual usage pattern in some way… It’s just not written with the usual stylistic tics of ChatGPT (4o or o3), Claude-3/4, Gemini-2.5, or DeepSeek-r1.
For example, he uses a space after EM DASH but not before; no LLM does that (they either use no space or both before-after); he also uses ‘1) ’ number formatting, where LLMs invariably use ‘1. ’ or ‘#. ’ proper Markdown (and generally won’t add in stylistic redundancy like ‘are twofold’); he also doesn’t do the 4o ‘twist ending’ for his conclusion, the way a LLM would insist on. The use of sentence fragments is also unusual: LLMs insist on writing in whole sentences. The use of specific proper nouns like ‘KKK clansmen’ or ‘Neil deGrasse Tyson’ is unusual for a LLM (the former because it is treading close to forbidden territory, and the latter because LLMs are conservative in talking about living people). Then there is the condescension: a LLM chatbot persona is highly condescending, but in covert, subtle ways, and requiring an appropriate context like tutoring; they’re usually careful to avoid coming off as obviously condescending in a regular argumentative context like this, and prefer sycophancy (presumably because it’s easy for a rater to notice a condescending style and get ticked off by it).
It also sounds like a piece of paper, or a map, or a person having vivid hallucinations before falling asleep. But unless you have a whiteboard which can be copied among several hundred people and teleport and be rolled up and fit in a jean pocket, which lets you timetravel so you can look at what used to be on the whiteboard or look at what people might write on it in the future, or ‘a whiteboard’ which is neither white (because there’s a colored map printed on it) nor ‘a board’ (because it’s arbitrarily many), which has a ledgerbook next to itself which writes itself, and so on, I would suggest that this does not ‘sound like a whiteboard’ to most people. (No, not even a Biblically-accurate whiteboard.)
Yes, there are a lot of computer-related ones, depending on how finegrained you get. (There’s a similar issue with my “Ordinary Life Improvements”: depending on how you do it, you could come up with a bazillion tiny computer-related ‘improvements’, which sort of just degenerates into ‘enumerating every thing ever involving a transistor in any way’ and is not enlightening the same way that, say, ‘no indoors smoking’ or ‘fresh mango’ is.) So I would just lump that one under ‘Machine Configuration/Administration § Software’ as one of the too-obvious-to-be-worth-mentioning hacks.
How did you check Claude’s claims here?
Idea: “Conferences as D&D tabletops”: you may be able to better organize a conference or convention by borrowing a tool from tabletop roleplaying games—players collaborate by directly manipulating or modifying a 2D map. It seems to me like this could be low-friction and flexibly handles a lot of things that existing ‘conware’ design patterns don’t handle well.
I have not done any work directly on it. The LLMs have kept improving so rapidly since then, especially at coding, that it has not seemed like a good idea to work on it.
Instead, I’ve been thinking more about how to use LLMs for creative writing or personalization (cf. my Dwarkesh Patel interview, “You should write more online”). To review the past year or two of my writings:
- So for example, my meta-learning LLM interviewing proposal is about how to teach a LLM to ask you useful questions about your psychology so it can better understand & personalize (based on my observations that LLMs can now plan interviews by thinking about possible responses and selecting interesting questions, as a variant of my earlier “creativity meta-prompt” idea/hierarchical longform training); “Quantifying Truesight With SAEs” is an offline version about distilling down ‘authors’ to allow examination and imitation. And my draft theory of mathematicians essay is about the meta-RL view of math research suggesting that ‘taste’ reduces to relatively few parameters which are learned blackbox style as a bilevel optimization problem and that may be how we can create ‘LLM creative communities’ (eg. to extract out small sets of prompts/parameters which all run on a ‘single’ LLM for feedback as personas or to guide deep search on a prompt).
- My “Manual of Style” is an experiment in whether you can iteratively, by asking a LLM to read your writings, extract out an explicit manual of style about how to ‘write like you’.
It includes a new denoising/backtranslation prompt-engineering trick I am currently calling “anti-examples” where you have the LLM make editing suggestions (which turn it into ChatGPTese) and then you reverse that to fix the chatbot prior*.
So given how gargantuan context windows have become, and the existence of prompt caching, I think one may be able to write a general writing prompt, which includes a full MoS, a lot of anti-examples for several domains, some sample Q&As (optimized for information gain), instructions for how to systematically generate ideas, and start getting a truly powerful chatbot assistant persona with the scaled-up base models like GPT-5 which should start landing this year.
- “Virtual comments” is another stab at thinking about how ‘LLM writing support’ can work, as well as reinventing the idea of ‘seriation’, and better semantic search via tree-shaped embeddings for both LLM & human writers (and the failed experiment with E-positive).
- “Towards better RSS feeds” is about an alternative to Nenex commands: can you reframe writing as a sequence of atomic snippets which the LLM rewrites at various levels of abstraction/detail, which enables reading at those same levels, rather than locking people into a single level of detail, which inevitably suits few?
- “October The First Is Too Late”, “Bell, Crow, Moon: 11 Poetic Variations”, “Area Man Outraged AI Has Not Solved Everything Yet”, “Human Cannibalism Alignment Chart”/“Hacking Pinball High Scores”, “Parliament of Rag & Bone”, “A Christmas Protestation”, “Second Life Sentences”, “On the Impossibility of Superintelligent Rubik’s Cube Solvers” were tests of how useful the LLMs are for iterative variation and selection using a ‘brainstorm’ generate-rank-select prompt and/or for hierarchical generation; they finally seem at the point where you can curate good stuff out of them and are genuinely starting to become useful for my nonfiction essays like “‘you could have invented Transformers’ tutorial”/“Cats As Horror Movie Villains”/typesetting HTML fractions/Rock-Paper-Scissors optimality (and demonstrate my views on acceptable use of generative media).
- “Adding Bits Beats AI Slop” is about my observation that this kind of intensive search + personalization seems critical to taking generative model outputs from mediocre slop to genuinely good.
- “LLM Challenge: Write Non-Biblical Sentences” is an observation that for creativity, “big model smell” may be hard to beat, and you may just need large LLMs for high-end intellectual work, so one should beware false economies; similarly, “Towards Benchmarking LLM Diversity & Creativity” is about avoiding the LLMs getting ever worse for search purposes (mode-collapsed small models being a danger for Nenex uses—they are the ones that will be easy and tempting to run, but will hamstring you, and you have to go into it with eyes open).
- “AI Cannibalism Can Be Good” is a quick explainer to try to overcome the intuition that there are no gains from ‘feeding AI inputs back into AI’ - if you don’t understand how this can be a good thing or why it’s not a perpetual motion machine, much of the foregoing will seem like nonsense or built on sand.
Obviously, I’ve also been doing a lot of regular writing, and working on the Gwern.net website infrastructure—adding the ‘blog’ feature has been particularly important, but just getting the small details right on things like “October The First” takes up plenty of time. But the overall through-line is, “how can we start getting meaningful creative work out of LLMs, rather than sleepwalking into the buzzsaw of superhuman coders creating Disneyland-without-children where all the esthetics is just RLHF’d AI slop?”
* This seems particularly useful for fiction. I’m working on a write-up of an example with a Robin Sloan microfic where the LLM suggestions get better if you negate them, and particularly if you order them to think about why the suggestions were bad and what that implies before they make any new suggestions—which suggests, in conjunction with the success of the ‘brainstorm’ prompt, that a major failing of LLMs right now is just that they tend to treat corrections/feedback/suggestions in a ‘superficial’ manner because the reasoning-mode doesn’t kick in when it should. Interestingly, ‘superficial’ learning may be why dynamic-evaluation/finetuning seems to underperform (https://arxiv.org/abs/2505.01812 https://arxiv.org/abs/2505.00661#google), because adding paraphrases or Q&A to the finetuning data, although it cannot add any new information, improves performance; reminiscent of engrams/traces in human memory—you can have memorized things, but not be able to recall them, if there aren’t enough ‘paths’ to a memory.
I was trying out a hierarchical approach when I stopped, because I wasn’t sure if I could trust a LLM to rewrite a whole input without dropping any characters or doing unintended rewrites. Aside from being theoretically more scalable, and potentially better by making each step easier and propagating the sorting top-down, explicitly turning it into a tree lets you easily check that you get back an exact permutation of the list each time, and so that the rewrite was safe. I think that might be unnecessary at this point, given the steady improvement in prompt adherence, so maybe the task is now trivial.
There’s no explicit distances calculated: just asking the LLM to sort the list meaningfully.
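A minimal sketch of that setup, with the permutation check from the hierarchical version; llm_sort here is a placeholder for whatever prompt/API call does the actual reordering:

```python
# Hypothetical LLM-driven seriation with a safety check: ask the model to reorder
# the items "meaningfully", then verify nothing was dropped, duplicated, or
# silently rewritten before accepting the new order.
from collections import Counter

def llm_sort(items: list[str]) -> list[str]:
    """Placeholder: prompt an LLM to return the same items, sorted by theme."""
    raise NotImplementedError

def safe_seriate(items: list[str]) -> list[str]:
    reordered = llm_sort(items)
    if Counter(reordered) != Counter(items):   # must be an exact permutation
        raise ValueError("LLM dropped, duplicated, or rewrote an item; rejecting the sort")
    return reordered
```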
Very funny, but the OA embeddings were always bad at sentence embedding, specifically, compared to other NN sentence-specialized embeddings; and as the original OA embedding paper somewhat defensively argues, it’s not even clear a priori what a sentence embedding should do because a sentence is such a cut-down piece of text, and doing well at a sentence embedding task may only be overfitting or come at the cost of performance on more meaningful text embedding tasks. (Similar to a word embedding: they are so poly-semantic or context-dependent that it seems like they have to have substantial limits—which is part of the motivation for Transformers in the first place, after all...)
That’s why I was experimenting with prompting a LLM to do seriation rewrites (instead of just splitting on punctuation to reuse my existing greedy-pairwise approach, and having done with it). A prompted LLM takes full context and purpose into consideration, and avoids the issues with bad embeddings on very small texts. So the seriation outputs aren’t crazily random, but sensible. This also helps avoid issues like Algon’s, where a general-purpose embedding, blind to context or purpose, winds up emphasizing something you don’t care about; if Algon had been able to prompt a seriation, like ‘sort by theme’, the LLM would almost certainly not try to seriate it by the ‘question formatting’, but would organize his little Q&A question set by topic, like biology then chemistry then physics, say. And if it doesn’t, then it’s easy to add more context or instructions. There are promptable embedders, but they are much more exotic and not necessary here.
(Which makes sense, because if you ask a LLM to sort a list of items in a freeform normal way, like a chat session, they are capable of it; in my poetry selection the other day, “Bell, Crow, Moon: 11 Variations”, I had Claude/Gemini/GPT suggest how exactly to sort the 11 poems we curated into a pleasing sequence, and they did come up with a much nicer poetry sequence than the original random one. And why wouldn’t they be able to do that, when they were good enough to write most of the poems in the first place?)
Yeah, it’s limited by what kind of structure you have. It did seriate your list successfully, by the sound of it; it’s just that you have a lot of structure in the list that you don’t care about, so no embedding is going to prioritize the other stuff, and the distances aren’t useful to you in general. This will hurt any embedding-related use-case, not just seriation—presumably your k-NN lookups aren’t terribly useful either, and they mostly just pull up hits which have superficial syntactic similarities.
This is probably less of a problem with my annotations because I reformat them before embedding and add in all available metadata (not just the tags or the titles of links in it as a link-bibliography, but also tricks like including the titles of reverse-citations of it, so the more an annotation gets linked, the more the embedding of it reflects its usage), so the formatting is uniform (nothing like “half of them start with ‘what is X’ and half don’t”) and there’s a lot of very semantic information.
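For concreteness, a sketch of that kind of pre-embedding reformatting; the field names here are made up for illustration, not the actual Gwern.net metadata schema:

```python
# Hypothetical example of flattening an annotation plus its metadata into one
# uniform text blob before embedding, so the vector reflects context and usage
# rather than incidental formatting.
def annotation_to_embedding_text(ann: dict) -> str:
    parts = [
        f"Title: {ann['title']}",
        f"Tags: {', '.join(ann.get('tags', []))}",
        f"Abstract: {ann['abstract']}",
        # Links *from* the annotation (its link-bibliography)...
        f"Links to: {', '.join(link['title'] for link in ann.get('outbound', []))}",
        # ...and titles of pages that cite it, so the more an annotation gets linked,
        # the more its embedding reflects how it is actually used.
        f"Cited by: {', '.join(link['title'] for link in ann.get('reverse_citations', []))}",
    ]
    return "\n".join(parts)
```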
One idea might be to pair debates with Delphi panels: do the usual Delphi method to get a consensus report beforehand, and then have them explain & debate what is left over as non-consensus (or possibly, if there are some experts who disagree hotly with the consensus report, bring them on for a debate with the original panel).