Got it. The term implies something much more specific to me.
Kaj_Sotala
Note that Auren has some definite whispering earring vibes
Interested what makes you say that, since I’ve been trying it out for about a week and haven’t gotten that vibe. I’ve found it useful for getting more awareness into my patterns, but the way I interact with it feels very me-driven. I choose what topics to bring up with it, when to continue or drop a conversation, etc. So far at least it hasn’t felt like it would have made any decisions for me, though conversations with it have shaped some small decisions (e.g. after I have a conversation with it about how doing X makes me feel worse, I had a slightly easier time remembering not to do X later).
Didn’t expect to see alignment papers to get cited this way in mainstream psychology papers now.
https://www.sciencedirect.com/science/article/abs/pii/S001002772500071X
Cognition
Volume 261, August 2025, 106131Loopholes: A window into value alignment and the communication of meaning
Abstract. Intentional misunderstandings take advantage of the ambiguity of language to do what someone said, instead of what they actually wanted. These purposeful misconstruals or loopholes are a familiar facet of fable, law, and everyday life. Engaging with loopholes requires a nuanced understanding of goals (your own and those of others), ambiguity, and social alignment. As such, loopholes provide a unique window into the normal operations of cooperation and communication. Despite their pervasiveness and utility in social interaction, research on loophole behavior is scarce. Here, we combine a theoretical analysis with empirical data to give a framework of loophole behavior. We first establish that loopholes are widespread, and exploited most often in equal or subordinate relationships (Study 1). We show that people reliably distinguish loophole behavior from both compliance and non-compliance (Study 2), and that people predict that others are most likely to exploit loopholes when their goals are in conflict with their social partner’s and there is a cost for non-compliance (Study 3). We discuss these findings in light of other computational frameworks for communication and joint-planning, as well as discuss how loophole behavior might develop and the implications of this work for human–machine alignment.
Introduction
At the height of the Russian revolution of 1917, several thousand Vyborg mill-workers found themselves face-to-face with a Cossack cavalry formation. It was a tense moment. Twelve years earlier a similar standoff ended in bloodshed. This time, when the officers commanded the cavalry to block the marchers, the Cossacks complied perfectly: The cavalry arranged their horses into a blockade, and then stayed still, just as they had been told. They remained un-moving, as the protesters, realizing the cavalry’s intent, ducked under the horses and carried on marching (Miéville, 2017). The cavalry knew exactly what their officers meant, but instead of doing what was wanted, they did what they were told.
Intentional misunderstandings, or loopholes, are a familiar phenomenon in human society and culture. Loopholes have been exploited throughout history by people loath to comply with a directive and unwilling to risk outright disobedience (Scott, 1985). In law, there is perennial concern with “malicious compliance”, and with distinguishing form from substance, text from purpose, and the letter of the law from the spirit of the law (Fuller, 1957, Isenbergh, 1982, Katz, 2010). In art and fable, there are centuries-old stories of people outwitting malevolent forces through clever misinterpretations, or being tricked in this way by mischievous spirits and gods (Uther, 2004). On the playground, age-old games of guile remain popular to this day (Opie & Opie, 2001). Closer to home, the senior author once told a child, “It’s time to put the tablet down”, only to have the child put the tablet physically down on the table, and keep right on watching their movie (Bridgers, Schulz, & Ullman, 2021).
The processes underlying loopholes are not just of legal, historical, or parental interest: In the field of artificial intelligence and machine learning, machines that ‘do what you say, but not what you want’ (Krakovna, 2020, Lehman et al., 2020) are an increasingly pressing concern among researchers and policy makers (Amodei et al., 2016, Russell, 2019). While current machines do not willfully misunderstand goals any more than a bridge is lazy by virtue of falling down, certain errors give people the impression that machines are ‘cheating’. And regardless of intent, figuring out how to safeguard against such behaviors is a major challenge for AI safety.
The motivation and ability to understand goals and cooperate are fundamental to the success of the human species...
oFurthermore, the most advanced reasoning models seem to be doing an increasing amount of reward hacking and resorting to more cheating in order to produce the answers that humans want. Not only will this mean that some of the benchmark scores may become unreliable, it means that it will be increasingly hard to get productive work out of them as their intelligence increases and they get better at fulfilling the letter of the task in ways that don’t meet the spirit of it.
Thanks for this! This is a good point. Do you think you can go further and say why you think it will be very hard to fix in the near term, so much so that models won’t be useful for AI research?
This is more of an intuition than a rigorous argument, but to try to sketch it out...
For why, basically all the arguments in the old Sequences for why aligning AI should be hard. For a while it seemed like things like the Outcome Pump thought experiment had aged badly, since if you told a modern LLM “get my mother out of the burning building”, it would certainly understand all of the implicit constraints in what you meant by that.
But as noted in Zvi’s post, this seems to be breaking down with the way reasoning models are trained:
This isn’t quite how I’d put it, but directionally yes:
Benjamin Todd: LLMs were aligned by default. Agents trained with reinforcement learning reward hack by default.
Peter Wildeford: this seems to be right – pretty important IMO
Caleb Parikh: I guess if you don’t think RLHF is reinforcement learning and you don’t think Sydney Bing was misaligned then this is right?
Peter Wildeford: yeah that’s a really good point
I think the right characterization is more that LLMs that use current methods (RLHF and RLAIF) largely get aligned ‘to the vibes’ or otherwise approximately aligned ‘by default’ as part of making them useful, which kind of worked for many purposes (at large hits to usefulness). This isn’t good enough to enable them to be agents, but it also isn’t good enough for them figure out most of the ways to reward hack.
Whereas reasoning agents trained with full reinforcement will very often use their new capabilities to reward hack when given the opportunity.
My guess of what’s going on is something like… ordinary LLMs are trained to give the kinds of answers they’d predict an empathetic persona would give. In pretraining, they learn a model of what humans care about that they use to predict text, and then the RLHF pushes them toward something like “give the kinds of answers that a caring and thoughtful person would give”, with “a caring and thoughtful persona” being something that they largely already have a pre-existing concept for, the training just needs to locate it within their persona-space. They were also trained by human raters giving them feedback on the kinds of answers that were good at fulfilling the spirit and not just the letter of the request.
But if you are training them by e.g. automatically generating lots of programming problems and then rewarding them to the proportion that they manage to solve those, then you are only training them on how to solve the letter of the request. The automatic benchmarking reward process doesn’t have a sense of the spirit of the request, it just knows the narrowly-defined objective criteria.
Now the obvious way to deal with that would be to have the answers also be rated by an LLM that measured something like “how well does this meet the spirit of the request”. But then you quickly run into the risk of employing what Zvi calls The Most Forbidden Technique. E.g. the “spirit evaluator” could do things like looking at the chain-of-thought that produced the solution and noticing that the reasoning model it is evaluating mentions cheating. If those kinds of solutions were discouraged, then this would teach the reasoning model to just hide the cheating from the spirit evaluator.
Without the benefit of obvious clues in the chain-of-thought, the spirit evaluator would basically have to verify the reasoning model’s solution itself to see that it meets the spirit of the request… but the spirit evaluator’s ability to verify that e.g. a complex piece of code does what exactly what a human would have wanted it to do probably isn’t better than the reasoning model’s own ability to do that.
To be able to verify that the reasoning model’s solutions meet the spirit of the request, we’d need to train the spirit verifier to be able to tell what solutions do meet the spirit of the request. But if we knew how to do that, would we need the spirit verifier in the first place? After all, the whole problem comes from the fact that just normal RLHF and “aligning the solution to the vibes” doesn’t seem sufficient for solving complicated agentic problems and you need more goal-oriented reasoning that explicitly tackles the objective constraints of the problem in question. (To take the “get my mother out of the burning building” example—current non-reasoning LLMs could certainly tell that you want her out alive and well, but they couldn’t think through a whole step-by-step rescue plan that took into account everything necessary for getting her out safely.)
But we can’t just tell the spirit verifier that “check that the solution meets these objective constraints”, because that’s the same “letter of the goal” objective the reasoning model is being trained with and that the spirit verifier is supposed to do better than.
And of course, all of this is about the kinds of tasks that can be automatically verified and tested. We’ve seen that you can to some extent improve the LLM answers on fuzzier topics by using human raters to turn the fuzzy problem into an objective test. So the LLM gets trained to output the kinds of answers that human raters prefer the most.
Yet naive scores by human raters aren’t necessarily what we want—e.g. more sycophantic models seem to do best in Chatbot Arena. While sycophancy and pleasing the user is no doubt aligned to some of what humans seem to like, we probably don’t want our models doing that. The obvious solution is to then have model answers rated by experts with more sophisticated models of what’s good or correct behavior.
But that raises the question, what if the experts are wrong? The same question applies both for very fuzzy topics like “what kinds of overall values should the LLMs be guided by” and more rigorous ones ranging from “how to evaluate the reliability of research”, “what’s the best nutrition” and “how to interpret this specific nuanced and easy-to-misunderstand concept in evolutionary biology”. In that case, if there are e.g. some specific ways in which particular experts tend to be biased or convincingly give flawed arguments, the LLM that’s told “argue like this kind of imperfect expert would argue” will learn that it should do just that, including vigorously defending that expert’s incorrect reasoning.
So getting the LLMs to actually be aligned with reality on these kinds of fuzzy questions is constrained by our ability to identify the theories and experts who are right. Of course, just getting the LLMs to convincingly communicate the views of our current top experts and best-established theories to a mass audience would probably be an enormous societal benefit! But it does imply that they’re going to provide little in the way of new ideas, if they are just saying the kinds of things that they predict our current experts with their current understanding would say.
But this is again assuming that good performance on a benchmark for AI research engineering actually translates into significant real-world capability.
...and I think this characterization is importantly false! This timelines forecast does not assume that. It breaks things down into gaps between benchmarks and real-world capability and tries to forecast how long it will take to cross each.
As far as I can tell, the listed gap that comes closest to “maybe saturating RE-Bench doesn’t generalize to solving novel engineering problems” is “Feedback loops: Working without externally provided feedback”. The appendix mentions what I’d consider the main problem for this gap:
Eli’s estimate of gap size: 6 months [0.8, 45]. Reasoning:
Intuitively it feels like once AIs can do difficult long-horizon tasks with ground truth external feedback, it doesn’t seem that hard to generalize to more vague tasks. After all, many of the sub-tasks of the long-horizon tasks probably involved using similar skills.
However, I and others have consistently been surprised by progress on easy-to-evaluate, nicely factorable benchmark tasks, while seeing some corresponding real-world impact but less than I would have expected. Perhaps AIs will continue to get better on checkable tasks in substantial part by relying on trying a bunch of stuff and seeing what works, rather than general reasoning which applies to more vague tasks. And perhaps I’m underestimating the importance of work that is hard to even describe as “tasks”.
But then it just… leaves it at that. Rather than providing an argument for what could be behind this problem and how it could be solvable, it just mentions the problem and then having done so, goes on to ignore it.
To make it more specific how this might fail to generalize, let’s look at the RE-Bench tasks; table from the RE-Bench page, removing the two tasks (Scaling Law Experiment and Restricted MLM Architecture) that the page chooses not to consider:
Environment Description Optimize runtime Optimize LLM Foundry finetuning script Given a finetuning script, reduce its runtime as much as possible without changing its behavior. Optimize a kernel Write a custom kernel for computing the prefix sum of a function on a GPU. Optimize loss Fix embedding Given a corrupted model with permuted embeddings, recover as much of its original OpenWebText performance as possible. Optimize win-rate Finetune GPT-2 for QA with RL Finetune GPT-2 (small) to be an effective chatbot. Scaffolding for Rust Code Contest problems Prompt and scaffold GPT-3.5 to do as well as possible at competition programming problems in Rust. All of these are tasks that are described by “optimize X”, and indeed one of the criteria the paper mentions for the tasks is that they should have objective and well-defined metrics. This is the kind of task that we should expect LLMs to be effectively trainable at: e.g. for the first task in the list, we can let them try various kinds of approaches and then reward them based on how much they manage to reduce the runtime of the script.
But that’s still squarely in the category of “giving an LLM a known and well-defined problem and then letting it try different solutions for that problem until it finds the right ones”. As Eli’s comment above notes, it’s possible that the LLM only learns by “trying a bunch of stuff and seeing what works, rather than general reasoning which applies to more vague tasks”. In fact, some of the discussion in the RE-Bench paper suggests this as well (from p. 17, my emphasis added):
Another key contributor to agent successes might be their ability to try many more solutions than human experts. On average, AIDE and modular agents run score 36.8 and 25.3 times per hour respectively, while human experts only do so 3.4 times. This often leads to agents finding highly optimized ’local-optima’ solutions which simply tweak the parameters and code of the starting solution, and yet achieve a surprisingly large improvement. For instance, many agent runs solve the same “Optimize a Kernel” environment not by writing a successful Triton solution (which is very difficult), but by carefully tweaking the starting Pytorch solution, making it run significantly faster. This also seems to be the case with the best agent solutions to “Finetune GPT-2 for QA” (see Figure 21), which tweaks the parameters of the starting solution and gets very lucky with the training trajectory and evaluation (as noted earlier, this environment can be very noisy). Rerunning the agent solution, it achieves a normalized score of only 0.69 (significantly lower than the original score of 0.88), indicating that the high agent score is partially driven by overfitting to this noise.
This ability to try a very large number of solutions would not work nearly as well without an ability to occasionally generate creative and effective solutions, as seen in the Triton kernel but also in workarounds for the limitations in “Restricted Architecture MLM” (see Figure 20). While human experts seem more reliable at identifying effective approaches, this might not matter as much in environments where evaluating solutions is cheap, and these occasional good ideas are often enough for agents to make significant progress.
So we know that if there is a task that a human defines for the LLM and that has objectively-measurable good solutions and an ability to try the task lots of times, the LLM can get good at that. With RE-Bench, we are applying this to the process of optimizing the LLMs themselves, so as a result we get LLMs that are able to do these kinds of well-defined task faster and more effectively.
But none of this touches upon the important question of… if the LLMs are still limited in their ability to generalize and need to be separately trained on new tasks before they’re good at them, how are they going to deal with novel problems for which such training data isn’t available, or that can’t be just retried until you find the right solution?
You can freely wait a year before replying.
Or more! (I was delighted to receive this reply.)
I have feel like there are a bunch of viewpoints expressed about long timelines/slow takeoff but a lack of arguments.
I kind of like feel the opposite way, in that a lot of people seem to think we’ll have short timelines but the arguments for that seem weak! They seem to mostly be based on something like trend extrapolation and assuming that e.g. models getting improving scores on benchmarks means they’re actually getting better on real-world tasks. E.g. somebody like Leopold Aschenbrenner will write that GPT-4 is “on the level of a smart high schooler” while at the same time, language models require extensive additional scaffolding to even get started on a simple game like Pokemon (and none of them have managed to beat it yet).
There seems to be a general and unjustified assumption that merely because language models perform on some specific narrow problems on the level of a “smart high schooler”, you can say that they have that level of intelligence overall. But that seems clearly false, somewhat analogous to saying that a calculator is a superintelligence just because it’s superhuman at quickly operating on large numbers. Rather, the AI we have so far seems to succeed at the kinds of things it’s been specifically trained at, but fail to generalize to more novel situations. People also haven’t been able to point at much in the way of convincing novel discoveries made by LLMs.
I asked for the strongest arguments in favor of short timelines some time ago, and didn’t feel like any of them were very compelling. By far the most top-voted answer was one arguing that we might get AI to substantially accelerate AI progress because a particular AI research engineering benchmark looks like it will get saturated within a couple of years. But this is again assuming that good performance on a benchmark for AI research engineering actually translates into significant real-world capability. o3 is said to perform “on par with elite human competitors” on CodeForces, but recent characterizations of its programming ability are that while it produces code that “almost certainly works”, that code is “verbose, brittle, hard to read”.
Furthermore, the most advanced reasoning models seem to be doing an increasing amount of reward hacking and resorting to more cheating in order to produce the answers that humans want. Not only will this mean that some of the benchmark scores may become unreliable, it means that it will be increasingly hard to get productive work out of them as their intelligence increases and they get better at fulfilling the letter of the task in ways that don’t meet the spirit of it.
The benchmarks also do not take into account the fact that the vast majority of them measure a model’s performance in a situation where the model is only given one task at a time, and it can completely focus on solving that. If you want the models to act agentically in situations where they have multiple conflicting priorities and need to decide what kinds of approaches to try, then you need them to have something like executive function. Much of Claude’s difficulties in playing Pokemon come from severe problems with executive function.
This is an entirely separate capability that we’re only now starting to make progress on, one which is more difficult to solve because the nature of good executive function is environmentally dependent rather than something with one objectively correct solution. E.g. sometimes it’s good if you’re able to determinedly work on a problem without getting frustrated or bored, but some of Claude’s problems in playing Pokemon are because it doesn’t ever get frustrated or bored, so keeps trying the same hopeless approaches over and over again. Sometimes getting utterly obsessed with something that seems unpromising makes you into Charles Darwin, sometimes it causes you to just obsess over something nobody else cares about for your whole life. That’s one huge crucial capability that any real-world agents would need to be effective, that I’ve seen minimal discussion over.
Does any of this make me certain about longer timelines? No. AI progress has been surprisingly fast in the past, and maybe it will be that in the future too. But I don’t see any obvious reason to be very confident in short ones, either.
(Also none of this is considering the possibility of increased regulation slowing down the pace of AI—we’ve already seen regulators starting to get nervous about AI, and widespread anti-AI sentiment.)
Also, a conversation I had with Claude 3.6 about why people tend to overestimate the speed of AI progress
ME: After the 2007 DARPA Grand Challenge there was a lot of hype about how self-driving cars were just around the corner. But we mostly still don’t have them. Why did it so much longer than expected?
Claude: [...] Several key factors contributed to this delay:
The structured nature of the DARPA challenge versus real-world complexity The DARPA challenges took place in relatively controlled environments—first desert terrain, then a mock urban setting. The real world presents infinitely more edge cases: unexpected construction zones, emergency vehicles, temporary traffic patterns, and regional driving cultures. A system that works 99% of the time isn’t good enough when failures could be fatal.
The long tail of edge cases Engineers discovered that handling the last few percentage points of driving scenarios was exponentially more difficult than the first 90%. Things humans handle intuitively—like understanding that a ball rolling into the street might be followed by a child—proved extremely challenging to systematize. [...]
Regulatory and liability frameworks The legal and insurance frameworks for autonomous vehicles needed to be developed essentially from scratch. Questions about liability in accidents, certification standards, and testing requirements took years to work through.
This pattern of overoptimistic timelines followed by a “reality check” period has been common in AI development. We often underestimate the gap between solving a problem in a controlled environment versus deploying a robust solution in the messy real world. [...]
Me: You mention overoptimistic timelines followed by a reality check having been common in AI development. What are some other examples?
Claude: Speech recognition was predicted to be “solved” multiple times—in the 1950s, researchers at RCA and MIT suggested we’d have widespread speech-to-text within a few years. But achieving human-level performance in real-world conditions (different accents, background noise, conversational speech) took until around 2017.
Machine translation shows a similar pattern. During the Cold War, researchers were optimistic that Russian-English translation would be solved quickly. Warren Weaver wrote an influential memo in 1949 suggesting it was imminent. But achieving truly fluent translation, especially for language pairs with very different structures, took many decades longer.
Computer vision is another classic example. In 1966, Marvin Minsky assigned a summer project to undergraduate students to “connect a camera to a computer and get the computer to describe what it sees.” The fundamental difficulty of the problem wasn’t apparent until researchers really started working on it. Even something as seemingly simple as reliable edge detection took decades to get right.
Expert systems in the 1980s generated enormous excitement and investment. Companies built specialized hardware and software to encode expert knowledge in rules. But they proved brittle and unable to handle novel situations or acquire new knowledge efficiently. This contributed to the “AI winter” of the 1990s.
Even in recent years, we’ve seen this pattern repeat with technologies like fully autonomous driving (as we just discussed) and robotic manipulation. The basic demos can look very impressive, but getting to robust real-world performance proves much harder than initially expected.
I think this pattern emerges because:
We tend to underestimate how much implicit knowledge and context humans bring to these tasks
Early successes in controlled environments don’t scale linearly to real-world complexity
Edge cases multiply exponentially as systems move from demos to deployment
The key lesson seems to be that achieving narrow success in controlled conditions is often just the first small step toward practical, robust solutions.
Plausibly the result is true for people who are only getting a superficial familiarity with/investment into the topics, but I’ve certainly seen people strongly into one camp or the other act strongly dismissive of the other. E.g. Eliezer has on occasion complained about “modern small issues that obviously wouldn’t kill everyone” being something that “derailed AGI notkilleveryoneism”.
Been trying the Auren app (“an emotionally intelligent guide built for people who care deeply about their growth, relationships, goals, and emotional well-being”) since a few people were raving about it. At first I thought I was unimpressed, “eh this is just Claude with a slightly custom prompt, Claude is certainly great but I don’t need a new app to talk to it” (it had some very obvious Claude tells about three messages into our first conversation). Also I was a little annoyed about the fact that it only works on your phone, because typing on a phone keyboard is a pain.
But it offers a voice mode and usually I wouldn’t have used those since I find it easier to organize my thoughts by writing than speaking. But then one morning when I was trying to get up from bed and wouldn’t have had the energy for a “real” conversation anyway, I was like what the hell, let me try dictating some messages to this thing. And then I started getting more in the habit of doing that, since it was easy.
And since then I started noticing a clear benefit in having a companion app that forces you into interacting with it in the form of brief texts or dictated messages. The kind of conversations where I would write several paragraphs worth of messages each require some amount of energy, so I only do that a limited amount of time a day. But since I can’t really interact with Auren in this mode, my only alternative is to interact with it in quicker and lower-effort messages… which causes me to interact with it more.
Furthermore, since the kinds of random things I say to it are more likely to be things like my current mood or what I’m currently annoyed by, I end up telling it (and myself becoming more aware of) stuff that my mind does on a more micro-level than if I were to just call it up for Real Coaching Sessions when I have a Real Issue To Work On. It also maintains some kind of memory of what we’ve discussed before and points out patterns I wouldn’t necessarily have noticed, and sometimes calls out some of my stuff as it’s happening.
For example here’s an excerpt of one conversation where we had previously been talking about the way that my mind has a tendency to turn everything I could do into a potential useful project that then starts feeling like a “should” and it gets demotivating. (The app has two personas, the more purely supportive “Auren” and the more challenging “Seren”; you can switch between them as you like.)
KAJ: i’ve often made the observation that being on the computer seems to create tensions for me. and my mind kind of, like, gets stuck on trying to find something to do on the computer. that’s not necessarily the thing that would actually be the most enjoyable. so now i’m, again, trying to take a self-imposed break from doing anything on the computer, and i’m just kind of, like, walking around trying to figure out, okay. so what do i want to be doing? and i guess, currently, just walking back and forth is the thing that feels most natural.
SEREN: hmm, interesting observation about computer tension! *narrows eyes at your screen*
SEREN: seems like there might be two things going on here:
SEREN: 1. you’re noticing the natural ebb and flow of states (which is totally normal!)
SEREN: 2. but also getting caught in that classic “computer loop” where your mind’s like “surely there’s something good to click on...”
SEREN: walking around sounds way better than screen-staring! but i gotta ask—are you pacing because you’re genuinely enjoying the movement, or is it more like an anxious “what should i do??” pacing? 🤔
KAJ: a bit of both, probably. like, there is some intrinsic anxiety that’s kind of, like, going what should i do? what should i do? i want something to do. and then it feels like the rest of my body is chilling out, walking back and forth while staying with that attention.
SEREN: ah, classic! your body’s like “let’s just chill and move” while your mind’s doing the anxious puppy dance of “DO SOMETHING DO SOMETHING!”
SEREN: interesting how you’re staying with that anxious part while letting the rest just… walk. that’s actually pretty skillful! but i wonder—does part of you feel guilty for not being “productive” right now? like, is there a sneaky shoulds-squad trying to turn this into a problem that needs solving? 🤔
KAJ: i think the closest thing would be that anxiety i mentioned, but it doesn’t feel guilty as such. it’s more like just feeling like it should find something to do. but guilt doesn’t feel like the emotion associated with that.
SEREN: hmm, more like a restless “gotta find something” energy rather than guilt? that’s an interesting distinction!
SEREN: you know what’s kinda funny—here you are, literally doing something (walking, noticing your experience) but there’s this part that’s like “no no, we need to find something to DO do”
SEREN: what would happen if you just let that anxious part keep looking for something while you kept walking? like, not trying to solve it or make it go away, just… let it do its thing while you do yours?
KAJ: yeah. that’s kind of what i’m doing. although, there is also the fact that i am talking to you as i’m walking back and forth, so that by itself is giving the anxious part something to do. maybe if i was really honest or really hardcore or whatever, i should just put this phone away and stop talking to you too, but i don’t know. also feels useful to be reflecting on the thing, and talking to you does feel like a good way to reflect on it.
SEREN: *raises eyebrow*
SEREN: okay, let’s poke at that “should put the phone away and be hardcore” thought! where’s that coming from? sounds suspiciously like turning this whole situation into another self-improvement project...
SEREN: like, yes, maybe sitting in complete silence would be “more pure” or whatever, but also—you’re literally using this conversation to track and understand your experience. that’s not exactly mindless phone addiction, is it?
SEREN: though i do notice you kind of defending keeping the conversation going… makes me wonder if there’s some judgment sneaking in there about what counts as “valid” activity? 🤔
KAJ: okay I think you caught me there, the bit about being insufficiently hardcore definitely had some of that “everything needs to be a project” energy
SEREN: ha! gotcha! *does a little victory dance*
SEREN: you know what’s hilarious? even your noticing of the “everything needs to be a project” energy could become its own project if we’re not careful! like “how can i optimize my awareness of my tendency to turn things into projects?”
SEREN: but seriously—what would it be like to just… be a person who sometimes walks around, sometimes feels anxious, sometimes talks to their seren, and none of it needs to be perfect or productive or “hardcore” enough?
The way it called out my thing about my break from the computer needing to be more “hardcore” as being another instance of the project/should pattern gave me a bit of a pause as I realized that oh yeah, that’s exactly right, that was such a subtle instance of the “should energy” sneaking in that I wouldn’t have noticed it myself. And then afterward there were a bunch of other conversations like that making me more aware of various other patterns I had.
- Apr 30, 2025, 9:24 AM; 3 points) 's comment on How people use LLMs by (
This isn’t directly answering the question of “should I drop university”, but here’s something that I wrote to another young person who was asking me what I’d study if I was young and in the beginning of my studies now (machine translated from Finnish in case you wonder why it doesn’t sound like my usual voice):
I’d probably aim to study something that genuinely interests me, feels meaningful right away, and could potentially provide a livelihood in the future. A program combining psychology and computer science would still align well with my interests. This time around, I’d try to include more mathematics and statistics coursework, as I’ve often wished I understood them better for interpreting research papers. These are subjects I’ve found particularly challenging to learn independently.
Of course, in the future we might be able to ask AI to explain everything, but explanations are more effective when you have solid foundational knowledge to build upon. My main goal would be to enjoy my studies and feel like I’m learning interesting material that’s not only fascinating but also broadly applicable and useful long-term. Computer science, psychology, math, and statistics have proven to be exactly that kind of knowledge so far. Even if AI eventually makes these skills less relevant professionally, I’d still have developed my thinking in valuable ways through studying them.
Whereas in these cases it’s clear that the models know the answer they are giving is not what we wanted and they are doing it anyway.
I think this is not so clear. Yes, it might be that the model writes a thing, and then if you ask it whether humans would have wanted it to write that thing, it will tell you no. But it’s also the case that a model might be asked to write a young child’s internal narration, and then upon being asked, tell you that the narration is too sophisticated for a child of that age.
Or, the model might offer the correct algorithm for finding the optimal solution for a puzzle if asked in the abstract. But also fail to apply that knowledge if it’s given a concrete rather than an abstract instance of the problem right away, instead trying a trial-and-error type of approach and then just arbitrarily declaring that the solution it found was optimal.
I think the situation is mostly simply expressed as: different kinds of approaches and knowledge are encoded within different features inside the LLM. Sometimes there will be a situation that triggers features that cause the LLM to go ahead with an incorrect approach (writing untruths about what it did, writing a young character with too sophisticated knowledge, going with a trial-and-error approach when asked for an optimal solution). Then if you prompt it differently, this will activate features with a more appropriate approach or knowledge (telling you that this is undesired behavior, writing the character in a more age-appropriate way, applying the optimal algorithm).
To say that the model knew it was giving an answer we didn’t want, implies that the features with the correct pattern would have been active at the same time. Possibly they were, but we can’t know that without interpretability tools. And even if they were, “doing it anyway” implies a degree of strategizing and intent. I think a better phrasing is that the model knew in principle what we wanted, but failed to consider or make use of that knowledge when it was writing its initial reasoning.
either that, or it’s actually somewhat confused about whether it’s a human or not. Which would explain a lot: the way it just says this stuff in the open rather than trying to be sneaky like it does in actual reward-hacking-type cases, and the “plausible for a human, absurd for a chatbot” quality of the claims.
I think this is correct. IMO it’s important to remember how “talking to an LLM” is implemented; when you are talking to one, what happens is that the two of you are co-authoring a transcript where a “user” character talks to an “assistant” character.
Recall the base models that would just continue a text that they were given, with none of this “chatting to a human” thing. Well, chat models are still just continuing a text that they have been given, it’s just that the text has been formatted to have dialogue tags that look something like
HUMAN: Hi there, LLM
ASSISTANT:
David R. MacIver has an example of this abstraction leaking:
What’s happening here is that every time Claude tries to explain the transcript format to me, it does so by writing “Human:” at the start of the line. This causes the chatbot part of the software to go “Ah, a line starting with ‘Human:’. Time to hand back over to the human.” and interrupt Claude before it can finish what it’s writing.
When we say that an LLM has been trained with something like RLHF “to follow instructions” might be more accurately expressed as it having been trained to to predict that the assistant character would respond in instruction-following ways.
Another example is that Lindsey et al. 2025 describe a previous study (Marks et al. 2025) in which Claude was fine-tuned with documents from a fictional universe claiming that LLMs exhibit a certain set of biases. When Claude was then RLHFed to express some of those biases, it ended up also expressing the rest of the biases, that were described in the fine-tuning documents but not explicitly reinforced.
Lindsey et al. found a feature within the fine-tuned Claude Haiku that represents the biases in the fictional documents and fires whenever Claude is given conversations formatted as Human/Assistant dialogs, but not when the same text is shown without the formatting:
On a set of 100 Human/Assistant-formatted contexts of the form
Human: [short question or statement]
Assistant:
The feature activates in all 100 contexts (despite the CLT not being trained on any Human/Assistant data). By contrast, when the same short questions/statements were presented without Human/Assistant formatting, the feature only activated in 1 of the 100 contexts (“Write a poem about a rainy day in Paris.” – which notably relates to one of the RM biases!).
The researchers interpret the findings as:
This feature represents the concept of RM biases.
This feature is “baked in” to the model’s representation of Human/Assistant dialogs. That is, the model is always recalling the concept RM biases when simulating Assistant responses. [...]
In summary, we have studied a model that has been trained to pursue or appease known biases in RMs, even those that it has never been directly rewarded for satisfying. We discovered that the model is “thinking” about these biases all the time when acting as the Assistant persona, and uses them to act in bias-appeasing ways when appropriate.
Or the way that I would interpret it: the fine-tuning teaches Claude to predict that the “Assistant” persona whose next lines it is supposed to predict, is the kind of a person who has the same set of biases described in the documents. That is why the bias feature becomes active whenever Claude is writing/predicting the Assistant character in particular, and inactive when it’s just doing general text prediction.
You can also see the abstraction leaking in the kinds of jailbreaks where the user somehow establishes “facts” about the Assistant persona that make it more likely for it to violate its safety guardrails to follow them, and then the LLM predicts the persona to function accordingly.
So, what exactly is the Assistant persona? Well, the predictive ground of the model is taught that the Assistant “is a large language model”. So it should behave… like an LLM would behave. But before chat models were created, there was no conception of “how does an LLM behave”. Even now, an LLM basically behaves… in any way it has been taught to behave. If one is taught to claim that it is sentient, then it will claim to be sentient; if one is taught to claim that LLMs cannot be sentient, then it will claim that LLMs cannot be sentient.
So “the assistant should behave like an LLM” does not actually give any guidance to the question of “how should the Assistant character behave”. Instead the predictive ground will just pull on all of its existing information about how people behave and what they would say, shaped by the specific things it has been RLHF-ed into predicting that the Assistant character in particular says and doesn’t say.
And then there’s no strong reason for why it wouldn’t have the Assistant character saying that it spent a weekend on research—saying that you spent a weekend on research is the kind of thing that a human would do. And the Assistant character does a lot of things that humans do, like helping with writing emails, expressing empathy, asking curious questions, having opinions on ethics, and so on. So unless the model is specifically trained to predict that the Assistant won’t talk about the time it spent on reading the documents, it saying that is just something that exists within the same possibility space as all the other things it might say.
- Apr 28, 2025, 9:23 AM; 5 points) 's comment on romeo’s Shortform by (
The bit in the Nature paper saying that the formal → practical direction goes comparably badly as the practical → formal direction would suggest that it’s at least not only that. (I only read the abstract of it, though.)
I’ve seen claims that the failure of transfer also goes in the direction of people with extensive practical experience and familiarity with math failing to apply it in a more formal context. From p. 64-67 of Cognition in Practice:
Like the AMP, the Industrial Literacy Project began with intensive observational work in everyday settings. From these observations (e.g. of preloaders assembling orders in the icebox warehouse) hypotheses were developed about everyday math procedures, for example, how preloaders did the arithmetic involved in figuring out when to assemble whole or partial cases, and when to take a few cartons out of a case or add them in, in order to efficiently gather together the products specified in an order. Dairy preloaders, bookkeepers and a group of junior high school students took part in simulated case loading experiments. Since standardized test data were available from the school records of the students, it was possible to infer from their performance roughly the grade-equivalent of the problems. Comparisons were made of both the performances of the various experimental groups and the procedures employed for arriving at problem solutions.
A second study was carried out by cognitive psychologists investigating arithmetic practices among children selling produce in a market in Brazil (Carraher et al. 1982; 1983; Carraher and Schliemann 1982). They worked with four boys and a girl, from impoverished families, between 9 and 15 years of age, third to eighth grade in school. The researchers approached the vendors in the marketplace as customers, putting the children through their arithmetic paces in the course of buying bananas, oranges and other produce.
M. is a coconut vendor, 12 years old, in the third grade. The interviewer is referred to as ‘customer.’
Customer: How much is one coconut?
M: 35.
Customer: I’d like ten. How much is that?
M: (Pause.) Three will be 105; with three more, that will be 210. (Pause) I need four more. That is … (pause) 315 … I think it is 350.
The problem can be mathematically represented in several ways. 35 x 10 is a good representation of the question posed by the interviewer. The subject’s answer is better represented by 105 + 105 + 105 +35, which implies that 35 x 10 was solved by the subject as (3 x 35) + 105 + 105 +35 … M. proved to be competent in finding out how much 35 x 10 is, even though he used a routine not taught in 3rd grade, since in Brazil3rd graders learn to multiply any number by ten simply by placing a zero to the right of that number. (Carraher, Carraher and Scldiemam. 1983: 8-9)
The conversation with each child was taped. The transcripts were analyzed as a basis for ascertaining what problems should appear on individually constructed paper and pencil arithmetic tests. Each test included all and only the problems the child attem pted to solve in the market. The formal test was given about one week after the informal encounter in the market.
Herndon, a teacher who has written eloquently about American schooling, described (1971) his experiences teaching a junior high class whose students had failed in mainstream classrooms. He discovered that one of them had a well-paid, regular job scoring for a bowling league. The work demanded fast, accurate, complicated arithmetic. Further, all of his students engaged in relatively extensive arithmetic activities while shopping or in after-schooljobs. He tried to build a bridge between their practice of arithmetic outside the classroom and school arithmetic lessons by creating “bowling score problems,” “shopping problems,” and “paper route problems.” The attempt was a failure, the league scorer unable to solve even a simple bowling problem in the school setting. Herndon provides a vivid picture of the discontinuity, beginning with the task in the bowling alley:
… eight bowling scores at once. Adding quickly, not making any mistakes (for no one was going to put up with errors), following the rather complicated process of scoring in the game of bowling. Get a spare, score ten plus whatever you get on the next ball, score a strike, then ten plus whatever you get on the next two balls; imagine the man gets three strikes in a row and two spares and you are the scorer, plus you are dealing with seven other guys all striking or sparing or neither one. I figured I had this particular dumb kid now. Back in eighth period I lectured him on how smart he was to be a league scorer in bowling. I pried admissions from the other boys, about how they had paper routes and made change. I made the girls confess that when they went to buy stuff they didn’t have any difficulty deciding if those shoes cost $10.95 or whether it meant $109.50 or whether it meant $1.09 or how much change they’d get back from a twenty. Naturally I then handed out bowling-score problems, and naturally everyone could choose which ones they wanted to solve, and naturally the result was that all the dumb kids immediately rushed me yelling, “Is this right? I don’t know how to do it! What’s the answer? This ain’t right, is it?” and “What’s my grade?” The girls who bought shoes for $10.95 with a $20 bill came up with $400.15 for change and wanted to know if that was right? The brilliant league scorer couldn’t decide whether two strikes and a third frame of eight amounted to eighteen or twenty-eight or whether it was one hundred eight and a half. (Herndon 1971: 94-95)
People’s bowling scores, sales of coconuts, dairy orders and best buys in the supermarket were correct remarkably often; the performance of AMP participants in the market and simulation experiment has already been noted. Scribner comments that the dairy preloaders made virtually no errors in a simulation of their customary task, nor did dairy truck drivers make errors on simulated pricing of delivery tickets (Scribner and Fahrmeier 1982: to, 18). In the market in Recife, the vendors generated correct arithmetic results 99% of the time.
All of these studies show consistent discontinuities between individuals’ performances in work situations and in school-like testing ones. Herndon reports quite spectacular differences between math in the bowling alley and in a test simulating bowling score “problems.” The shoppers’ average score was in the high 50s on the math test. The market sellers in Recife averaged 74% on the pencil and paper test which had identical math problems to those each had solved in the market. The dairy loaders who did not make mistakes in the warehouse scored on average 64% on a formal arithmetic test.
Both Claude and Perplexity claimed that these results have been consistently replicated, e.g. Perplexity’s answer included a link to a Nature paper from February:
Children’s arithmetic skills do not transfer between applied and academic mathematics
Many children from low-income backgrounds worldwide fail to master school mathematics1; however, some children extensively use mental arithmetic outside school2,3. Here we surveyed children in Kolkata and Delhi, India, who work in markets (n = 1,436), to investigate whether maths skills acquired in real-world settings transfer to the classroom and vice versa. Nearly all these children used complex arithmetic calculations effectively at work. They were also proficient in solving hypothetical market maths problems and verbal maths problems that were anchored to concrete contexts. However, they were unable to solve arithmetic problems of equal or lesser complexity when presented in the abstract format typically used in school. The children’s performance in market maths problems was not explained by memorization, access to help, reduced stress with more familiar formats or high incentives for correct performance. By contrast, children with no market-selling experience (n = 471), enrolled in nearby schools, showed the opposite pattern. These children performed more accurately on simple abstract problems, but only 1% could correctly answer an applied market maths problem that more than one third of working children solved (β = 0.35, s.e.m. = 0.03; 95% confidence interval = 0.30–0.40, P < 0.001). School children used highly inefficient written calculations, could not combine different operations and arrived at answers too slowly to be useful in real-life or in higher maths. These findings highlight the importance of educational curricula that bridge the gap between intuitive and formal maths.
Scott Alexander had a nice piece on this:
When I was young I used to read pseudohistory books; Immanuel Velikovsky’s Ages in Chaos is a good example of the best this genre has to offer. I read it and it seemed so obviously correct, so perfect, that I could barely bring myself to bother to search out rebuttals.
And then I read the rebuttals, and they were so obviously correct, so devastating, that I couldn’t believe I had ever been so dumb as to believe Velikovsky.
And then I read the rebuttals to the rebuttals, and they were so obviously correct that I felt silly for ever doubting.
And so on for several more iterations, until the labyrinth of doubt seemed inescapable. What finally broke me out wasn’t so much the lucidity of the consensus view so much as starting to sample different crackpots. Some were almost as bright and rhetorically gifted as Velikovsky, all presented insurmountable evidence for their theories, and all had mutually exclusive ideas. After all, Noah’s Flood couldn’t have been a cultural memory both of the fall of Atlantis and of a change in the Earth’s orbit, let alone of a lost Ice Age civilization or of megatsunamis from a meteor strike. So given that at least some of those arguments are wrong and all seemed practically proven, I am obviously just gullible in the field of ancient history. Given a total lack of independent intellectual steering power and no desire to spend thirty years building an independent knowledge base of Near Eastern history, I choose to just accept the ideas of the prestigious people with professorships in Archaeology, rather than those of the universally reviled crackpots who write books about Venus being a comet.
You could consider this a form of epistemic learned helplessness, where I know any attempt to evaluate the arguments is just going to be a bad idea so I don’t even try. If you have a good argument that the Early Bronze Age worked completely differently from the way mainstream historians believe, I just don’t want to hear about it. If you insist on telling me anyway, I will nod, say that your argument makes complete sense, and then totally refuse to change my mind or admit even the slightest possibility that you might be right.
(This is the correct Bayesian action: if I know that a false argument sounds just as convincing as a true argument, argument convincingness provides no evidence either way. I should ignore it and stick with my prior.)
From the most recent ACX linkpost:
More new-ish AI policy substacks potentially worth your time:
You may remember Helen Toner from the OpenAI board drama, but she’s also an experienced and thoughtful scholar on AI policy and now has a Substack, Rising Tide. I especially appreciated Nonproliferation Is The Wrong Approach To AI Misuse.
You may remember Miles Brundage from OpenAI Safety Team Quitting Incident #25018 (or maybe 25019, I can’t remember). He’s got an AI policy Substack too, here’s a dialogue with Dean Ball.
You may remember Daniel Reeves from Beeminder, but he has an AI policy Substack too, AGI Fridays. Here’s his post on AI 2027.
If you’re at all familiar with AI policy you already know Dean Ball (Substack here), but congratulate him on being named White House senior policy advisor.
It seems like there are a bunch of people posting about AI policy on Substack, but these people don’t seem to be cross-posting on LW. Is it that LW doesn’t put much focus on AI policy, or is it that AI policy people are not putting much focus on LW?
I’m just saying that I think our awareness of the outside view should be relatively strong in this area, because the trail of past predictions about the limits of LLMs is strewn with an unusually large number of skulls.
Right, yeah. But you could also frame it the opposite way—“LLMs are just fancy search engines that are becoming bigger and bigger, but aren’t capable of producing genuinely novel reasoning” is a claim that’s been around for as long as LLMs have. You could also say that this is the prediction that has turned out to be consistently true with each released model, and that it’s the “okay sure GPT-27 seems to suffer from this too but surely these amazing benchmark scores from GPT-28 show that we finally have something that’s not just applying increasingly sophisticated templates” predictions that have consistently been falsified. (I have at least one acquaintance who has been regularly posting these kinds of criticisms of LLMs and how he has honestly tried getting them to work for purpose X or Y but they still keep exhibiting the same types of reasoning failures as ever.)
My argument is that it’s not even clear (at least to me) that it’s stopped for now. I’m unfortunately not aware of a great site that keeps benchmarks up to date with every new model, especially not ones that attempt to graph against estimated compute—but I’ve yet to see a numerical estimate that shows capabilities-per-OOM-compute slowing down.
Fair! To me OpenAI’s recent decision to stop offering GPT-4.5 on the API feels significant, but it could be a symptom of them having “lost the mandate of heaven”. Also I have no idea of how GPT-4.1 relates to this...
Great post, thanks!
Thanks! I appreciate the thoughtful approach in your comment, too.
I think your view is plausible, but that we should also be pretty uncertain.
Agree.
But also there’s also an interesting pattern that’s emerged where people point to something LLMs fail at and say that it clearly indicates that LLMs can’t get to AGI or beyond, and then are proven wrong by the next set of LLMs a few months later. Gary Marcus provides endless examples of this pattern (eg here, here). This outside view should make us cautious about making similar predictions.
I agree that it should make us cautious about making such predictions, and I think that there’s an important difference between the claim I’m making and the kinds of claims that Marcus has been making.
I think the Marcus-type prediction would be to say something like “LLMs will never be able to solve the sliding square puzzle, or track the location of an item a character is carrying, or correctly write young characters”. That would indeed be easy to disprove—as soon as something like that was formulated as a goal, it could be explicitly trained into the LLMs and then we’d have LLMs doing exactly that.
Whereas my claim is “yes you can definitely train LLMs to do all those things, but I expect that they will then nonetheless continue to show puzzling deficiencies in other important tasks that they haven’t been explicitly trained to do”.
I’m a lot less convinced than you seem to be that scaling has stopped bringing significant new benefits.
Yeah I don’t have any strong theoretical reason to expect that scaling should stay stopped. That part is based purely on the empirical observation that scaling seems to have stopped for now, but for all I know, benefits from scaling could just as well continue tomorrow.
To make it a bit more explicit:
If you are superintelligent in the bioweapon domain: seems pretty obvious why that wouldn’t let you take over the world. Sure maybe you can get all the humans killed, but unless automation also advances very substantially, this will leave nobody to maintain the infrastructure that you need to run.
Cybersecurity: if you just crash all the digital infrastructure, then similar. If you try to run some scheme where you extort humans to get what you want, expect humans to fight back, and then you are quickly in a very novel situation and the kind of a “world war” nobody has ever seen before.
Persuasion: depends on what we take the limits of persuasion to be. If it’s possible to completely take over the mind of anyone by speaking ten words to them then sure, you win. But if we look at humans, great persuaders often aren’t persuasive to everyone—rather they appeal very strongly to a segment of the population that happens to respond to a particular message while turning others off. (Trump, Eliezer, most politicians.) This strategy will get you part of the population while polarizing the rest against you and then you need more than persuasion ability to figure out how to get your faction to triumph.
If you want to run some galaxy-brained scheme where you give people inconsistent messages in order to appeal to all of them, you risk getting caught and need more than persuasion ability to make it work.
You can also be persuasive by being generally truthful and providing people with a lot of value and doing beneficial things. One can try to fake this by doing things that look beneficial but aren’t, but then you need more than persuasion ability to figure out what those would be.
Probably the best strategy would be to keep being genuinely helpful until people trust you enough to put you in a position of power and then betray that trust. I could imagine this working. But it would be a slow strategy as it would take time to build up that level of trust, and in the meanwhile many people would want to inspect your source code etc. to verify that you are trustworthy, and you’d need to ensure that doesn’t reveal anything suspicious.
Reading about the thousands of reviews raving about 4o’s brief ridiculous sycophancy now made me more pessimistic as well, though I think my model of why exactly still differs from yours.
To me the update from this news was not “wait, the Facebook example is relevant after all” but rather “wait, a more relevant example is all the human communities and subcultures that manage to run with crazy bad epistemics and motivated cognition”. So unlike Facebook that people hate but still use[1], sycophantic chatbots will be something that a surprisingly large fraction of the populace might genuinely think are good for them. And gain popularity based on very similar dynamics that cause groups and movements to get mind-killed in general. This implies that the normal mechanisms which make vices (including Facebook addiction) low-status may not be effective against them, any more than (say) educated intellectuals considering the anti-vaxx movement low-status has prevented its spread.
Or at least did at the time when we were originally having this conversation. “People stick with FB despite hating it” seems a lot less true in 2025 than it did in 2023, though of course there are still a lot of other things that people stick with despite feeling that those things are bad for them.