I’m trying out podcasting as a format for the ideas I share here and on the blog. Keen to hear if people think it translates well, or needs more tweaking—do you need to be more verbose in a spoken form to allow more time for absorption? Any ideas how to clearly describe payoff matrices in an audio format… tear it apart guys.
Which seems like an important idea in general, but especially in interpretability for safety.
Specifically, error taxonomy is a subset of action-by-consequence taxonomy, which is the main goal of interpretability for safety (as it allows us to act on the fact that the model will take actions with bad consequences).
I have found that when using Anki for words/language learning, I frequently can’t remember the correct translation exactly, but can guess the translation as one of top-3 options. In fact, this works well for me—even knowing vaguely what the word means is very useful.
Does anyone else use Anki with non-exact answers?
First up, I thought this line was strikingly poetic for a technical topic, and would be above-average quality even for a human technical writer.
So you’re absolutely right that it’s a different kind of environment — but it’s been engineered to work well for fungal metabolism. Nature grows molds on oranges; industry grows them in soup with bubbles and stainless steel.
I’m enjoying being curious about the world around me with the benefit of being able to ask an endlessly patient expert. Go ahead and ask your favorite LLM how the citric acid in your Gatorade is made.
In this post, I will share some observations I have made about the octonions which demonstrate that the machine learning algorithms I have been looking at recently behave mathematically, and that such machine learning algorithms seem to be highly interpretable. This good behavior is due in part to the mathematical nature of the octonions and in part to the compatibility between the octonions and the machine learning algorithm. To be specific, one should think of the octonions as encoding a mixed unitary quantum channel that is very close to the completely depolarizing channel; my machine learning algorithms work well with those sorts of quantum channels and similar objects.
Suppose that K is either the field of real numbers, complex numbers, or quaternions.
If A1,…,Ar ∈ Mm(K) and B1,…,Br ∈ Mn(K) are matrices, then define a superoperator Γ(A1,…,Ar;B1,…,Br) : Mm,n(K) → Mm,n(K) by setting

Γ(A1,…,Ar;B1,…,Br)(X) = A1XB1∗ + ⋯ + ArXBr∗

(so its domain and range are both Mm,n(K)), and define Φ(A1,…,Ar) = Γ(A1,…,Ar;A1,…,Ar). Define the L_2-spectral radius similarity ∥(A1,…,Ar) ≃ (B1,…,Br)∥2 by setting

∥(A1,…,Ar) ≃ (B1,…,Br)∥2 = ρ(Γ(A1,…,Ar;B1,…,Br)) / (ρ(Φ(A1,…,Ar))^(1/2) · ρ(Φ(B1,…,Br))^(1/2)),

where ρ denotes the spectral radius.
Recall that the octonions are the unique (up to isomorphism) 8-dimensional real inner product space V together with a bilinear binary operation ∗ such that ∥x∗y∥ = ∥x∥·∥y∥ and 1∗x = x∗1 = x for all x,y ∈ V.
Suppose that e1,…,e8 is an orthonormal basis for V. Define operators A1,…,A8 by setting Ajv = ej∗v for each j. Now, define operators B1,…,B64, up to reordering, by setting {B1,…,B64} = {Ai⊗Aj : i,j ∈ {1,…,8}}.
Let d be a positive integer. Then the goal is to find complex symmetric d×d matrices (X1,…,X64) where ∥(B1,…,B64) ≃ (X1,…,X64)∥2 is locally maximized. We achieve this goal through gradient ascent optimization. Since we are using gradient ascent, I consider this to be a machine learning algorithm, but the function mapping Bj to Xj is a linear transformation, so we are training linear models here (we can generalize this fitness function to one where we train non-linear models, but that takes a lot of work if we want the generalized fitness functions to still behave mathematically).
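For concreteness, here is a minimal numpy sketch of the fitness function described above (my own illustration, not the original code); the gradient ascent itself could be layered on top via autodiff or finite differences:

```python
import numpy as np

def superop_matrix(As, Bs):
    # Matrix of the superoperator X -> A_1 X B_1^* + ... + A_r X B_r^*
    # acting on vec(X) (column-major vec: vec(A X B^*) = (conj(B) kron A) vec(X)).
    return sum(np.kron(np.conj(B), A) for A, B in zip(As, Bs))

def spectral_radius(M):
    return np.max(np.abs(np.linalg.eigvals(M)))

def l2_spectral_radius_similarity(As, Bs):
    # rho(Gamma(A;B)) / (rho(Phi(A))^(1/2) * rho(Phi(B))^(1/2))
    num = spectral_radius(superop_matrix(As, Bs))
    den = np.sqrt(spectral_radius(superop_matrix(As, As)) *
                  spectral_radius(superop_matrix(Bs, Bs)))
    return num / den
```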
Experimental Observation: If 1 ≤ d ≤ 8, then we can easily find complex symmetric matrices (X1,…,X64) where ∥(B1,…,B64) ≃ (X1,…,X64)∥2 is locally maximized and where its square equals (2d+6)/64 = (d+3)/32.
If 7 ≤ d ≤ 16, then we can easily find complex symmetric matrices (X1,…,X64) where ∥(B1,…,B64) ≃ (X1,…,X64)∥2 is locally maximized and where its square equals (2d+4)/64 = (d+2)/32.
I was thinking about what I mean when I say that something is “wrong” in a moral sense. It’s frustrating and a little embarrassing that I don’t immediately have a clear answer to this.
My first thought was that I’m referring to doing something that is socially suboptimal in a utilitarian sense. Something you wouldn’t want to do from behind a veil of ignorance.
But I don’t think that fully captures it. Suppose you catch a cold, go to a coffee shop when you’re pre-symptomatic, and infect someone. I wouldn’t consider that to be wrong. It was unintentional. So I think intent matters. But it doesn’t have to be fully intentional either. Negligence can still be wrong.
So is it “impact + intent”, then? No, I don’t think so. I just bought a $5.25 coffee. I could have donated that money and fed however many starving families. From behind a veil of ignorance, I wouldn’t endorse the purchase. And yet I wouldn’t call it “wrong”.
This thought process has highlighted for me that I’m not quite sure where to draw the boundaries. And I think this is why people talk about “gesturing”. Like, “I’m trying to gesture at this idea”. I’m at a place where I can gesture at what I mean by “wrongness”. I can say that it is in this general area of thingspace, but can’t be more precise. The less precise your boundaries/clouds, the more of a gesture it is, I suppose. I’d like to see a (canonical) post on the topic of gesturing.
The terms right/wrong typically only apply to actions, while good/bad apply to any events.
There is a common distinction between right and merely permissible actions. Buying the coffee is intuitively permissible. While donating the money instead of buying the coffee would be good, it’s (intuitively) not morally required, as this seems like an excessive demand. One of the main criticisms of utilitarianism is that it is an overly demanding theory: it labels actions as wrong whenever their consequences are bad.
Basically every time a new model is released by a major lab, I hear from at least one person (not always the same person) that it’s a big step forward in programming capability/usefulness. And then David gives it a try, and it works qualitatively the same as everything else: great as a substitute for stack overflow, can do some transpilation if you don’t mind generating kinda crap code and needing to do a bunch of bug fixes, and somewhere between useless and actively harmful on anything even remotely complicated.
It would be nice if there were someone who tries out every new model’s coding capabilities shortly after they come out, reviews it, and gives reviews with a decent chance of actually matching David’s or my experience using the thing (90% of which will be “not much change”) rather than getting all excited every single damn time. But also, to be a useful signal, they still need to actually get excited when there’s an actually significant change. Anybody know of such a source?
EDIT-TO-ADD: David has a comment below with a couple examples of coding tasks.
While Carl Brown said (a few times) he doesn’t want to do more YouTube videos for every new disappointing AI release, so far he seems to be keeping tabs on them in the newsletter just fine—https://internetofbugs.beehiiv.com/
...I am quite confident that if anything actually started to work, he would comment on it, so even if he won’t say much about any future incremental improvements, it might be a good resource to subscribe to for getting better signal—if Carl gets enthusiastic about AI coding assistants, it will be worth paying attention.
Maybe you include this in “stack overflow substitute”, but the main thing I use LLMs for is to understand well known technical things. The workflow is: 1) I am interested in understanding something, e.g. how a multiplexed barrel bit shifter works. 2) I ask the LLM to explain the concept. 3) Based on the initial response I create separate conversation branches with questions I have (to save money and keep the context closer; I didn’t evaluate whether this actually makes the LLM better). 4) Once I think I have understood the concept or part of the concept, I explain it to GPT. (Really I do this all the time during the entire process.) 5) The LLM (hopefully) corrects me if I am wrong (it seems it detects mistakes more often than not).
The last part of the conversation can then look like this:
User: Ah, now I think I understand. You have a mechanism that shifts by a fixed amount (powers of 2). Each of these multiplexers can be set up such that the data flows through them, or through an ID wire, into the next multiplexer. Now by controlling whether the value flows through the multiplexer or the ID wire, we can “sum up the shifts”. Specifically if we have shifters of 1, 2, 4, …, 32, then we can achieve any shift (for 64-bit numbers) by activating the appropriate multiplexers. This design has not 64^64 gates, but instead 64^6 gates. 64 gates for each multiplexer.
AI: Exactly, you’ve got the concept! The design you described efficiently uses a series of multiplexers to perform shifts by fixed amounts that are powers of two (1, 2, 4, …, 32 for a 64-bit number). By selecting which multiplexers to activate, you can achieve any desired shift amount.
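For reference, a minimal software analogue of the shifter design discussed in that exchange (a sketch assuming a 64-bit logical left shift, not code from the conversation): each stage either passes the value through or shifts it by a fixed power of two.

```python
def barrel_shift_left(value, shift, width=64):
    # Build an arbitrary shift out of log2(width) fixed-shift stages
    # (1, 2, 4, ..., 32), selecting stages from the bits of `shift`,
    # the same way the multiplexer chain described above does in hardware.
    mask = (1 << width) - 1
    for k in range(width.bit_length() - 1):   # k = 0..5 for width 64
        if (shift >> k) & 1:
            value = (value << (1 << k)) & mask
    return value

assert barrel_shift_left(1, 37) == (1 << 37)  # 37 = 32 + 4 + 1
```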
I had probably ~200,000 words worth of conversation with LLMs, mainly in this format.
I am not sure what next leap you are talking about. But I intuit based on some observations that GPT-4o is much better for this than GPT-3 (you might talk about more recent “leaps”). (Didn’t test o1 extensively because it’s so expensive).
Have you tried to make a mistake in your understanding on purpose to test out whether it would correct you or agree with you even when you’d get it wrong?
(and if yes, was it “a few times” or “statistically significant” kinda test, please?)
Why don’t you run the test yourself? It seems very easy.
Yes, it does catch me saying wrong things quite often. It also quite often says things that are not correct; I correct it, and if I am right it usually agrees immediately.
Interesting—the first part of the response seems to suggest that it looked like I was trying to understand more about LLMs… Sorry for the confusion, I wanted to clarify an aspect of your workflow that was puzzling to me. I think I got all the info I was asking about, thanks!
FWIW, if the question was an expression of actual interest and not a snarky suggestion: my experience with chatbots has been positive for brainstorming, dictionary “search”, rubber ducking, and descriptions of common sense (or even niche) topics, but disappointing for anything that requires application of common sense. For programming, one- or few-liner autocomplete is fine for me—then it’s me doing the judgement; half of the suggestions are completely useless, half are fine, and the third half look fine at first before I realise I needed the second most obvious thing this time… but it can save time for the repeating part of almost-repeating stuff. For multi-file editing, I find it worse than useless—it feels like doing code review after a psychopath pretending to do programming (AFAICT all models can explain most stuff correctly and then write the wrong code anyway; I don’t find it useful when it tries to apologize later if I point it out, or pre-doubts itself in CoT for 7 paragraphs and then does it wrong anyway). I like to imagine it was trained on all code from GH PRs—both before and after the bug fix… or that it was bored, so it’s trying to insert drama into a novel about my stupid programming task, where the second chapter will be about heroic AGI firefighting the shit written by previous dumb LLMs...
I don’t use it to write code, or really anything. Rather I find it useful to converse with it. My experience is also that half is wrong and that it makes many dumb mistakes. But doing the conversation is still extremely valuable, because GPT often makes me aware of existing ideas that I don’t know. Also like you say it can get many things right, and then later get them wrong. That getting right part is what’s useful to me. The part where I tell it to write all my code is just not a thing I do. Usually I just have it write snippets, and it seems pretty good at that.
Overall I am like “Look, there are so many useful things that GPT tells me and helps me think about simply by having a conversation”. Then somebody else says “But look, it gets so many things wrong. Even quite basic things.” And I am like “Yes, but the useful things are still useful enough that overall it’s totally worth it.”
I do use LLMs for coding assistance every time I code now, and I have in fact noticed improvements in the coding abilities of the new models, but I basically endorse this. I mostly make small asks of the sort that sifting through docs or stack-overflow would normally answer. When I feel tempted to make big asks of the models, I end up spending more time trying to get the LLMs to get the bugs out than I’d have spent writing it all myself, and having the LLM produce code which is “close but not quite and possibly buggy and possibly subtly so” that I then have to understand and debug could maybe save time but I haven’t tried because it is more annoying than just doing it myself.
If someone has experience using LLMs to substantially accelerate things of a similar difficulty/flavor to transpilation of a high-level torch module into a functional JITable form in JAX which produces numerically close outputs, or implementation of a JAX/numpy based renderer of a traversable grid of lines borrowing only the window logic from, for example, pyglet (no GLSL calls, rasterize from scratch,) with consistent screen-space pixel width and fade-on-distance logic, I’d be interested in seeing how you do your thing. I’ve done both of these, with and without LLM help and I think leaning hard on the LLMs took me more time rather than less.
File I/O and other such ‘mundane’ boilerplate-y tasks work great right off the bat, but getting the details right on less common tasks still seems pretty hard to elicit from LLMs. (And breaking it down into pieces small enough for them to get it right is very time consuming and unpleasant.)
I find them quite useful despite being buggy. I spend about 40% of my time debugging model code, 50% writing my own code, and 10% prompting.
Having a planning discussion first with s3.6, and asking it to write code only after 5 or more exchanges works a lot better.
Also helpful is asking for lots of unit tests along the way to confirm things are working as you expect.
One thing I’ve noticed is that current models like Claude 3.5 Sonnet can now generate non-trivial 100-line programs like small games that work in one shot and don’t have any syntax or logical errors. I don’t think that was possible with earlier models like GPT-3.5.
My impression is that they are getting consistently better at coding tasks of a kind that would show up in the curriculum of an undergrad CS class, but much more slowly improving at nonstandard or technical tasks.
My guess is neither of you is very good at using them, and getting value out of them somewhat scales with skill.
Models can easily replace on the order of 50% of my coding work these days, and if I have any major task, my guess is I quite reliably get 20%-30% productivity improvements out of them. It does take time to figure out at which things they are good at, and how to prompt them.
Note this 50% likely only holds if you are using a mainstream language. For a non-mainstream language I have gotten responses that were really unbelievably bad. Things like “the name of this variable is wrong”, which literally could never be the problem (it was a valid identifier).
And similarly, if you are trying to encode novel concepts, it’s very different from gluing together libraries, or implementing standard well known tasks, which I would guess is what habryka is mostly doing (not that this is a bad thing to do).
I think you’re right, but I rarely hear this take. Probably because “good at both coding and LLMs” is a light tail end of the distribution, and most of the relative value of LLMs in code is located at the other, much heavier end of “not good at coding” or even “good at neither coding nor LLMs”.
(Speaking as someone who didn’t even code until LLMs made it trivially easy, I probably got more relative value than even you.)
Regarding coding in general, I basically only prompt-program these days. I only bother editing the actual code when I notice a persistent bug that the models are unable to fix after multiple iterations.
I don’t know jackshit about web development and have been making progress on a dashboard for alignment research with very little effort. Very easy to build new projects quickly. The difficulty comes when there is a lot of complexity in the code. It’s still valuable to understand how high-level things work, and which low-level things the model will fail to proactively implement.
Two guesses on what’s going on with your experiences:
You’re asking for code which involves uncommon mathematics/statistics. In this case, progress on scicodebench is probably relevant, and it indeed shows remarkably slow improvement. (There are many reasons for this; one relatively easy thing to try is to break down the task, forcing the model to write down the appropriate formal reasoning before coding anything. LMs are stubborn about not doing CoT for coding, even when it’s obviously appropriate IME.)
You are underspecifying your tasks (and maybe your questions are more niche than average), or otherwise prompting poorly, in a way which a human could handle but models are worse at. In this case sitting down with someone doing similar tasks but getting more use out of LMs would likely help.
We did end up doing a version of this test. A problem came up in the course of our work which we wanted an LLM to solve (specifically, refactoring some numerical code to be more memory efficient). We brought in Ray, and Ray eventually concluded that the LLM was indeed bad at this, and it indeed seemed like our day-to-day problems were apparently of a harder-for-LLMs sort than he typically ran into in his day-to-day.
A thing unclear from the interaction: it had seemed towards the end that “build a profile to figure out where the bottleneck is” was one of the steps towards figuring out the problem, and that the LLM was (or might have been) better at that part. And maybe models couldn’t solve your entire problem wholesale, but there was still potential skill in identifying factorable pieces that were better fits for models.
Interesting! Two yet more interesting versions of the test:
Someone who currently gets use from LLMs writing more memory-efficient code, though maybe this is kind of question-begging
Someone who currently gets use from LLMs, and also is pretty familiar with trying to improve the memory efficiency of their code (which maybe is Ray, idk)
Recent update from OpenAI about 4o sycophancy surely looks like Standard Misalignment Scenario #325:
Our early assessment is that each of these changes, which had looked beneficial individually, may have played a part in tipping the scales on sycophancy when combined.
<...>
One of the key problems with this launch was that our offline evaluations—especially those testing behavior—generally looked good. Similarly, the A/B tests seemed to indicate that the small number of users who tried the model liked it.
<...>
some expert testers had indicated that the model behavior “felt” slightly off.
<...>
We also didn’t have specific deployment evaluations tracking sycophancy.
<...>
In the end, we decided to launch the model due to the positive signals from the users who tried out the model.
Is every undesired behavior an AI system exhibits “misalignment”, regardless of the cause?
Concretely, let’s consider the following hypothetical incident report.
Hypothetical Incident Report: Interacting bugs and features in navigation app lead to 14 mile traffic jam
Background
We offer a GPS navigation app that provides real-time traffic updates and routing information based on user-contributed data. We recently released updates which made four significant changes:
Tweak routing algorithm to have a slightly stronger preference for routes with fewer turns
Update our traffic model to include collisions reported on social media and in the app
More aggressively route users away from places we predict there will be congestion based on our traffic model
Reduced the number of alternative routes shown to users to reduce clutter and cognitive load
Our internal evaluations based on historical and simulated traffic data looked good, and A/B tests with our users indicated that most users liked these changes individually.
A few users complained about the routes we suggested, but that happens on every update.
We had monitoring metrics for the total number of vehicles diverted by a single collision, and checks to ensure that the capacity of the road we were diverting users onto was sufficient to accommodate that many extra vehicles. However, we had no specific metrics monitoring the total expected extra traffic flow from all diversions combined.
Incident
On January 14, there was an icy section of road leading away from a major ski resort. There were 7 separate collisions within a 30 minute period on that section of road. Users were pushed to alternate routes to avoid these collisions. Over a 2 hour period, 5,000 vehicles were diverted onto a weather-affected county road with limited winter maintenance, leading to a 14 mile traffic jam and many subsequent breakdowns on that road, stranding hundreds of people in the snow overnight.
Root cause
The weather-affected county road was approximately 19 miles shorter than the next best route away from the ski resort, and so our system tried to divert vehicles onto that road until it was projected to be at capacity.
The county road was listed as having the capacity to carry 400 vehicles per hour
Each time the system diverted users to avoid the collisions, it would attribute that diversion to a specific one of those collisions. When a single segment of road had multiple collisions, the logic that attributed a diversion to a collision chose one in a way that depended on the origin and destination the user had selected. In this incident, attributions were spread almost uniformly across the 7 collisions (see the sketch after this list).
This led to each collision independently diverting 400 vehicles per hour onto the county road
The county road had no cell reception, and so our systems did not detect the traffic jam and continued to funnel users onto the county road
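For illustration, a minimal hypothetical sketch of the attribution flaw described above (the variable names and loop structure are invented; the 400 vehicles/hour capacity, 7 collisions, and roughly uniform attribution come from the report): each per-collision capacity check passes, yet the combined diversions far exceed what the road can carry.

```python
COUNTY_ROAD_CAPACITY = 400  # vehicles per hour (from the report)

def can_divert(diverted_by_collision, collision_id):
    # The check only counts traffic already attributed to this one collision.
    return diverted_by_collision[collision_id] < COUNTY_ROAD_CAPACITY

diverted_by_collision = {c: 0 for c in range(7)}   # 7 collisions on the icy segment
for vehicle in range(5000):                        # vehicles leaving the resort
    collision_id = vehicle % 7                     # attribution spread ~uniformly
    if can_divert(diverted_by_collision, collision_id):
        diverted_by_collision[collision_id] += 1   # routed onto the county road

print(sum(diverted_by_collision.values()))  # 2800: 7 x 400, every per-collision check "passed"
```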
Would you say that the traffic jam happened because our software system was “misaligned”?
It is plausible that future systems achieve superhuman capability; capable systems necessarily have instrumental goals; instrumental goals tend to converge; human preferences are unlikely to be preserved when other goals are heavily selected for unless intentionally preserved; we don’t know how to make AI systems encode any complex preference robustly.
I should note that having a direct argument doesn’t mean other arguments like statistical precedent, analogy to evolution, or even intuition aren’t useful. It is however good mental hygiene to track when you have short reasoning chains that don’t rely on getting analogies right, since analogies are hard[1].
Complete sidenote, but I find this link fascinating. I wrote ‘analogies are hard’ thinking there ought to be a Sequences post for that, not that there actually is one. The post I found is somehow all the more convincing for the point I was making, given how Yudkowsky messes up the discussion of neural networks. Were I the kind of person to write LessWrong posts rather than just imagine what they might be if I did, a better Analogies are hard would be one of the first.
Using technical terms that need to be looked up is not that clear an argument for most people. Here’s my preferred form for general distribution:
We are probably going to make AI entities smarter than us. If they want something different than we do, they will outsmart us somehow. They will get their way, so we won’t get ours.
This could be them wiping us out like we have done accidentally or deliberately to so many cultures and species; or it could be them just outcompeting us for every job and resource.
Nobody knows how to give AIs goals that match ours perfectly enough that we won’t be in competition. A lot of people who’ve studied this think it’s probably quite tricky.
There’s a bunch of different ways to be skeptical that this doesn’t cover, but neither does your more technical formulation. For instance, some optimists assume we just won’t make AI with goals; it will remain a tool. Then you need to explain why we’ll give it goals so it can do stuff for us, and why it would be easy for it to interpret those goals differently than we meant them. This is a complex discussion, so the only short form is “experts disagree, so it seems pretty dangerous to just push ahead without knowing”.
This seems rhetorically better, but I think it is implicitly relying on instrumental goals and it’s hiding that under intuitions about smartness and human competition. This will work for people who have good intuitions about that stuff, but won’t work for people who don’t see the necessity of goals and instrumental goals. I like Veedrac’s better in terms of exposing the underlying reasoning.
I think it’s really important to avoid making arguments that are too strong and fuzzy, like yours. Imagine a person reads your argument and now believes that intuitively smart AI entities will be dangerous, via outsmarting us etc. Then Claude 5 comes out and matches their intuition for a smart AI entity, but (let’s assume) still isn’t great at goal-directedness. Then after Claude 5 hasn’t done any damage for a while, they’ll conclude that the reasoning leading to dangerousness must be wrong. Maybe they’ll think that alignment actually turned out to be easy.
Something like this seems to have already happened to a bunch of people. E.g. I’ve heard someone at DeepMind say “Doesn’t constitutional AI solve alignment?”. Kat’s post here[1] seems to be basically the same error, in that Kat seems to have predicted more overt evilness from LLM agents and is surprised by the lack of that, and has thereby updated that maybe some part of alignment is actually easy. Possibly Turntrout is another example, although there’s more subtlety there. I think he’s correct that, given his beliefs about where capabilities come from, the argument for deceptive alignment (an instrumental goal) doesn’t go through.
In other words, your argument is too easily “falsified” by evidence that isn’t directly relevant to the real reason for being worried about AI. More precision is necessary to avoid this, and I think Veedrac’s summary mostly succeeds at that.
I think the original formulation has the same problem, but it’s a serious problem that needs to be addressed by any claim about AI danger.
I tried to address this by slipping in “AI entities”, which to me strongly implies agency. It’s agency that creates instrumental goals, while intelligence is more arguably related to agency and through it to instrumental goals. I think this phrasing isn’t adequate, based on your response, and I expect even less attention to the implications of “entities” from a general audience.
That concern was why I included the caveat about addressing agency. Now I think that probably has to be worked into the main claim. I’m not sure how to do that; one approach is making an analogy to humans along the lines of “we’re going to make AIs that are more like humans because we want AI that can do work for us… that includes following goals and solving problems along the way… ”
This thread helped inspire me to write the brief post Anthropomorphizing AI might be good, actually. That’s one strategy for evoking the intuition that AI will be highly goal-directed and agentic. I’ve tried a lot of different terms like “entities” and “minds” to evoke that intuition, but “human-like” might be the strongest even though it comes at a steep cost.
If we can clearly tie the argument for AGI x-risk to agency, I think it won’t have the same problem, because I think we’ll see instrumental convergence as soon as we deploy even semi-competent LLM agents. They’ll do unexpected stuff for both rational and irrational reasons.
I think the original formulation has the same problem. It starts with the claim
It is plausible that future systems achieve superhuman capability; capable systems necessarily have instrumental goals [...]
One could say “well LLMs are already superhuman at some stuff and they don’t seem to have instrumental goals”. And that will become more compelling as LLMs keep getting better in narrow domains.
Kat Woods’ tweet is an interesting case. I actually think her point is absolutely right as far as it goes, but it doesn’t go quite as far as she seems to think. I’m even tempted to engage on Twitter, a thing I’ve been warned to never do on pain of endless stupid arguments if you can’t ignore hecklers :) It’s addressing a different point than instrumental goals, but it’s also an important point. The specification problem is, I think, much improved by having LLMs as the base intelligence. But it’s not solved, because there’s not a clear “goal slot” in LLMs or LLM agents in which to insert that nice representation of what we want. I’ve written about these conflicting intuitions/conclusions in Cruxes of disagreement on alignment difficulty, largely by referencing the excellent Simplicia/Doomimir debates.
If we can clearly tie the argument for AGI x-risk to agency, I think it won’t have the same problem
Yeah agreed, and it’s really hard to get the implications right here without a long description. In my mind entities didn’t trigger any association with agents, but I can see how it would for others.
I broadly agree that many people would be better off anthropomorphising future AI systems more. I sometimes push for this in arguments, because in my mind many people have massively overanchored on the particular properties of current LLMs and LLM agents. I’m less a fan of your part of that post that involves accelerating anything.
One could say “well LLMs are already superhuman at some stuff and they don’t seem to have instrumental goals”. And that will become more compelling as LLMs keep getting better in narrow domains.
Yeah, but the line “capable systems necessarily have instrumental goals” helps clarify what you mean by “capable systems”. It must be some definition that (at least plausibly) implies instrumental goals.
Kat Woods’ tweet is an interesting case. I actually think her point is absolutely right as far as it goes
Huh I suspect that the disagreement about that tweet might come from dumb terminology fuzziness. I’m not really sure what she means by “the specification problem” when we’re in the context of generative models trained to imitate. It’s a problem that makes sense in a different context. But the central disagreement is that she thinks current observations (of “alignment behaviour” in particular) are very surprising, which just seems wrong. My response was this:
Mostly agreed. When suggesting even differential acceleration I should remember to put a big WE SHOULD SHUT IT ALL DOWN just to make sure it’s not taken out of context. And as I said there, I’m far from certain that even that differential acceleration would be useful.
I agree that Kat Woods is overestimating how optimistic we should be based on LLMs following directions well. I think re-litigating who said what when and what they’d predict is a big mistake since it is both beside the point and tends to strengthen tribal rivalries—which are arguably the largest source of human mistakes. There is an interesting, subtle issue there which I’ve written about in The (partial) fallacy of dumb superintelligence and Goals selected from learned knowledge: an alternative to RL alignment. There are potential ways to leverage LLM’s relatively rich (but imperfect) understanding into AGI that follows someone’s instructions. Creating a “goal slot” based on linguistic instructions is possible. But it’s all pretty complex and uncertain.
I think Robert Miles does excellent introductory videos for newer people, and I linked him in the HN post. My goal here was different, though, which was to give a short, affirmative argument made of only directly defensible high probability claims.
I like your spin on it, too, more than those given in the linked thread, but it’s still looser, and I think there’s value giving an argument where it’s harder to disagree with the conclusion without first disagreeing with a premise. Eg. ‘some optimists assume we just won’t make AI with goals’ directly contradicts ‘capable systems necessarily have instrumental goals’, but I’m not sure it directly contradicts a premise you used.
I donno, the systems we have seem pretty capable, and if they have instrumental goals they seem quite weak… so tossing in that claim seems like just asking for trouble. I do think that very capable systems almost need to have goals, but I have trouble making that argument even to alignment people and rationalists.
That’s just one example, but the fact that it goes awry immediately hints that the whole direction is a bad idea.
I think the argument for AI being quite-possibly dangerous is actually a lot stronger than the more abstract and technical argument usually used by rationalists. It doesn’t require any strong claims at all. People don’t need certainty to be quite alarmed, and for good reason.
Standard xrisk arguments generally don’t extrapolate down to systems that don’t solve tasks that require instrumental goals. I think it’s reasonable to say common LLMs don’t exhibit many instrumental goals, but they also can’t do long-horizon goal-directed problem solving.
Prosaic risks like biorisk evals often go further and ask: if we assume the AI systems aren’t themselves very capable at this task, can we still elicit dangerous behaviors from them ‘in the loop’? These are legitimate and interesting questions, but they are a different thing.
When people are skeptical about the concept of AGI being meaningful or having clear boundaries, it could sometimes be downstream of skepticism about very fast and impactful R&D done by AIs, such as software-only singularity or things like macroscopic biotech where compute buildout happens at a speed impossible for human industry. Such events are needed to serve as landmarks, anchoring a clear concept of AGI, otherwise the definition remains contentious.
So AI company CEOs who complain about AGI being too nebulous to define might already be expecting a scaling slowdown, with their strategy being primarily about the fight for the soul of the 2028-2030 market. When scaling is slow, it’ll become too difficult to gain a significant quality advantage sufficient to defeat the incumbents. So the decisive battle is happening now, with the rhetoric making it more palatable to push through the decisions to build the $140bn training systems of 2028.
This behavior doesn’t need to be at all related to expecting superintelligence, it makes sense as a consequence of not expecting superintelligence in the near future.
I think short timelines just don’t square with the way intelligence agencies are behaving. The NSA took Y2K more seriously than it currently seems to be taking near-term AGI. You can make the argument that intelligence agencies are less competent than they used to be, but I don’t buy that they aren’t at least extremely paranoid and moderately competent: that seems like their job.
Researchers at AGI labs seem to genuinely believe the hype they’re selling, a significant fraction of non-affiliated top-of-the-line DL researchers is inclined to believe them as well, and basically all competent well-informed people agree that the short-timelines position is not unreasonable to hold.
Dismissing short timelines based on NSA’s behavior requires assuming that they’re much more competent in the field of AI than everyone in the above list. After all, that’d require them to be strongly (and correctly) confident that all these superstar researchers above are incorrect.
While that’s not impossible, it seems highly unlikely to me. Much more likely that they’re significantly less competent, and accordingly dismissive.
As someone who thinks superintelligence could come in the near future, I basically agree with @snewman’s view that AIs have to automate the entire economy, or automate a sector that could then automate everything else very fast. But unfortunately for us, this basically gives us no good fire alarms for AGI, unless @Ege Erdil and @Matthew Barnett et al are right that takeoff is slow enough that most value comes from broad automation, and external use dominates internal use:
We should expect LLMs to get just as contaminated as Google search soon. Russia does it for ideological purposes, but I imagine that hundreds of companies already do it for commercial reasons. Why pay for advertisement, if you can generate thousands of pages promoting your products that will be used to train the next generation of LLMs?
Are ads externalities? In the sense of, imposed upon people who don’t get a say in the matter?
My initial reaction was roughly: “web/TV/magazine ads no, you can just not visit/watch/read. Billboards on the side of the road yes”. But you can also just not take that road. Like, if someone built a new road and put up a billboard there, and specifically said “I’m funding the cost of this road with the billboard, take it or leave it”, that doesn’t feel like an externality. Why is it different if they build a road and then later add a billboard?
But if we go that far, we can say “people can move away from polluted areas”! And that feels wrong.
Hm. So the reason to care about externalities is, the cost of them falls on people who don’t get to punish the parties causing the externalities. Like, a factory doesn’t care if I move away from a polluted area.
So a web ad, I go to the website to get content and I see ads. If the ads are unpleasant enough that they’re not worth the content, I don’t visit the website. The website wants me to visit (at least in part so they can show me ads), so they don’t want that outcome. The ad can’t bring the value I get from the website below 0 (or I’ll not go), and it can’t dissuade so many people that it overall provides negative value to the website (or they won’t show it). They can destroy value relative to the world with no ads on the website, but not relative to the world without this website. For, uh, certain definitions of “can’t”. (And: ads on websites that also want me there for reasons other than showing me ads, are probably less bad than ads on websites that only want me there to show me ads.)
For a billboard on the side of the road… well, who’s getting paid? Is it typically the person who built the road? If not, the incentives are fucked. Even if it is… like the website, the road is giving me value, and the ad still can’t bring the value it gives me below 0.
One difference: most of the value to me is getting me somewhere else, and the people and businesses at my destination probably value me being there, so if the value I get from the road goes to 0 and I stop taking it, that’s bad for other people too. Another difference: what’s the equivalent of “the world without this website”? Maybe if there’s no road there I could walk, but walking is now less pleasant than it would have been because of the road (and the billboard that isn’t narrowly targeted to drivers). Also feels significant that no one else can build a road there, and give me a choice of tolls or ads. The road itself is an externality.
I think my main conclusions here are
Billboards still feel like an externality in a way that web ads don’t
If I want to think about, like, idealized models of free-market transactions, roads are a really weird thing to model.
I wonder how efficient the ads actually are. And I realize that the ad companies have absolutely no incentive to tell the truth.
When you watch ads, you pay in two ways. First, by having your time and attention wasted, and possibly being annoyed a lot. This part is a pure loss for you; it doesn’t benefit the companies paying for ads in any way. Second (assuming that the ads really work as advertised), by sometimes, as a consequence of watching an ad, buying something you wouldn’t have bought otherwise—most likely, because you don’t actually need it.
The things you buy as a consequence of having watched the ad… in theory it could be something useful that you didn’t know about, so the ad really did you a service by telling you about it… yeah, that’s what people working for the advertising agencies will tell you (and you should trust them about as much as you trust people working for tobacco companies when they tell you about the benefits of smoking, or rather vaping these days). But we all know that most ads are not like that. Many of them are actually about things you already know about; it’s just ongoing work to keep a specific brand firmly implanted in your mind. Also, on the internet, many ads are outright scams. I would say that you are 10x or 100x more likely to see a scam ad than an ad for something that is really useful, that you didn’t know about before, and that comes at a reasonable price.
So, it seems to me that if we analyze the proposal “instead of money, you can pay by watching a few ads”, well, either the ads work—in which case, as a user, you save some money, but then lose a lot of time and attention and emotional energy, and ultimately, as a result of successful manipulation, spend money on something you otherwise wouldn’t buy, so perhaps paying for the service would have been cheaper, even from a strictly financial perspective—or the ads don’t work, and this is all just an elaborate ritual to scam companies out of their money (most of them don’t do A/B testing on the effectiveness of ads, so they have no way to know), and as a user, you have the bad luck that your time and attention need to be sacrificed to make this ritual work. Realistically, it is probably something in between.
*
That said, the argument in favor of the ad industry is that it works. Making people pay using their time and attention is technologically easier than making them pay using PayPal or Visa. The payment is proportional to time spent using the service, which is a great advantage: if you e.g. only watch one YouTube video per week, your suffering is effectively a micropayment.
Generally, paying for services makes more sense if you use one service a lot, and becomes unmanageable if you use hundred services, each of them only a little. Most companies probably wouldn’t agree that you only need to pay them a cent or two a month if you only use their services a little (such as one video or one article per week); and if everyone required a minimum monthly payment of one dollar, the costs would quickly add up if you had to pay literally for every website you visit online. (And that’s even ignoring the part about how half of those subscriptions would be extremely difficult to cancel.)
I think it would be nice if someone figured out a way for e.g. every person on the planet to pay a fixed monthly “internet fee”, with the money distributed to websites in proportion to how much time the person spends there. But this comes with all kinds of problems, starting with weird incentives (when you make your web pages load slower, you get paid more), how the size of the “internet fee” would be determined, etc.
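For concreteness, a minimal sketch of the proportional split being described (the site names and the $10 fee are made up):

```python
def distribute_internet_fee(fee, minutes_by_site):
    # Split one user's flat monthly fee across sites in proportion to
    # the time they spent on each site that month.
    total = sum(minutes_by_site.values())
    return {site: fee * minutes / total for site, minutes in minutes_by_site.items()}

print(distribute_internet_fee(10.0, {"news.example": 300, "video.example": 100}))
# {'news.example': 7.5, 'video.example': 2.5}
```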
Absence of evidence is the dark matter of inference. It’s invisible yet it’s paramount to good judgement.
It’s easy to judge X to be true if you see some evidence that could only come about if X were true. It’s a lot more subtle to judge X to be false when you do see some evidence that it’s true, but can also determine that lots of the evidence you would expect to see if it were true is missing.
In a formalized setting like an RCT this is not an issue, but when reasoning in the wild, this is the norm. I’m guessing this leads to a bias of too many false positives on any issue where you care to look deeply enough to find and cherry-pick the positive evidence.
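A toy Bayes calculation of the second case (numbers invented purely for illustration): suppose evidence E would turn up 80% of the time if X were true but only 20% of the time otherwise, and you start from a 50% prior on X. Then failing to find E is itself informative:

P(X | ¬E) = P(¬E | X)·P(X) / [P(¬E | X)·P(X) + P(¬E | ¬X)·P(¬X)] = (0.2 × 0.5) / (0.2 × 0.5 + 0.8 × 0.5) = 0.2.

The catch in the wild is noticing that ¬E happened at all, since nothing forces you to go looking for the evidence that is missing.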
EDIT: Correcting the opening sentence to say “absence of evidence” rather than the original “negative evidence”.
Yudkowsky’s worldview in favour of closed-source ASI rests on multiple shaky assumptions. One of these assumptions is that getting a 3-month to 3-year lead is a necessary and sufficient condition for alignment to be solved. Yudkowsky!2025 himself doesn’t believe alignment can be solved in 3 years.
Why does anybody on lesswrong want closed source ASI?
If I got $1M in funding, I’d use it towards some or all of the following projects.
The objective is to get secret information out of US ASI orgs (including classified information) and host it in countries outside the US. Hopefully someone else can use this info to influence US and world politics.
Black DAQ
whistleblower/spy guide
hacker guide
Grey DAQ
internet doxxing tool
drones/cctv outside offices/ datacentres
High attention
persuade Indian, Russian, Chinese journalists to run a SecureDrop-like system
digital journalism guide
OR run a journalist outlet outside the US myself, until I can persuade existing journalists to do better
All the DAQ will be aimed at leadership and employees of people involved in building ASI.
I’m not very optimistic about grey DAQ uncovering high-profile info, just that it could force ASI company employees to isolate further from the rest of society. I know many people have moral qualms about it but I don’t. I see it as more-or-less inevitable, and given that it is inevitable I’d rather have it work for everyone and against everyone, than let those with power alone decide who it gets used against.
More details
Whistleblower guide
Not bottlenecked
Will work on this
tldr, whistleblowers should focus on getting to Russia like Snowden did, instead of improving their opsec and hoping to stay anonymous
Hacker guide
Knowledge bottlenecked
I don’t know enough to offer them technical advice. Mostly I’ll offer moral support and maybe some legal advice
Internet doxxing tool
Weakly capital bottlenecked
I tried building this by doing embedding search and anomalous word counts on a Reddit extract of Common Crawl. This will likely work better as a two-pass system: first pass use PII, second pass do stylometrics.
I need capital for more servers, and maybe to purchase some PII datasets similar to whitepages/weleakinfo/snusbase. I need to check what price points these datasets tend to get sold at.
Drones/CCTV outside offices / datacentres
Capital bottleneck
Need capital for lawyers, and for setting up the cameras
This is legal in the US (not the UK) but I need to check legal precedents on this
High attention guide
Weakly capital bottlenecked. Attention bottlenecked.
Journalists-by-training suck both at opsec and at becoming popular on the internet.
Opsec means things like advertising a SecureDrop-like system or a Signal number.
Becoming popular means things like understanding heavy-tailed distribution of attention and importance of building a brand around your face and understanding what readers want to read.
Journalists-by-training are being replaced by YouTubers across the US, Europe, India and Russia, at least.
I’m unsure if I should be trying to teach existing journalists this stuff or just run a non-US journalist outlet myself. Having funding and a public brand will enable me to try both approaches.
Misc
If I had $1M I’d definitely select a lawyer with experience in international law.
If I had $1M I’d also spend at least $100/mo each on a security guard (as needed), a Chinese language teacher and a therapist.
I could fill this up with even more details if I had more time. Wanted to get a quick reaction.
I know most people on LW will be against this sort of plan for reasons I don’t have motivation to sit and critique right now (maybe go read my blog). I’m more interested in hearing from the handful of people who will be for it.
I think for a lot of societal change to happen, information needs to be public first. (Then it becomes common knowledge, then an alternate plan gets buy-in, then that becomes common knowledge and so on.)
A foreign adversary getting the info doesn’t mean it’s public, although it has increased the number of actors N who now have that piece of info in the world. Large N is not stable so eventually the info may end up public anyway.
Prediction: micropayments are finally going to actually take off this year and next, as AIs start using tools at scale which are too expensive to serve at unlimited volumes to non-ad-watching users free of charge, but are not valuable enough per invocation to justify the overhead of using credit card rails. Whichever of the big chat companies first has “the model can pay $0.001 on your behalf to use a highly useful tool the other companies’ models can’t use”, it’s going to add significant pressure for the other companies to start offering it too.
Saving mathematician Robert Ghrist’s tweet here for my own future reference re: AI x math:
workflow of the past 24 hours...
* start a convo w/GPT-o3 about math research idea [X]
* it gives 7 good potential ideas; pick one & ask to develop
* feed o3 output to gemini-2.5-pro; it finds errors & writes feedback
* paste feedback into o3 and say assess & respond
* paste response into gemini; it finds more problems
* iterate until convergence
* feed the consensus idea w/detailed report to grok-3
* grok finds gaping error, fixes by taking things in different direction (!!!)
* gemini agrees: big problems, now ameliorated
* output final consensus report
* paste into claude-3.7 and ask it to outline a paper
* approve outline; request latex following my style/notation conventions
* claude outputs 30 pages of dense latex, section by section, one-shot (!)
====
is this correct/watertight? (surely not) is this genuinely novel? (pretty sure yes) is this the future? (no, it’s the present)
====
everybody underestimates not only what is coming but what can currently be done w/existing tools.
Someone asked why split things between o3 and 2.5 Pro; Ghrist:
they have complementary strengths and each picks up on things that the other missed. it’s like running a GAN with gpt as generator and gemini as discriminator
As an aside, I’ve noticed that the math subreddit tends to be exceedingly negative on AI x math, in a way that seems ignorant of recent progress and weirdly defensive about it, while some of the top mathematicians seem to be pretty excited, like Terry Tao; cf. his most recent post A proof of concept tool to verify estimates:
Symbolic math software packages are highly developed for many mathematical tasks in areas such as algebra, calculus, and numerical analysis. However, to my knowledge we do not have similarly sophisticated tools for verifying asymptotic estimates – inequalities that are supposed to hold for arbitrarily large parameters, with constant losses. …
I have wished in the past (e.g., in this MathOverflow answer) for a tool that could automatically determine whether such an estimate was true or not (and provide a proof if true, or an asymptotic counterexample if false). In principle, simple inequalities of this form could be automatically resolved by brute force case splitting. … Any single such inequality is not too difficult to resolve by hand, but there are applications in which one needs to check a large number of such inequalities, or split into a large number of cases. … This is a task that seems extremely ripe for automation, particularly with modern technology.
Recently, I have been doing a lot more coding (in Python, mostly) than in the past, aided by the remarkable facility of large language models to generate initial code samples for many different tasks, or to autocomplete partially written code. For the most part, I have restricted myself to fairly simple coding tasks, such as computing and then plotting some mildly complicated mathematical functions, or doing some rudimentary data analysis on some dataset. But I decided to give myself the more challenging task of coding a verifier that could handle inequalities of the above form. After about four hours of coding, with frequent assistance from an LLM, I was able to produce a proof of concept tool for this, which can be found at this Github repository. …
[The above] is of course an extremely inelegant proof, but elegance is not the point here; rather, that it is automated. (See also this recent article of Heather Macbeth for how proof writing styles change in the presence of automated tools, such as formal proof assistants.)
I came up with an argument for alignment by default.
In the counterfactual mugging scenario, a rational agent gives the money, even though they never see themselves benefitting from it. Before the coin flip, the agent would want to self-modify to give the money to maximize the expected value, therefore the only reflectively stable option is to give the money.
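In the usual formulation (the conventional $100 payment and $10,000 counterfactual reward; the amounts aren’t specified above), the pre-flip calculation is just

E[commit to pay] = ½·(−$100) + ½·(+$10,000) = +$4,950 > E[refuse] = $0,

which is why the pre-flip agent would self-modify to pay.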
Now imagine instead of a coin flip, it’s being born as one of two people: Alice, who values not being murdered at 100 utils, and Bob, who values murdering Alice at 1 util. As with the counterfactual mugging, before you’re born, you’d rationally want to self-modify to not murder Alice, to maximize the expected value.
What you end up with is basically morality (or at least it is the only rational choice regardless of your morality), so we should expect sufficiently intelligent agents to act morally.
Counterfactual mugging is a mug’s game in the first place—that’s why it’s called a “mugging” and not a “surprising opportunity”. The agent doesn’t know that Omega actually flipped a coin, would have paid out counterfactually if the agent were the sort of person to pay in this scenario, would have flipped the coin at all in that case, etc. The agent can’t know these things, because the scenario specifies that they have no idea that Omega does any such thing or even that Omega existed before being approached. So a relevant rational decision-theoretic parameter is an estimate of how much such an agent would benefit, on average, if asked for money in such a manner.
A relevant prior is “it is known that there are a lot of scammers in the world who will say anything to extract cash vs zero known cases of trustworthy omniscient beings approaching people with such deals”. So the rational decision is “don’t pay” except in worlds where the agent does know that omniscient trustworthy beings vastly outnumber untrustworthy beings (whether omniscient or not), and those omniscient trustworthy beings are known to make these sorts of deals quite frequently.
Your argument is even worse. Even broad decision theories that cover counterfactual worlds such as FDT and UDT still answer the question “what decision benefits agents identical to Bob the most across these possible worlds, on average”. Bob does not benefit at all in a possible world in which Bob was Alice instead. That’s nonexistence, not utility.
I don’t know what the first part of your comment is trying to say. I agree that counterfactual mugging isn’t a thing that happens. That’s why it’s called a thought experiment.
I’m not quite sure what the last paragraph is trying to say either. It sounds somewhat similar to a counter-argument I came up with (which I think is pretty decisive), but I can’t be certain what you actually meant. In any case, there is the obvious counter-counter-argument that in the counterfactual mugging, the agent in the heads branch and the agent in the tails branch are not quite identical either: one has seen the coin land on heads and the other has seen the coin land on tails.
Regarding the first paragraph: every purported rational decision theory maps actions to expected values. In most decision theory thought experiments, the agent is assumed to know all the conditions of the scenario, and so they can be taken as absolute facts about the world leaving only the unknown random variables to feed into the decision-making process. In the Counterfactual Mugging, that is explicitly not true. The scenario states
you didn’t know about Omega’s little game until the coin was already tossed and the outcome of the toss was given to you
So it’s not enough to ask what a rational agent with full knowledge of the rest of the scenario should do. That’s irrelevant. We know it as omniscient outside observers, but the agent in question knows only what the mugger tells them. If they believe it then there is a reasonable argument that they should pay up, but there is nothing given in the scenario that makes it rational to believe the mugger. The prior evidence is massively against believing the mugger. Any decision theory that ignores this is broken.
Regarding the second paragraph: yes, indeed there is that additional argument against paying up and rationality does not preclude accepting that argument. Some people do in fact use exactly that argument even in this very much weaker case. It’s just a billion times stronger in the “Bob could have been Alice instead” case and makes rejecting the argument untenable.
The Meta-LessWrong Doomsday Argument (MLWDA) predicts long AI timelines and that we can relax:
LessWrong was founded in 2009 (16 years ago), and there have been 44 mentions of the ‘Doomsday argument’ prior to this one, and it is now 2025, at 2.75 mentions per year.
By the Doomsday argument, we medianly-expect mentions to stop after 44 additional mentions over 16 additional years, i.e. in 2041. (And our 95% CI on that 44 would then be +1 mention to +1,760 mentions, corresponding to late-2027 AD to 2665 AD.)
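(For concreteness, the median half of that estimate is just the usual Gott-style arithmetic, in which at the median the future roughly equals the past; a quick hypothetical back-of-the-envelope:)

```python
# Back-of-the-envelope for the median estimate above (Gott-style: at the median,
# the future number of mentions roughly equals the past number).
mentions_so_far, years_so_far = 44, 16
rate = mentions_so_far / years_so_far            # 2.75 mentions per year
median_additional_mentions = mentions_so_far
print(2025 + median_additional_mentions / rate)  # -> 2041.0
```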
By a curious coincidence, double-checking to see if really no one had made a meta-DA before, it turns out that Alexey Turchin has made a meta-DA as well about 7 years ago, calculating that
If we assume 1993 as the beginning of a large DA-Doomers reference class, and it is 2018 now (at the moment of writing this text), the age of the DA-Doomers class is 25 years. Then, with 50% probability, the reference class of DA-Doomers will disappear in 2043, according to Gott’s equation! Interestingly, the dates around 2030–2050 appear in many different predictions of the singularity or the end of the world (Korotayev 2018; Turchin & Denkenberger 2018b; Kurzweil 2006).
His estimate of 2043 is surprisingly close to 2041.
We offer no explanation as to why this numerical consilience of meta-DA calculations has happened; we attribute their success, as all else, to divine benevolence.
Regrettably, the 2041–2043 date range would seem to imply that it is unlikely we will obtain enough samples of the MLWDA in order to compute a Meta-Meta-LessWrong Doomsday Argument (MMLWDA) with non-vacuous confidence intervals, inasmuch as every mention of the MLWDA would be expected to contain a mention of the DA as well.
I’ve thought about the doomsday argument more than daily for the past 15 years, enough for me to go from “Why am I improbably young?” to “Oh, I guess I’m just a person who thinks about the doomsday argument a lot”
Fun “fact”: when a person thinks about the doomsday argument, they have a decent chance of being me.
This is an alarming point, as I find myself thinking about the DA today as well; I thought I was ‘gwern’, but it is possible I am ‘robo’ instead, if robo represents such a large fraction of LW-DA observer-moments. It would be bad to be mistaken about my identity like that. I should probably generate some random future dates and add them to my Google Calendar to check whether I am thinking about the DA that day and so have evidence I am actually robo instead.
I think taking into account the Meta-Meta-LessWrong Doomsday Argument (MMLWDA) reveals an even deeper truth: your calculation fails to account for the exponential memetic acceleration of doomsday-reference-self-reference.
You’ve correctly considered that before your post, there were 44 mentions in 16 years (2.75/year); however, now you’ve created the MLWDA argument—noticeably more meta than previous mentions. This meta-ness increase is quite likely to trigger cascading self-referential posts (including this one).
The correct formulation should incorporate the Meta-Meta-Carcinization Principle (MMCP): all online discourse eventually evolves into recursive self-reference at an accelerating rate. Given my understanding of historical precedent from similar rat and rat adjacent memes, I’d estimate approximately 12-15 direct meta-responses to your post within the next month alone, and see no reason to expect the exponential to turn sigmoid in timescales that render my below argument unlikely.
This actually implies a much sooner endpoint distribution—the discourse will become sufficiently meta by approximately November 2027 that it will collapse into a singularity of self-reference, rendering further mentions both impossible and unnecessary.
I’d estimate approximately 12-15 direct meta-responses to your post within the next month alone, and see no reason to expect the exponential to turn sigmoid in timescales that render my below argument unlikely.
However, you can’t use this argument because unlike the MLWDA, where I am arguably a random observer of LW DA instances (the thought was provoked by Michael Nielsen linking to Cosma Shalizi’s notes on Mesopotamia and me thinking that the temporal distances are much less impressive if you think of them in terms of ‘nth human to live’, which immediately reminded me of DA and made me wonder if anyone had done a ‘meta-DA’, and LW simply happened to be the most convenient corpus I knew of to accurately quantify ‘# of mentions’ as tools like Google Scholar or Google N-Grams have a lot of issues—I have otherwise never taken much of an interest in the DA and AFAIK there have been no major developments recently), you are in a temporally privileged position with the MMLWDA, inasmuch as you are the first responder to my MLWDA right now, directly building on it in a non-randomly-chosen-in-time fashion.
Thus, you have to appeal purely to non-DA grounds like making a parametric assumption or bringing in informative priors from ‘similar rat and rat adjacent memes’, and that’s not a proper MMLWDA. That’s just a regular old prediction.
Turchin actually notes this issue in his paper, in the context of, of course, the DA and why the inventor Brandon Carter could not make a Meta-DA (but he and I could):
The problem is that if I think that I am randomly chosen from all DA-Doomers, we get very strong version of DA, as ‘DA-Doomers’ appeared only recently and thus the end should be very soon, in just a few decades from now. The first member of the DA-Doomers reference class was Carter, in 1973, joined by just a few of his friends in the 1980s. (It was rumored that Carter recognized the importance of DA-doomers class and understood that he was first member of it – and thus felt that this “puts” world in danger, as if he was the first in the class, the class is likely to be very short. Anyway, his position was not actually random as he was the first discoverer of the DA).
I really liked @Sam Marks recent post on downstream applications as validation for interp techniques, and I’ve been feeling similarly after the (in my opinion) somewhat disappointing downstream performance of SAEs.
Motivated by this, I’ve written up about 50 weird language model results I found in the literature. I expect some of them to be familiar to most here (e.g. alignment faking, reward hacking) and some to be a bit more obscure (e.g. input space connectivity, fork tokens).
If our current interp techniques can help us understand these phenomena better, that’s great! Otherwise I hope that seeing where our current techniques fail might help us develop better techniques.
I’m also interested in taking a wide view of what counts as interp. When trying to understand some weird model behavior, if mech interp techniques aren’t as useful as linear probing, or even careful black box experiments, that seems important to know!
I’m trying out podcasting as a format for the ideas I share here and on the blog. Keen to hear if people think it translates well, or needs more tweaking—do you need to be more verbose in a spoken form to allow more time for absorption? Any ideas how to clearly describe payoff matrices in an audio format… tear it apart guys.
So I ran into this
https://www.youtube.com/watch?v=AF3XJT9YKpM
And I noticed a lot of talk about error taxonomy.
Which seems like an important idea in general, but especially in interpretability for safety.
Specifically, error taxonomy is a subset of action by consequence taxonomy, which is the main goal of interpretability for safety (as it allows us to act on the fact that the model will take actions with bad consequences).
I have found that when using Anki for words/language learning, I frequently can’t remember the correct translation exactly, but can guess the translation as one of top-3 options. In fact, this works well for me—even knowing vaguely what the word means is very useful.
Does anyone else use Anki with non-exact answers?
I’m gonna try making a thread of interesting LLM conversations, maybe some fun prompts, images, or techniques.
First up, I thought this line was strikingly poetic for a technical topic, and would be above-average quality even for a human technical writer.
Chat link. The second sentence is what struck me.
I’m enjoying being curious about the world around me with the benefit of being able to ask an endlessly patient expert. Go ahead and ask your favorite LLM how the citric acid in your gatorade is made.
In this post, I will share some observations I have made about the octonions which demonstrate that the machine learning algorithms I have been looking at recently behave mathematically, and that such machine learning algorithms seem to be highly interpretable. The good behavior of these machine learning algorithms is due in part to the mathematical nature of the octonions, and also to the compatibility between the octonions and the machine learning algorithm. To be specific, one should think of the octonions as encoding a mixed unitary quantum channel that looks very close to the completely depolarizing channel; my machine learning algorithms work well with those sorts of quantum channels and similar objects.
Suppose that K is either the field of real numbers, complex numbers, or quaternions.
If $A_1,\dots,A_r \in M_m(K)$ and $B_1,\dots,B_r \in M_n(K)$ are matrices, then define a superoperator $\Gamma(A_1,\dots,A_r;B_1,\dots,B_r) : M_{m,n}(K) \to M_{m,n}(K)$ by setting
$$\Gamma(A_1,\dots,A_r;B_1,\dots,B_r)(X) = A_1 X B_1^* + \dots + A_r X B_r^*,$$
and define $\Phi(A_1,\dots,A_r) = \Gamma(A_1,\dots,A_r;A_1,\dots,A_r)$. Define the $L_2$-spectral radius similarity $\|(A_1,\dots,A_r) \simeq (B_1,\dots,B_r)\|_2$ by setting
$$\|(A_1,\dots,A_r) \simeq (B_1,\dots,B_r)\|_2 = \frac{\rho(\Gamma(A_1,\dots,A_r;B_1,\dots,B_r))}{\rho(\Phi(A_1,\dots,A_r))^{1/2}\,\rho(\Phi(B_1,\dots,B_r))^{1/2}},$$
where $\rho$ denotes the spectral radius.
Recall that the octonions are the unique (up to isomorphism) 8-dimensional real inner product space $V$ together with a bilinear binary operation $*$ such that $\|x*y\| = \|x\| \cdot \|y\|$ and $1*x = x*1 = x$ for all $x, y \in V$.
Suppose that $e_1,\dots,e_8$ is an orthonormal basis for $V$. Define operators $(A_1,\dots,A_8)$ by setting $A_i v = e_i * v$. Now, define operators $(B_1,\dots,B_{64})$, up to reordering, by setting $\{B_1,\dots,B_{64}\} = \{A_i \otimes A_j : i,j \in \{1,\dots,8\}\}$.
Let $d$ be a positive integer. Then the goal is to find complex symmetric $d \times d$-matrices $(X_1,\dots,X_{64})$ where $\|(B_1,\dots,B_{64}) \simeq (X_1,\dots,X_{64})\|_2$ is locally maximized. We achieve this goal through gradient ascent optimization. Since we are using gradient ascent, I consider this to be a machine learning algorithm, but the function mapping $B_j$ to $X_j$ is a linear transformation, so we are training linear models here (we can generalize this fitness function to one where we train non-linear models, but that takes a lot of work if we want the generalized fitness functions to still behave mathematically).
Experimental Observation: If $1 \le d \le 8$, then we can easily find complex symmetric matrices $(X_1,\dots,X_{64})$ where $\|(B_1,\dots,B_{64}) \simeq (X_1,\dots,X_{64})\|_2$ is locally maximized and where $\|(B_1,\dots,B_{64}) \simeq (X_1,\dots,X_{64})\|_2^2 = (2d+6)/64 = (d+3)/32$.
If $7 \le d \le 16$, then we can easily find complex symmetric matrices $(X_1,\dots,X_{64})$ where $\|(B_1,\dots,B_{64}) \simeq (X_1,\dots,X_{64})\|_2$ is locally maximized and where $\|(B_1,\dots,B_{64}) \simeq (X_1,\dots,X_{64})\|_2^2 = (2d+4)/64 = (d+2)/32$.
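For readers who want to poke at this, here is a rough sketch (mine, not the author's code) of the construction and fitness function. It assumes a Cayley–Dickson octonion product (any valid multiplication table should work), represents the superoperators as sums of Kronecker products, and uses torch.linalg.eigvals for the spectral radii, which is differentiable away from repeated eigenvalues; treat it as an illustration of the setup rather than a faithful reproduction of the experiments.

```python
# Rough sketch (not the author's code) of gradient ascent on the L_2-spectral radius similarity.
import numpy as np
import torch

def cd_conj(x):
    # Cayley-Dickson conjugation: negate all but the first coordinate.
    y = -x.copy()
    y[0] = x[0]
    return y

def cd_mult(x, y):
    # Recursive Cayley-Dickson product; at length 8 this gives (one convention for) octonion multiplication.
    if len(x) == 1:
        return x * y
    h = len(x) // 2
    a, b, c, d = x[:h], x[h:], y[:h], y[h:]
    return np.concatenate([cd_mult(a, c) - cd_mult(cd_conj(d), b),
                           cd_mult(d, a) + cd_mult(b, cd_conj(c))])

# Left-multiplication operators A_i v = e_i * v, and the 64 operators B_k = A_i (x) A_j.
E = np.eye(8)
A_ops = [np.column_stack([cd_mult(E[i], E[j]) for j in range(8)]) for i in range(8)]
B_ops = torch.tensor(np.stack([np.kron(A_ops[i], A_ops[j])
                               for i in range(8) for j in range(8)]), dtype=torch.cfloat)

def spectral_radius(M):
    return torch.linalg.eigvals(M).abs().max()

def gamma_matrix(As, Bs):
    # Matrix of the superoperator X -> sum_k A_k X B_k^* acting on vec(X); its spectrum is what we need.
    return sum(torch.kron(B.conj(), A) for A, B in zip(As, Bs))

d = 8
with torch.no_grad():
    rho_phi_B = spectral_radius(gamma_matrix(B_ops, B_ops)).item()  # constant, computed once (slow)

M_re = torch.randn(64, d, d, requires_grad=True)
M_im = torch.randn(64, d, d, requires_grad=True)
opt = torch.optim.Adam([M_re, M_im], lr=1e-2)
for step in range(2000):
    M = torch.complex(M_re, M_im)
    X = 0.5 * (M + M.transpose(-1, -2))       # complex symmetric X_1, ..., X_64
    num = spectral_radius(gamma_matrix(B_ops, X))
    den = (rho_phi_B * spectral_radius(gamma_matrix(X, X))) ** 0.5
    similarity = num / den                    # the L_2-spectral radius similarity
    opt.zero_grad()
    (-similarity).backward()                  # gradient ascent on the similarity
    opt.step()

print(float(similarity) ** 2, (d + 3) / 32)   # compare the squared similarity with the reported local maximum
```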
I was thinking about what I mean when I say that something is “wrong” in a moral sense. It’s frustrating and a little embarrassing that I don’t immediately have a clear answer to this.
My first thought was that I’m referring to doing something that is socially suboptimal in a utilitarian sense. Something you wouldn’t want to do from behind a veil of ignorance.
But I don’t think that fully captures it. Suppose you catch a cold, go to a coffee shop when you’re pre-symptomatic, and infect someone. I wouldn’t consider that to be wrong. It was unintentional. So I think intent matters. But it doesn’t have to be fully intentional either. Negligence can still be wrong.
So is it “impact + intent”, then? No, I don’t think so. I just bought a $5.25 coffee. I could have donated that money and fed however many starving families. From behind a veil of ignorance, I wouldn’t endorse the purchase. And yet I wouldn’t call it “wrong”.
This thought process has highlighted for me that I’m not quite sure where to draw the boundaries. And I think this is why people talk about “gesturing”. Like, “I’m trying to gesture at this idea”. I’m at a place where I can gesture at what I mean by “wrongness”. I can say that it is in this general area of thingspace, but can’t be more precise. The less precise your boundaries/clouds, the more of a gesture it is, I suppose. I’d like to see a (canonical) post on the topic of gesturing.
In these situations I suppose there’s probably wisdom in replacing the symbol with the substance. Ditching the label, talking directly about the properties, talking less about the central node.
Remarks:
The terms right/wrong typically only apply to actions, while good/bad apply to any events.
There is a common distinction between right and merely permissible actions. Buying the coffee is intuitively permissible. While donating the money instead of buying the coffee would be good, it’s (intuitively) not morally required, as this seems like an excessive demand. One of the main criticisms of utilitarianism is that it is an overly demanding theory: it labels actions as wrong whenever their consequences are bad.
Basically every time a new model is released by a major lab, I hear from at least one person (not always the same person) that it’s a big step forward in programming capability/usefulness. And then David gives it a try, and it works qualitatively the same as everything else: great as a substitute for stack overflow, can do some transpilation if you don’t mind generating kinda crap code and needing to do a bunch of bug fixes, and somewhere between useless and actively harmful on anything even remotely complicated.
It would be nice if there were someone who tries out every new model’s coding capabilities shortly after they come out, reviews it, and gives reviews with a decent chance of actually matching David’s or my experience using the thing (90% of which will be “not much change”) rather than getting all excited every single damn time. But also, to be a useful signal, they still need to actually get excited when there’s an actually significant change. Anybody know of such a source?
EDIT-TO-ADD: David has a comment below with a couple examples of coding tasks.
While Carl Brown has said (a few times) that he doesn’t want to do more YouTube videos for every new disappointing AI release, so far he seems to be keeping tabs on them in the newsletter just fine—https://internetofbugs.beehiiv.com/
...I am quite confident that if anything actually started to work, he would comment on it, so even if he won’t say much about any future incremental improvements, it might be a good resource to subscribe to for getting better signal—if Carl gets enthusiastic about AI coding assistants, it will be worth paying attention.
Maybe you include this in “stack overflow substitute”, but the main thing I use LLMs for is to understand well known technical things. The workflow is: 1) I am interested in understanding something, e.g. how a multiplexed barrel bit shifter works. 2) I ask the LLM to explain the concept. 3) Based on the initial response I create separate conversation branches with questions I have (to save money and have the context be closer. Didn’t evaluate if this actually makes the LLM better.). 4) Once I think I understood the concept or part of the concept I explain it to GPT. (Really I do this all the time during the entire process.) 5) The LLM (hopefully) corrects me if I am wrong (it seems it detects mistakes more often than not).
The last part of the conversation can then look like this:
I had probably ~200,000 words worth of conversation with LLMs, mainly in this format.
I am not sure what next leap you are talking about. But I intuit based on some observations that GPT-4o is much better for this than GPT-3 (you might talk about more recent “leaps”). (Didn’t test o1 extensively because it’s so expensive).
Have you tried to make a mistake in your understanding on purpose to test out whether it would correct you or agree with you even when you’d get it wrong?
(and if yes, was it “a few times” or “statistically significant” kinda test, please?)
Why don’t you run the test yourself? It seems very easy.
Yes it does catch me when I am saying wrong things quite often. It also quite often says things that are not correct and I correct it, and if I am right it usually agrees immediately.
Interesting—the first part of the response seems to suggest that it looked like I was trying to understand more about LLMs… Sorry for the confusion, I wanted to clarify an aspect of your workflow that was puzzling to me. I think I got all the info for what I was asking about, thanks!
FWIW, if the question was an expression of actual interest and not a snarky suggestion, my experience with chatbots has been positive for brainstorming, dictionary “search”, rubber ducking, and descriptions of common sense (or even niche) topics, but disappointing for anything that requires application of common sense. For programming, one- or few-liner autocomplete is fine for me—then it’s me doing the judgement; half of the suggestions are completely useless, half are fine, and the third half look fine at first before I realise I needed the second most obvious thing this time… but it can save time for the repeating part of almost-repeating stuff. For multi-file editing, I find it worse than useless; it feels like doing code review after a psychopath pretending to do programming (AFAICT all models can explain most stuff correctly and then write the wrong code anyway… I don’t find it useful when it tries to apologize later if I point it out, or to pre-doubt itself in CoT in 7 paragraphs and then do it wrong anyway). I like to imagine it as if it was trained on all code from GH PRs—both before and after the bug fix… or as if it was bored, so it’s trying to insert drama into a novel about my stupid programming task, where the second chapter will be about heroic AGI firefighting the shit written by previous dumb LLMs...

I don’t use it to write code, or really anything. Rather I find it useful to converse with it. My experience is also that half is wrong and that it makes many dumb mistakes. But having the conversation is still extremely valuable, because GPT often makes me aware of existing ideas that I don’t know about. Also, like you say, it can get many things right and then later get them wrong. The getting-it-right part is what’s useful to me. The part where I tell it to write all my code is just not a thing I do. Usually I just have it write snippets, and it seems pretty good at that.
Overall I am like “Look, there are so many useful things that GPT tells me and helps me think about simply by having a conversation.” Then somebody else says “But look, it gets so many things wrong. Even quite basic things.” And I am like “Yes, but the useful things are useful enough that overall it’s totally worth it.”
Maybe for your use case try codex.
I do use LLMs for coding assistance every time I code now, and I have in fact noticed improvements in the coding abilities of the new models, but I basically endorse this. I mostly make small asks of the sort that sifting through docs or stack-overflow would normally answer. When I feel tempted to make big asks of the models, I end up spending more time trying to get the LLMs to get the bugs out than I’d have spent writing it all myself, and having the LLM produce code which is “close but not quite and possibly buggy and possibly subtly so” that I then have to understand and debug could maybe save time but I haven’t tried because it is more annoying than just doing it myself.
If someone has experience using LLMs to substantially accelerate things of a similar difficulty/flavor to transpilation of a high-level torch module into a functional JITable form in JAX which produces numerically close outputs, or implementation of a JAX/numpy based renderer of a traversable grid of lines borrowing only the window logic from, for example, pyglet (no GLSL calls, rasterize from scratch,) with consistent screen-space pixel width and fade-on-distance logic, I’d be interested in seeing how you do your thing. I’ve done both of these, with and without LLM help and I think leaning hard on the LLMs took me more time rather than less.
File I/O and other such ‘mundane’ boilerplate-y tasks work great right off the bat, but getting the details right on less common tasks still seems pretty hard to elicit from LLMs. (And breaking it down into pieces small enough for them to get it right is very time consuming and unpleasant.)
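For concreteness, here is a toy sketch (mine, not the commenter's code) of what "a functional JITable form in JAX" refers to above: the weights move out of the stateful torch module into an explicit params pytree, and the forward pass becomes a pure function that jax.jit can trace. The tasks described above are of course far hairier than this.

```python
# Toy illustration of the torch-module -> functional JAX pattern (hypothetical example).
import numpy as np
import torch
import jax
import jax.numpy as jnp

torch_mlp = torch.nn.Sequential(
    torch.nn.Linear(4, 8), torch.nn.Tanh(), torch.nn.Linear(8, 2))

# Pull the weights out of the stateful module into a plain pytree of arrays.
params = [(l.weight.detach().numpy(), l.bias.detach().numpy())
          for l in torch_mlp if isinstance(l, torch.nn.Linear)]

@jax.jit
def mlp_apply(params, x):
    # Pure function: all state comes in through the params pytree.
    (w1, b1), (w2, b2) = params
    h = jnp.tanh(x @ w1.T + b1)
    return h @ w2.T + b2

x = np.random.randn(3, 4).astype(np.float32)
torch_out = torch_mlp(torch.tensor(x)).detach().numpy()
jax_out = np.asarray(mlp_apply(params, x))
print(np.allclose(torch_out, jax_out, atol=1e-4))  # numerically close outputs, as described above
```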
I find them quite useful despite being buggy. I spend about 40% of my time debugging model code, 50% writing my own code, and 10% prompting. Having a planning discussion first with s3.6, and asking it to write code only after 5 or more exchanges works a lot better.
Also helpful is asking for lots of unit tests along the way to confirm things are working as you expect.
One thing I’ve noticed is that current models like Claude 3.5 Sonnet can now generate non-trivial 100-line programs like small games that work in one shot and don’t have any syntax or logical errors. I don’t think that was possible with earlier models like GPT-3.5.
My impression is that they are getting consistently better at coding tasks of a kind that would show up in the curriculum of an undergrad CS class, but much more slowly improving at nonstandard or technical tasks.
My guess is neither of you is very good at using them, and getting value out of them somewhat scales with skill.
Models can easily replace on the order of 50% of my coding work these days, and if I have any major task, my guess is I quite reliably get 20%-30% productivity improvements out of them. It does take time to figure out at which things they are good at, and how to prompt them.
Note this 50% likely only holds if you are using a mainstream language. For some non-mainstream languages I have gotten responses that were really unbelievably bad. Things like “the name of this variable is wrong”, which literally could never be the problem (it was a valid identifier).
And similarly, if you are trying to encode novel concepts, it’s very different from gluing together libraries, or implementing standard well known tasks, which I would guess is what habryka is mostly doing (not that this is a bad thing to do).
Sounds plausible. Is that 50% of coding work that the LLMs replace of a particular sort, and the other 50% a distinctly different sort?
I think you’re right, but I rarely hear this take. Probably because “good at both coding and LLMs” is a light tail end of the distribution, and most of the relative value of LLMs in code is located at the other, much heavier end of “not good at coding” or even “good at neither coding nor LLMs”.
(Speaking as someone who didn’t even code until LLMs made it trivially easy, I probably got more relative value than even you.)
I’d be down to do this. Specifically, I want to do this, but I want to see if the models are qualitatively better at alignment research tasks.
In general, what I’m seeing is that there is no big jump with o1 Pro. However, it is possibly getting closer to one-shotting a website based on a screenshot and some details about how the user likes their backend setup.
In the case of math, it might be a bigger jump (especially if you pair it well with Sonnet).
Regarding coding in general, I basically only prompt-program these days. I only bother editing the actual code when I notice a persistent bug that the models are unable to fix after multiple iterations.
I don’t know jackshit about web development and have been making progress on a dashboard for alignment research with very little effort. Very easy to build new projects quickly. The difficulty comes when there is a lot of complexity in the code. It’s still valuable to understand how things work at a high level, and which low-level things the model will fail to proactively implement.
Two guesses on what’s going on with your experiences:
You’re asking for code which involves uncommon mathematics/statistics. In this case, progress on scicodebench is probably relevant, and it indeed shows remarkably slow improvement. (Many reasons for this; one relatively easy thing to try is to break down the task, forcing the model to write down the appropriate formal reasoning before coding anything. LMs are stubborn about not doing CoT for coding, even when it’s obviously appropriate, IME.)
You are underspecifying your tasks (and maybe your questions are more niche than average), or otherwise prompting poorly, in a way which a human could handle but models are worse at. In this case sitting down with someone doing similar tasks but getting more use out of LMs would likely help.
I would contribute to a bounty for y’all to do this. I would like to know whether the slow progress is prompting-induced or not.
We did end up doing a version of this test. A problem came up in the course of our work which we wanted an LLM to solve (specifically, refactoring some numerical code to be more memory efficient). We brought in Ray, and Ray eventually concluded that the LLM was indeed bad at this, and it indeed seemed like our day-to-day problems were apparently of a harder-for-LLMs sort than he typically ran into in his day-to-day.
A thing unclear from the interaction: it had seemed towards the end that “build a profile to figure out where the bottleneck is” was one of the steps towards figuring out the problem, and that the LLM was (or might have been) better at that part. And maybe models couldn’t solve your entire problem wholesale, but there was still potential skill in identifying factorable pieces that were better fits for models.
Interesting! Two yet more interesting versions of the test:
Someone who currently gets use from LLMs writing more memory-efficient code, though maybe this is kind of question-begging
Someone who currently gets use from LLMs, and also is pretty familiar with trying to improve the memory efficiency of their code (which maybe is Ray, idk)
Recent update from OpenAI about 4o sycophancy surely looks like Standard Misalignment Scenario #325:
I don’t understand how this is an example of misalignment—are you suggesting that the model tried to be sycophantic only in deployment?
Is every undesired behavior an AI system exhibits “misalignment”, regardless of the cause?
Concretely, let’s consider the following hypothetical incident report.
Hypothetical Incident Report: Interacting bugs and features in navigation app lead to 14 mile traffic jam
Background
We offer a GPS navigation app that provides real-time traffic updates and routing information based on user-contributed data. We recently released updates which made four significant changes:
Tweaked the routing algorithm to have a slightly stronger preference for routes with fewer turns
Updated our traffic model to include collisions reported on social media and in the app
More aggressively routed users away from places where we predict there will be congestion based on our traffic model
Reduced the number of alternative routes shown to users to reduce clutter and cognitive load
Our internal evaluations based on historical and simulated traffic data looked good, and A/B tests with our users indicated that most users liked these changes individually.
A few users complained about the routes we suggested, but that happens on every update.
We had monitoring metrics for the total number of vehicles diverted by a single collision, and checks to ensure that the road capacity of the road we were diverting users onto was sufficient to accommodate that many extra vehicles. However, we had no specific metrics monitoring the total expected extra traffic flow from all diversions combined.
Incident
On January 14, there was an icy section of road leading away from a major ski resort. There were 7 separate collisions within a 30 minute period on that section of road. Users were pushed to alternate routes to avoid these collisions. Over a 2 hour period, 5,000 vehicles were diverted onto a weather-affected county road with limited winter maintenance, leading to a 14 mile traffic jam and many subsequent breakdowns on that road, stranding hundreds of people in the snow overnight.
Root cause
The county road had no cell reception, and so our systems did not detect the traffic jam and continued to funnel users onto the county road
The weather-affected county road was approximately 19 miles shorter than the next best route away from the ski resort, and so our system tried to divert vehicles onto that road until it was projected to be at capacity
The county road was listed as having the capacity to carry 400 vehicles per hour
Each time the system diverted users to avoid the collisions, it would attribute that diversion to a specific one of those collisions. When a single segment of road had multiple collisions, the logic that attributed a diversion to a collision chose in a way that depended on the origin and destination the user had selected. In this event, attributions were spread almost uniformly across the 7 collisions.
This led to each collision independently diverting 400 vehicles per hour onto the county road, as sketched below
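A minimal sketch of that interaction (hypothetical code, just restating the root cause above): every per-collision capacity check passes individually, while the aggregate constraint simply never exists.

```python
# Hypothetical sketch of the root cause above: capacity checks scoped per collision,
# with no check on the combined flow onto the same road.
ROAD_CAPACITY_PER_HOUR = 400
collisions = [f"collision_{i}" for i in range(7)]
diverted = {c: 0 for c in collisions}

def try_divert(vehicle_id, attributed_collision):
    # Bug: the check only looks at the single collision this diversion was attributed to,
    # so each of the 7 collisions can independently fill the county road to its rated capacity.
    if diverted[attributed_collision] < ROAD_CAPACITY_PER_HOUR:
        diverted[attributed_collision] += 1
        return True
    return False

# Attributions spread roughly uniformly across the collisions, as in the incident.
for v in range(5000):
    try_divert(v, collisions[v % len(collisions)])

print(sum(diverted.values()), "diversions accepted onto a road rated for",
      ROAD_CAPACITY_PER_HOUR, "vehicles per hour")   # -> 2800
```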
Would you say that the traffic jam happened because our software system was “misaligned”?
The emphasis here is not on properties of model behavior but on how developers relate to model testing/understanding.
So would you say that the hypothetical incident happened because our org had a poor alignment posture with regards to the software we were shipping?
I saw a recentish post challenging people to state a clear AI xrisk argument and was surprised at how poorly formed the arguments in the comments were despite the issues getting called out. So, if you’re like apparently most of LessWrong, here’s what I consider the primary reduced argument, copied with slight edits from an HN post I made a couple years ago:
I should note that having a direct argument doesn’t mean other arguments like statistical precedent, analogy to evolution, or even intuition aren’t useful. It is however good mental hygiene to track when you have short reasoning chains that don’t rely on getting analogies right, since analogies are hard[1].
Complete sidenote, but I find this link fascinating. I wrote ‘analogies are hard’ thinking there ought to be a Sequences post for that, not that there is. The post I found is somehow all the more convincing for the point I was making with how Yudkowsky messes up the discussion of neural networks. Were I the kind of person to write LessWrong posts rather than just imagine what they might be if I did, a better Analogies are hard would be one of the first.
Using technical terms that need to be looked up is not that clear an argument for most people. Here’s my preferred form for general distribution:
We are probably going to make AI entities smarter than us. If they want something different than we do, they will outsmart us somehow. They will get their way, so we won’t get ours.
This could be them wiping us out like we have done accidentally or deliberately to so many cultures and species; or it could be them just outcompeting us for every job and resource.
Nobody knows how to give AIs goals that match ours perfectly enough that we won’t be in competition. A lot of people who’ve studied this think it’s probably quite tricky.
There are a bunch of different ways to be skeptical that this doesn’t cover, but neither does your more technical formulation. For instance, some optimists assume we just won’t make AI with goals; it will remain a tool. Then you need to explain why we’ll give it goals so it can do stuff for us, and that it would be easy for it to interpret those goals differently than we meant them. This is a complex discussion, so the only short form is “experts disagree, so it seems pretty dangerous to just push ahead without knowing”.
This seems rhetorically better, but I think it is implicitly relying on instrumental goals and it’s hiding that under intuitions about smartness and human competition. This will work for people who have good intuitions about that stuff, but won’t work for people who don’t see the necessity of goals and instrumental goals. I like Veedrac’s better in terms of exposing the underlying reasoning.
I think it’s really important to avoid making arguments that are too strong and fuzzy, like yours. Imagine a person reads your argument and now believes that intuitively smart AI entities will be dangerous, via outsmarting us etc. Then Claude 5 comes out and matches their intuition for a smart AI entity, but (let’s assume) still isn’t great at goal-directedness. Then after Claude 5 hasn’t done any damage for a while, they’ll conclude that the reasoning leading to dangerousness must be wrong. Maybe they’ll think that alignment actually turned out to be easy.
Something like this seems to have already happened to a bunch of people. E.g. I’ve heard someone at DeepMind say “Doesn’t constitutional AI solve alignment?”. Kat’s post here[1] seems to be basically the same error, in that Kat seems to have predicted more overt evilness from LLM agents and is surprised by the lack of it, and has thereby updated that maybe some part of alignment is actually easy. Possibly TurnTrout is another example, although there’s more subtlety there. I think he’s correct that, given his beliefs about where capabilities come from, the argument for deceptive alignment (an instrumental goal) doesn’t go through.
In other words, your argument is too easily “falsified” by evidence that isn’t directly relevant to the real reason for being worried about AI. More precision is necessary to avoid this, and I think Veedrac’s summary mostly succeeds at that.
You make some good points.
I think the original formulation has the same problem, but it’s a serious problem that needs to be addressed by any claim about AI danger.
I tried to address this by slipping in “AI entities”, which to me strongly implies agency. It’s agency that creates instrumental goals, while intelligence is more arguably related to agency and through it to instrumental goals. I think this phrasing isn’t adequate based on your response, and I’d expect even less attention to the implications of “entities” from a general audience.
That concern was why I included the caveat about addressing agency. Now I think that probably has to be worked into the main claim. I’m not sure how to do that; one approach is making an analogy to humans along the lines of “we’re going to make AIs that are more like humans because we want AI that can do work for us… that includes following goals and solving problems along the way… ”
This thread helped inspire me to write the brief post Anthropomorphizing AI might be good, actually. That’s one strategy for evoking the intuition that AI will be highly goal-directed and agentic. I’ve tried a lot of different terms like “entities” and “minds” to evoke that intuition, but “human-like” might be the strongest even though it comes at a steep cost.
If we can clearly tie the argument for AGI x-risk to agency, I think it won’t have the same problem, because I think we’ll see instrumental convergence as soon as we deploy even semi-competent LLM agents. They’ll do unexpected stuff for both rational and irrational reasons.
I think the original formulation has the same problem. It starts with the claim
One could say “well LLMs are already superhuman at some stuff and they don’t seem to have instrumental goals”. And that will become more compelling as LLMs keep getting better in narrow domains.
Kat Woods’ tweet is an interesting case. I actually think her point is absolutely right as far as it goes, but it doesn’t go quite as far as she seems to think. I’m even tempted to engage on Twitter, a thing I’ve been warned to never do on pain of endless stupid arguments if you can’t ignore hecklers :) It’s addressing a different point than instrumental goals, but it’s also an important point. The specification problem is, I think, much improved by having LLMs as the base intelligence. But it’s not solved, because there’s not a clear “goal slot” in LLMs or LLM agents in which to insert that nice representation of what we want. I’ve written about these conflicting intuitions/conclusions in Cruxes of disagreement on alignment difficulty, largely by referencing the excellent Simplicia/Doomimir debates.
Yeah agreed, and it’s really hard to get the implications right here without a long description. In my mind entities didn’t trigger any association with agents, but I can see how it would for others.
I broadly agree that many people would be better off anthropomorphising future AI systems more. I sometimes push for this in arguments, because in my mind many people have massively overanchored on the particular properties of current LLMs and LLM agents. I’m less a fan of your part of that post that involves accelerating anything.
Yeah, but the line “capable systems necessarily have instrumental goals” helps clarify what you mean by “capable systems”. It must be some definition that (at least plausibly) implies instrumental goals.
Huh I suspect that the disagreement about that tweet might come from dumb terminology fuzziness. I’m not really sure what she means by “the specification problem” when we’re in the context of generative models trained to imitate. It’s a problem that makes sense in a different context. But the central disagreement is that she thinks current observations (of “alignment behaviour” in particular) are very surprising, which just seems wrong. My response was this:
Mostly agreed. When suggesting even differential acceleration I should remember to put a big WE SHOULD SHUT IT ALL DOWN just to make sure it’s not taken out of context. And as I said there, I’m far from certain that even that differential acceleration would be useful.
I agree that Kat Woods is overestimating how optimistic we should be based on LLMs following directions well. I think re-litigating who said what when and what they’d predict is a big mistake since it is both beside the point and tends to strengthen tribal rivalries—which are arguably the largest source of human mistakes. There is an interesting, subtle issue there which I’ve written about in The (partial) fallacy of dumb superintelligence and Goals selected from learned knowledge: an alternative to RL alignment. There are potential ways to leverage LLM’s relatively rich (but imperfect) understanding into AGI that follows someone’s instructions. Creating a “goal slot” based on linguistic instructions is possible. But it’s all pretty complex and uncertain.
I think Robert Miles does excellent introductory videos for newer people, and I linked him in the HN post. My goal here was different, though, which was to give a short, affirmative argument made of only directly defensible high probability claims.
I like your spin on it, too, more than those given in the linked thread, but it’s still looser, and I think there’s value giving an argument where it’s harder to disagree with the conclusion without first disagreeing with a premise. Eg. ‘some optimists assume we just won’t make AI with goals’ directly contradicts ‘capable systems necessarily have instrumental goals’, but I’m not sure it directly contradicts a premise you used.
I donno, the systems we have seem pretty capable, and if they have instrumental goals they seem quite weak… so tossing in that claim seems like just asking for trouble. I do think that very capable systems almost need to have goals, but I have trouble making that argument even to alignment people and rationalists.
That’s just one example, but the fact that it goes awry immediately hints that the whole direction is a bad idea.
I think the argument for AI being quite-possibly dangerous is actually a lot stronger than the more abstract and technical argument usually used by rationalists. It doesn’t require any strong claims at all. People don’t need certainty to be quite alarmed, and for good reason.
Standard xrisk arguments generally don’t extrapolate down to systems that don’t solve tasks that require instrumental goals. I think it’s reasonable to say common LLMs don’t exhibit many instrumental goals, but they also can’t do long-horizon goal-directed problem solving.
Prosaic risks like biorisk evals often go further and ask: if we assume the AI systems aren’t themselves very capable at this task, can we still elicit dangerous behaviors from them ‘in the loop’? These are legitimate and interesting questions, but they are a different thing.
When people are skeptical about the concept of AGI being meaningful or having clear boundaries, it could sometimes be downstream of skepticism about very fast and impactful R&D done by AIs, such as software-only singularity or things like macroscopic biotech where compute buildout happens at a speed impossible for human industry. Such events are needed to serve as landmarks, anchoring a clear concept of AGI, otherwise the definition remains contentious.
So AI company CEOs who complain about AGI being too nebulous to define might already be expecting a scaling slowdown, with their strategy being primarily about the fight for the soul of the 2028-2030 market. When scaling is slow, it’ll become too difficult to gain a significant quality advantage sufficient to defeat the incumbents. So the decisive battle is happening now, with the rhetoric making it more palatable to push through the decisions to build the $140bn training systems of 2028.
This behavior doesn’t need to be at all related to expecting superintelligence, it makes sense as a consequence of not expecting superintelligence in the near future.
I think short timelines just don’t square with the way intelligence agencies are behaving. The NSA took Y2K more seriously than it currently seems to be taking near-term AGI. You can make the argument that intelligence agencies are less competent than they used to be, but I don’t buy that they aren’t at least extremely paranoid and moderately competent: that seems like their job.
Researchers at AGI labs seem to genuinely believe the hype they’re selling, a significant fraction of non-affiliated top-of-the-line DL researchers is inclined to believe them as well, and basically all competent well-informed people agree that the short-timelines position is not unreasonable to hold.
Dismissing short timelines based on NSA’s behavior requires assuming that they’re much more competent in the field of AI than everyone in the above list. After all, that’d require them to be strongly (and correctly) confident that all these superstar researchers above are incorrect.
While that’s not impossible, it seems highly unlikely to me. Much more likely that they’re significantly less competent, and accordingly dismissive.
As someone who thinks superintelligence could come in the near future, I basically agree with @snewman’s view that AIs have to automate the entire economy, or automate a sector that could then automate everything else very fast, but unfortunately for us this basically gives us no good fire alarms for AGI unless @Ege Erdil and @Matthew Barnett et al are right that takeoff is slow enough that most value comes from broad automation, and external use dominates internal use:
https://amistrongeryet.substack.com/p/defining-agi
Months ago I suggested that you could manipulate the popular LLMs by mass publishing ideological text online. Well this has now been done by Russia.
We should expect LLMs to get just as contaminated as Google search soon. Russia does it for ideological purposes, but I imagine that hundreds of companies already do it for commercial reasons. Why pay for advertisement, if you can generate thousands of pages promoting your products that will be used to train the next generation of LLMs?
Are ads externalities? In the sense of, imposed upon people who don’t get a say in the matter?
My initial reaction was roughly: “web/TV/magazine ads no, you can just not visit/watch/read. Billboards on the side of the road yes”. But you can also just not take that road. Like, if someone built a new road and put up a billboard there, and specifically said “I’m funding the cost of this road with the billboard, take it or leave it”, that doesn’t feel like an externality. Why is it different if they build a road and then later add a billboard?
But if we go that far, we can say “people can move away from polluted areas”! And that feels wrong.
Hm. So the reason to care about externalities is, the cost of them falls on people who don’t get to punish the parties causing the externalities. Like, a factory doesn’t care if I move away from a polluted area.
So a web ad, I go to the website to get content and I see ads. If the ads are unpleasant enough that they’re not worth the content, I don’t visit the website. The website wants me to visit (at least in part so they can show me ads), so they don’t want that outcome. The ad can’t bring the value I get from the website below 0 (or I’ll not go), and it can’t dissuade so many people that it overall provides negative value to the website (or they won’t show it). They can destroy value relative to the world with no ads on the website, but not relative to the world without this website. For, uh, certain definitions of “can’t”. (And: ads on websites that also want me there for reasons other than showing me ads, are probably less bad than ads on websites that only want me there to show me ads.)
For a billboard on the side of the road… well, who’s getting paid? Is it typically the person who built the road? If not, the incentives are fucked. Even if it is… like the website the road is giving me value, and the ad still can’t bring the value it gives me below 0.
One difference: most of the value to me is getting me somewhere else, and the people and businesses at my destination probably value me being there, so if the value I get from the road goes to 0 and I stop taking it, that’s bad for other people too. Another difference: what’s the equivalent of “the world without this website”? Maybe if there’s no road there I could walk, but walking is now less pleasant than it would have been because of the road (and the billboard that isn’t narrowly targeted to drivers). Also feels significant that no one else can build a road there, and give me a choice of tolls or ads. The road itself is an externality.
I think my main conclusions here are
Billboards still feel like an externality in a way that web ads don’t
If I want to think about, like, idealized models of free-market transactions, roads are a really weird thing to model.
I wonder how efficient the ads actually are. And I realize that the ad companies have absolutely no incentive to tell the truth.
When you watch ads, you pay in two ways. First, by having your time and attention wasted, and possibly being annoyed a lot. This part is a pure loss for you; it doesn’t benefit the companies paying for ads in any way. Second (assuming that the ads really work as advertised), by sometimes, as a consequence of watching an ad, buying something you wouldn’t have bought otherwise—most likely, because you don’t actually need it.
The things you buy as a consequence of having watched the ad… in theory it could be something useful that you didn’t know about, so the ad really did you a service by telling you about it… yeah, that’s what people working for the advertising agencies will tell you (and you should trust them about as much as you trust people working for tobacco companies when they tell you about the benefits of smoking, or rather vaping these days). But we all know that most ads are not like that. Many of them are actually about things you already know about; it’s just ongoing work to keep a specific brand firmly implanted in your mind. Also, on the internet, many ads are outright scams. I would say that you are 10x or 100x more likely to see a scam ad than an ad for something that is really useful, that you didn’t know about before, and that comes at a reasonable price.
So, it seems to me that if we analyze the proposal “instead of money, you can pay by watching a few ads”, well, either the ads work—in which case, as a user, you save some money, but then lose a lot of time and attention and emotional energy… and ultimately, as a result of successful manipulation spend money on something you otherwise wouldn’t buy, so… perhaps paying for the service would have been cheaper, even from the strictly financial perspective—or maybe the ads don’t work, and this all is just an elaborate ritual to scam companies out of their money (most of them don’t do A/B testing on the effectiveness of ads, so they have no way to know), and as a user, you have the bad luck that your time and attention need to be sacrificed to make this ritual work. Realistically, it is probably something in between.
*
That said, the argument in favor of ad industry is that it works. Making people pay using their time and attention is technologically easier than making them pay using Paypal or Visa. The payment is proportional to time spent using the service, which is a great advantage: if you e.g. only watch one YouTube video per week, your suffering is effectively a micropayment.
Generally, paying for services makes more sense if you use one service a lot, and becomes unmanageable if you use hundred services, each of them only a little. Most companies probably wouldn’t agree that you only need to pay them a cent or two a month if you only use their services a little (such as one video or one article per week); and if everyone required a minimum monthly payment of one dollar, the costs would quickly add up if you had to pay literally for every website you visit online. (And that’s even ignoring the part about how half of those subscriptions would be extremely difficult to cancel.)
I think it would be nice if someone figured out a way for, e.g., every person on the planet to pay a fixed monthly “internet fee”, with the money distributed to websites proportionally to how much time the person spends there. But this comes with all kinds of problems, starting with weird incentives (when you make your web pages load slower, you get paid more), how the size of the “internet fee” would be determined, etc.
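A toy sketch of that proportional split (made-up numbers), which also makes the weird incentive visible: anything that inflates time-on-site inflates the payout.

```python
# Toy proportional split of a hypothetical monthly "internet fee" by time spent per site.
monthly_fee = 10.00
minutes_on_site = {"news_site": 120, "video_site": 600, "slow_loading_blog": 30}

total_minutes = sum(minutes_on_site.values())
payouts = {site: round(monthly_fee * t / total_minutes, 2)
           for site, t in minutes_on_site.items()}
print(payouts)  # a site that keeps you waiting longer accrues more minutes, hence a larger share
```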
Absence of evidence is the dark matter of inference. It’s invisible yet it’s paramount to good judgement.
It’s easy to judge X to be true if you see some evidence that could only come about if X were true. It’s a lot more subtle to judge X to be false when you do see some evidence that it’s true, but can also determine that there is lots of evidence you would expect to see if it were true, and that evidence is missing.
In a formalized setting like a RCT this is not an issue, but when reasoning in the wild, this is the norm. I’m guessing this leads to a bias of too many false positives on any issue where you care to look deeply enough to find and cherry pick the positive evidence.
EDIT: Correcting the opening sentence to say “absence of evidence” rather than the original “negative evidence”.
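A toy Bayes calculation of the kind of update I mean (numbers made up for illustration):

```python
# If X were true you'd expect to find evidence E 80% of the time; you looked and found none.
p_x = 0.5            # prior P(X)
p_e_given_x = 0.8    # P(find E | X true)
p_e_given_not_x = 0.1

posterior = ((1 - p_e_given_x) * p_x) / ((1 - p_e_given_x) * p_x + (1 - p_e_given_not_x) * (1 - p_x))
print(posterior)     # ~0.18: the missing evidence pushes you well below the 0.5 prior
```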
Ban on ASI > Open source ASI > Closed source ASI
This is my ordering.
Yudkowsky’s worldview in favour of closed source ASI rests on multiple shaky assumptions. One of these assumptions is that getting a 3-month to 3-year lead is a necessary and sufficient condition for alignment to be solved. Yudkowsky!2025 himself doesn’t believe alignment can be solved in 3 years.
Why does anybody on lesswrong want closed source ASI?
If I got $1M in funding, I’d use it towards some or all of the following projects.
The objective is to get secret information out of US ASI orgs (including classified information) and host it in countries outside the US. Hopefully someone else can use this info to influence US and world politics.
Black DAQ
whistleblower/spy guide
hacker guide
Grey DAQ
internet doxxing tool
drones/cctv outside offices/ datacentres
High attention
persuade Indian, Russian, Chinese journalists to run a SecureDrop-like system
digital journalism guide
OR run a journalist outlet outside the US myself, until I can persuade existing journalists to do better
All the DAQ will be aimed at leadership and employees of people involved in building ASI.
I’m not very optimistic on grey DAQ uncovering high-profile info, just that it could force the ASI company employees to isolate further from rest of society. I know many people have moral qualms about it but I don’t. I see it as more-or-less inevitable, and given that it is inevitable I’d rather have it work for everyone and against everyone, than let those with power alone decide who it gets used against.
More details
Whistleblower guide
Not bottlenecked
Will work on this
tldr, whistleblowers should focus on getting to russia like Snowden did, instead of improving their opsec and hoping to stay anonymous
Hacker guide
Knowledge bottlenecked
I don’t know enough to offer them technical advice. Mostly I’ll offer moral support and maybe some legal advice
Internet doxxing tool
Weakly capital bottlenecked
I tried building this by doing embedding search and anomalous word counts on reddit extract of commoncrawl. This will likely work better as a two pass system, first pass use PII, second pass do stylometrics.
I need capital for more servers, and maybe to purchase some PII datasets similar to whitepages/weleakinfo/snusbase. I need to check what price points these datasets tend to get sold at.
Drones/CCTV outside offices / datacentres
Capital bottleneck
Need capital for lawyers, and for setting up the cameras
This is legal in US (not UK) but I need to check legal precedents on this
High attention guide
Weakly capital bottlenecked. Attention bottlenecked.
Journalists-by-training suck both at opsec and at becoming popular on the internet.
Opsec means things like advertising a SecureDrop-like system or a Signal number.
Becoming popular means things like understanding heavy-tailed distribution of attention and importance of building a brand around your face and understanding what readers want to read.
Journalists-by-training are being replaced by YouTubers across the US, Europe, India and Russia at least.
I’m unsure if I should be trying to teach existing journalists this stuff or just run a non-US journalist outlet myself. Having funding and a public brand will enable me to try both approaches.
Misc
If I had $1M I’d definitely select a lawyer with experience in international law.
If I had $1M I’d also spend at least $100/mo each on a security guard (as needed), a Chinese language teacher and a therapist.
I could fill this up with even more details if I had more time. Wanted to get a quick reaction.
I know most people on LW will be against this sort of plan for reasons I don’t have motivation to sit and critique right now (maybe go read my blog). I’m more interested in hearing from the handful of people who will be for it.
Why do you want to do this as a lone person rather than e.g. directly working with the intelligence service of some foreign adversary?
I think for a lot of societal change to happen, information needs to be public first. (Then it becomes common knowledge, then an alternate plan gets buy-in, then that becomes common knowledge and so on.)
A foreign adversary getting the info doesn’t mean it’s public, although it has increased the number of actors N who now have that piece of info in the world. Large N is not stable so eventually the info may end up public anyway.
Prediction: micropayments are finally going to actually take off this year and next, as AIs start using tools at scale which are too expensive to serve at unlimited volumes to non-ad-watching users free of charge, but are not valuable enough per invocation to justify the overhead of using credit card rails. Whichever of the big chat companies first lets its model pay $0.001 on your behalf to use a highly useful tool the other companies’ models can’t use is going to add significant pressure for the other companies to start offering it too.
Saving mathematician Robert Ghrist’s tweet here for my own future reference re: AI x math:
Someone asked why split things between o3 and 2.5 Pro; Ghrist:
As an aside, I’ve noticed that the math subreddit tends to be exceedingly negative on AI x math in a way that seems ignorant of recent progress and weirdly defensive without being all that aware of it, while some of the top mathematicians seem to be pretty excited about it, like Terry Tao cf. his most recent post A proof of concept tool to verify estimates:
I came up with an argument for alignment by default.
In the counterfactual mugging scenario, a rational agent gives the money, even though they never see themselves benefitting from it. Before the coin flip, the agent would want to self-modify into someone who gives the money, to maximize expected value; therefore the only reflectively stable option is to give the money.
Now imagine that instead of a coin flip, it’s being born as one of two people: Alice, who values not being murdered at 100 utils, and Bob, who values murdering Alice at 1 util. As with the counterfactual mugging, before you’re born, you’d rationally want to self-modify to not murder Alice, to maximize expected value.
What you end up with is basically morality (or at least the same choice is rational regardless of your morality), so we should expect sufficiently intelligent agents to act morally.
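For concreteness, here is the expected-value bookkeeping the argument leans on, with the standard illustrative counterfactual-mugging stakes (pay $100 on tails, receive $10,000 on heads if you are the paying type) and an equal chance of being born as either person; these numbers are my own assumptions for the sketch, not part of the original argument.

```python
# Pre-commitment expected values for the two thought experiments above.
# Stakes are standard illustrative choices, not specified in the post.

# Counterfactual mugging: pay $100 on tails; receive $10,000 on heads iff you
# are the kind of agent who would have paid on tails.
ev_pay      = 0.5 * 10_000 + 0.5 * (-100)   # 4950.0
ev_dont_pay = 0.5 * 0      + 0.5 * 0        # 0.0

# Birth lottery: equal chance of being Alice (not being murdered is worth
# 100 utils to her) or Bob (murdering Alice is worth 1 util to him).
ev_no_murder_policy = 0.5 * 100 + 0.5 * 0   # 50.0
ev_murder_policy    = 0.5 * 0   + 0.5 * 1   # 0.5

print(ev_pay, ev_dont_pay)                    # 4950.0 0.0
print(ev_no_murder_policy, ev_murder_policy)  # 50.0 0.5
```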
Counterfactual mugging is a mug’s game in the first place—that’s why it’s called a “mugging” and not a “surprising opportunity”. The agent doesn’t know that Omega actually flipped a coin, would have paid out counterfactually if the agent were the sort of person to pay in this scenario, would have flipped the coin at all in that case, etc. The agent can’t know these things, because the scenario specifies that they have no idea that Omega does any such thing or even that Omega existed before being approached. So a relevant rational decision-theoretic parameter is an estimate of how much such an agent would benefit, on average, if asked for money in such a manner.
A relevant prior is “it is known that there are a lot of scammers in the world who will say anything to extract cash vs zero known cases of trustworthy omniscient beings approaching people with such deals”. So the rational decision is “don’t pay” except in worlds where the agent does know that omniscient trustworthy beings vastly outnumber untrustworthy beings (whether omniscient or not), and those omniscient trustworthy beings are known to make these sorts of deals quite frequently.
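To make the prior-dependence concrete, here is a back-of-the-envelope sketch (using the same illustrative $100/$10,000 stakes as above, and assuming a scammer simply pockets the $100 whenever you pay) of how much credence in the mugger being genuine a “pay when asked” policy needs before it breaks even:

```python
# Break-even credence that the mugger is genuine, for a "pay when asked" policy.
# If genuine: ex ante, 50% chance of paying $100 (tails), 50% chance of
# receiving $10,000 (heads, because you are the paying type).
# If a scammer: you are simply out $100 whenever you get asked and pay.
benefit_if_genuine = 0.5 * 10_000 - 0.5 * 100   # 4950.0
loss_if_scam       = 100

# Solve p * benefit_if_genuine - (1 - p) * loss_if_scam = 0 for p.
p_break_even = loss_if_scam / (benefit_if_genuine + loss_if_scam)
print(round(p_break_even, 4))   # 0.0198, i.e. roughly a 2% credence
```

Whether even that ~2% bar is met then comes down entirely to the prior over scammers versus genuine omniscient benefactors, which is the crux of the paragraph above.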
Your argument is even worse. Even broad decision theories that cover counterfactual worlds such as FDT and UDT still answer the question “what decision benefits agents identical to Bob the most across these possible worlds, on average”. Bob does not benefit at all in a possible world in which Bob was Alice instead. That’s nonexistence, not utility.
I don’t know what the first part of your comment is trying to say. I agree that counterfactual mugging isn’t a thing that happens. That’s why it’s called a thought experiment.
I’m not quite sure what the last paragraph is trying to say either. It sounds somewhat similar to a counter-argument I came up with (which I think is pretty decisive), but I can’t be certain what you actually meant. In any case, there is the obvious counter-counter-argument that in the counterfactual mugging, the agent in the heads branch and the agent in the tails branch are not quite identical either: one has seen the coin land heads and the other has seen it land tails.
Regarding the first paragraph: every purported rational decision theory maps actions to expected values. In most decision theory thought experiments, the agent is assumed to know all the conditions of the scenario, and so they can be taken as absolute facts about the world, leaving only the unknown random variables to feed into the decision-making process. In the Counterfactual Mugging, that is explicitly not true. The scenario states that the agent has no idea, before being approached, that Omega does any such thing or even that Omega exists.
So it’s not enough to ask what a rational agent with full knowledge of the rest of the scenario should do. That’s irrelevant. We know it as omniscient outside observers, but the agent in question knows only what the mugger tells them. If they believe it then there is a reasonable argument that they should pay up, but there is nothing given in the scenario that makes it rational to believe the mugger. The prior evidence is massively against believing the mugger. Any decision theory that ignores this is broken.
Regarding the second paragraph: yes, indeed there is that additional argument against paying up and rationality does not preclude accepting that argument. Some people do in fact use exactly that argument even in this very much weaker case. It’s just a billion times stronger in the “Bob could have been Alice instead” case and makes rejecting the argument untenable.
Am I correct in assuming you don’t think one should give the money in the counterfactual mugging?
The Meta-LessWrong Doomsday Argument (MLWDA) predicts long AI timelines and that we can relax:
LessWrong was founded in 2009 (16 years ago), it is now 2025, and there have been 44 mentions of the ‘Doomsday argument’ prior to this one: 2.75 mentions per year.
By the Doomsday argument, we medianly expect mentions to stop after 44 additional mentions over 16 additional years, i.e. in 2041. (And our 95% CI on that 44 would then be +1 mention to +1,760 mentions, corresponding to late-2027 AD to 2665 AD.)
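For reference, here is a minimal sketch of the Gott-style ‘Copernican’ arithmetic behind these figures; the median reproduces the 2041 estimate, while the exact 95% endpoints depend on the convention used for the interval, so they come out slightly different from the ones quoted above.

```python
# Gott-style Doomsday arithmetic: assume our vantage point is uniformly
# distributed over the total run of the phenomenon (here, LW mentions of the DA).
past_mentions = 44
past_years    = 16                           # 2009-2025
rate          = past_mentions / past_years   # 2.75 mentions per year

# Median: with probability 1/2 we are past the halfway point, so the future is
# no longer than the past -> ~44 more mentions over ~16 more years, i.e. ~2041.
median_end_year = 2025 + past_years

# 95% interval: future/past lies in [1/39, 39] under the uniform-position assumption.
low_extra_mentions  = past_mentions / 39      # ~1 more mention
high_extra_mentions = past_mentions * 39      # ~1716 more mentions
low_end_year  = 2025 + low_extra_mentions / rate
high_end_year = 2025 + high_extra_mentions / rate

print(median_end_year)                               # 2041
print(round(low_end_year, 1), round(high_end_year))  # 2025.4 2649
```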
By a curious coincidence, while double-checking to see whether really no one had made a meta-DA before, I found that Alexey Turchin made a meta-DA as well, about 7 years ago, calculating that:
His estimate of 2043 is surprisingly close to 2041.
We offer no explanation as to why this numerical consilience of meta-DA calculations has happened; we attribute their success, as all else, to divine benevolence.
Regrettably, the 2041–2043 date range would seem to imply that it is unlikely we will obtain enough samples of the MLWDA in order to compute a Meta-Meta-LessWrong Doomsday Argument (MMLWDA) with non-vacuous confidence intervals, inasmuch as every mention of the MLWDA would be expected to contain a mention of the DA as well.
I’ve thought about the doomsday argument more than daily for the past 15 years, enough for me to go from “Why am I improbably young?” to “Oh, I guess I’m just a person who thinks about the doomsday argument a lot”
Fun “fact”: when a person thinks about the doomsday argument, they have a decent chance of being me.
This is an alarming point, as I find myself thinking about the DA today as well; I thought I was ‘gwern’, but it is possible I am ‘robo’ instead, if robo represents such a large fraction of LW-DA observer-moments. It would be bad to be mistaken about my identity like that. I should probably generate some random future dates and add them to my Google Calendar to check whether I am thinking about the DA that day and so have evidence I am actually robo instead.
Nice example of taking inside view vs outside view seriously.
I think taking into account the Meta-Meta-LessWrong Doomsday Argument (MMLWDA) reveals an even deeper truth: your calculation fails to account for the exponential memetic acceleration of doomsday-reference-self-reference.
You’ve correctly considered that before your post, there were 44 mentions in 16 years (2.75/year); however, now you’ve created the MLWDA argument—noticeably more meta than previous mentions. This meta-ness increase is quite likely to trigger cascading self-referential posts (including this one).
The correct formulation should incorporate the Meta-Meta-Carcinization Principle (MMCP): all online discourse eventually evolves into recursive self-reference at an accelerating rate. Given my understanding of historical precedent from similar rat and rat-adjacent memes, I’d estimate approximately 12–15 direct meta-responses to your post within the next month alone, and I see no reason to expect the exponential to turn sigmoid on timescales that would render the argument below unlikely.
This actually implies a much sooner endpoint distribution—the discourse will become sufficiently meta by approximately November 2027 that it will collapse into a singularity of self-reference, rendering further mentions both impossible and unnecessary.
However, you can’t use this argument because unlike the MLWDA, where I am arguably a random observer of LW DA instances (the thought was provoked by Michael Nielsen linking to Cosma Shalizi’s notes on Mesopotamia and me thinking that the temporal distances are much less impressive if you think of them in terms of ‘nth human to live’, which immediately reminded me of DA and made me wonder if anyone had done a ‘meta-DA’, and LW simply happened to be the most convenient corpus I knew of to accurately quantify ‘# of mentions’ as tools like Google Scholar or Google N-Grams have a lot of issues—I have otherwise never taken much of an interest in the DA and AFAIK there have been no major developments recently), you are in a temporally privileged position with the MMLWDA, inasmuch as you are the first responder to my MLWDA right now, directly building on it in a non-randomly-chosen-in-time fashion.
Thus, you have to appeal purely to non-DA grounds like making a parametric assumption or bringing in informative priors from ‘similar rat and rat-adjacent memes’, and that’s not a proper MMLWDA. That’s just a regular old prediction.
Turchin actually notes this issue in his paper, in the context of, of course, the DA and why its inventor, Brandon Carter, could not make a Meta-DA (but he and I could):
I really liked @Sam Marks’ recent post on downstream applications as validation for interp techniques, and I’ve been feeling similarly after the (in my opinion) somewhat disappointing downstream performance of SAEs.
Motivated by this, I’ve written up about 50 weird language model results I found in the literature. I expect some of them to be familiar to most here (e.g. alignment faking, reward hacking) and some to be a bit more obscure (e.g. input space connectivity, fork tokens).
If our current interp techniques can help us understand these phenomena better, that’s great! Otherwise, I hope that seeing where our current techniques fail might help us develop better ones.
I’m also interested in taking a wide view of what counts as interp. When trying to understand some weird model behavior, if mech interp techniques aren’t as useful as linear probing, or even careful black box experiments, that seems important to know!
Here’s the doc: https://docs.google.com/spreadsheets/d/1yFAawnO9z0DtnRJDhRzDqJRNkCsIK_N3_pr_yCUGouI/edit?gid=0#gid=0
Thanks to @jake_mendel, @Senthooran Rajamanoharan, and @Neel Nanda for the conversations that convinced me to write this up.
Thanks for doing this— I found it really helpful.
Very happy you did this!
Really helpful work, thanks a lot for doing it