Interested in math puzzles, fermi estimation, strange facts about the world, toy models of weird scenarios, unusual social technologies, and deep dives into the details of random phenomena.
Working on the pretraining team at Anthropic as of October 2024; before that I did independent alignment research of various flavors and worked in quantitative finance.
Drake Thomas
The theoretical maximum FLOPS of an Earth-bound classical computer is something like .
Is this supposed to have a different base or exponent? A single H100 already gets like 10^15 FLOP/s.
So I would guess it should be possible to post-train an LLM to give answers like “................… Yes” instead of “Because 7! contains both 3 and 5 as factors, which multiply to 15. Yes”, and the LLM would still be able to take advantage of CoT
This doesn’t necessarily follow—on a standard transformer architecture, this will give you more parallel computation but no more serial computation than you had before. The bit where the LLM does N layers’ worth of serial thinking to say “3” and then that “3” token can be fed back into the start of N more layers’ worth of serial computation is not something that this strategy can replicate!
Empirically, if you look at figure 5 in Measuring Faithfulness in Chain-of-Thought Reasoning, adding filler tokens doesn’t really seem to help models get these questions right.
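To make the serial-vs-parallel point concrete, here's a toy calculation (a minimal sketch with made-up layer and token counts, not a claim about any particular model):

```python
# Toy illustration: an L-layer transformer performs at most ~L serial steps of
# computation within a single forward pass, no matter how many filler tokens
# sit in the context (those only add parallel width). Each *generated* token
# that gets fed back in buys another full pass, so k CoT tokens give roughly
# k * L serial steps.

def serial_depth(num_layers: int, generated_tokens: int) -> int:
    # Filler tokens already in the prompt don't appear in this count at all.
    return num_layers * max(generated_tokens, 1)

L = 96  # a GPT-3-scale layer count, chosen just for illustration
print(serial_depth(L, generated_tokens=1))    # filler-only answer: ~96 serial steps
print(serial_depth(L, generated_tokens=100))  # 100 CoT tokens fed back: ~9600
```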
I don’t think that’s true—in eg the GPT-3 architecture, and in all major open-weights transformer architectures afaik, the attention mechanism is able to feed lots of information from earlier tokens and “thoughts” of the model into later tokens’ residual streams in a non-token-based way. It’s totally possible for the models to do real introspection on their thoughts (with some caveats about eg computation that occurs in the last few layers), it’s just unclear to me whether in practice they perform a lot of it in a way that gets faithfully communicated to the user.
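For concreteness, here's a minimal single-head causal attention sketch in numpy (purely illustrative, not any particular production architecture): a later position's output is a mixture of earlier positions' residual-stream content, not just the tokens those positions emitted.

```python
import numpy as np

def causal_attention(x, Wq, Wk, Wv):
    """x: (seq_len, d_model) residual streams; returns one attention head's output."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Causal mask: position i may only attend to positions j <= i.
    scores[np.triu(np.ones_like(scores, dtype=bool), k=1)] = -np.inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V  # each row is a mixture of earlier positions' value vectors

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(5, d))  # five positions' residual streams
out = causal_attention(x, *(rng.normal(size=(d, d)) for _ in range(3)))
print(out.shape)  # (5, 8): position 4's output depends on streams 0 through 4
```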
Yeah, I’m thinking about this in terms of introspection on non-token-based “neuralese” thinking behind the outputs; I agree that if you conceptualize the LLM as being the entire process that outputs each user-visible token including potentially a lot of CoT-style reasoning that the model can see but the user can’t, and think of “introspection” as “ability to reflect on the non-user-visible process generating user-visible tokens” then models can definitely attain that, but I didn’t read the original post as referring to that sort of behavior.
In other words, they can think about the thoughts “behind” the previous words they wrote. If you doubt me on this, try asking one what its words are referring to, with reference to its previous words. Its “attention” modules are actually intentionally designed to know this sort of thing, using key/query/value lookups that occur “behind the scenes” of the text you actually see on screen.
I don’t think that asking an LLM what its words are referring to is a convincing demonstration that there’s real introspection going on in there, as opposed to “plausible confabulation from the tokens written so far”. I think it is plausible there’s some real introspection going on, but I don’t think this is a good test of it—the sort of thing I would find much more compelling is if the LLMs could reliably succeed at tasks like
Human: Please think of a secret word, and don’t tell me what it is yet.
LLM: OK!
Human: What’s the parity of the alphanumeric index of the penultimate letter in the word, where A=1, B=2, etc?
LLM: Odd.
Human: How many of the 26 letters in the alphabet occur multiple times in the word?
LLM: None of them.
Human: Does the word appear commonly in two-word phrases, and if so on which side?
LLM: It appears as the second word of a common two-word phrase, and as the first word of a different common two-word phrase.
Human: Does the word contain any other common words as substrings?
LLM: Yes; it contains two common words as substrings, and in fact is a concatenation of those two words.
Human: What sort of role in speech does the word occupy?
LLM: It’s a noun.
Human: Does the word have any common anagrams?
LLM: Nope.
Human: How many letters long is the closest synonym to this word?
LLM: Three.
Human: OK, tell me the word.
LLM: It was CARPET.

but couldn’t (even with some substantial effort at elicitation) infer hidden words from such clues without chain-of-thought when they weren’t the ones to think of them. That would suggest to me that there’s some pretty real reporting on a piece of hidden state not easily confabulated about after the fact.
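(As an aside, the mechanically checkable clues above really are consistent with CARPET; here's an illustrative Python sanity check, with "rug" assumed as the intended three-letter synonym:)

```python
word = "CARPET"

# Penultimate letter is E, alphanumeric index 5, which is odd.
assert (ord(word[-2]) - ord("A") + 1) % 2 == 1

# No letter of the alphabet occurs multiple times in the word.
assert all(word.count(c) == 1 for c in word)

# The word is a concatenation of two common words.
assert word == "CAR" + "PET"

# The closest synonym (assumed here to be "rug") is three letters long.
assert len("rug") == 3

print("All checkable clues are consistent with CARPET.")
```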
I think my original comment was ambiguous—I also consider myself to have mostly figured it out, in that I thought through these considerations pretty extensively before joining and am in a “monitoring for new considerations or evidence or events that might affect my assessment” state rather than a “just now orienting to the question” state. I’d expect to be most useful to people in shoes similar to my past self (deciding whether to apply or accept an offer) but am pretty happy to talk to anyone, including eg people who are confident I’m wrong and want to convince me otherwise.
See my reply to Ryan—I’m primarily interested in offering advice on something like that question since I think it’s where I have unusually helpful thoughts, I don’t mean to imply that this is the only question that matters in making these sorts of decisions! Feel free to message me if you have pitches for other projects you think would be better for the world.
Yeah, I agree that you should care about more than just the sign bit. I tend to think the magnitude of effects of such work is large enough that “positive sign” often is enough information to decide that it dominates many alternatives, though certainly not all of them. (I also have some kind of virtue-ethical sensitivity to the zero point of the impacts of my direct work, even if second-order effects like skill building or intra-lab influence might make things look robustly good from a consequentialist POV.)
The offer of the parent comment is more narrowly scoped, because I don’t think I’m especially well suited to evaluate someone else’s comparative advantages but do have helpful things to say on the tradeoffs of that particular career choice. Definitely don’t mean to suggest that people (including myself) should take on capability-focused roles iff they’re net good!
I did think a fair bit about comparative advantage and the space of alternatives when deciding to accept my offer; I’ve put much less work into exploration since then, arguably too much less (eg I suspect I don’t quite meet Raemon’s bar). Generally happy to get randomly pitched on things, I suppose!
I work on a capabilities team at Anthropic, and in the course of deciding to take this job I’ve spent[1] a while thinking about whether that’s good for the world and which kinds of observations could update me up or down about it. This is an open offer to chat with anyone else trying to figure out questions of working on capability-advancing work at a frontier lab! I can be reached at “graham’s number is big” sans spaces at gmail.
[1] and still spend—I’d like to have Joseph Rotblat’s virtue of noticing when one’s former reasoning for working on a project changes.
Drake Thomas’s Shortform
I agree it seems unlikely that we’ll see coordination on slowing down before one actor or coalition has a substantial enough lead over other actors that it can enforce such a slowdown unilaterally, but I think it’s reasonably likely that such a lead will arise before things get really insane.
A few different stories under which one might go from aligned “genius in a datacenter” level AI at time t to outcomes merely at the level of weirdness in this essay at t + 5-10y:
- The techniques that work to align “genius in a datacenter” level AI don’t scale to wildly superhuman intelligence (eg because they lose some value fidelity from human-generated oversight signals that’s tolerable at one remove but very risky at ten). The alignment problem for serious ASI is quite hard to solve at the mildly superintelligent level, and it genuinely takes a while to work out enough that we can scale up (since the existing AIs, being aligned, won’t design unaligned successors).
- If people ask their only-somewhat-superhuman AI what to do next, the AIs say “A bunch of the decisions from this point on hinge on pretty subtle philosophical questions, and frankly it doesn’t seem like you guys have figured all this out super well, have you heard of this thing called a long reflection?” That’s what I’d say if I were a million copies of me in a datacenter advising a 2024-era US government on what to do about Dyson swarms!
- A leading actor uses their AI to ensure continued strategic dominance and prevent competing AI projects from posing a meaningful threat. Having done so, they just… don’t really want crazy things to happen really fast, because the actor in question is mostly composed of random politicians or whatever. (I’m personally sympathetic to astronomical waste arguments, but it’s not clear to me that people likely to end up with the levers of power here are.)
- The serial iteration times and experimentation loops are just kinda slow and annoying, and mildly-superhuman AI isn’t enough to circumvent experimentation time bottlenecks (some of which end up being relatively slow), and there are stupid zoning restrictions on the land you want to use for datacenters, and some regulation adds lots of mandatory human overhead to some critical iteration loop, etc.
  - This isn’t a claim that maximal-intelligence-per-cubic-meter ASI initialized in one datacenter would face long delays in making efficient use of its lightcone, just that it might be tough for a not-that-much-better-than-human AGI that’s aligned and trying to respect existing regulations and so on to scale itself all that rapidly.
- Among the tech unlocked in relatively early-stage AGI is better coordination, and that helps Earth get out of unsavory race dynamics and decide to slow down.
- The alignment tax at the superhuman level is pretty steep, and doing self-improvement while preserving alignment goes much slower than unrestricted self-improvement would; since at this point we have many fewer ongoing moral catastrophes (eg everyone who wants to be cryopreserved is, we’ve transitioned to excellent cheap lab-grown meat), there’s little cost to proceeding very cautiously.
  - This is sort of a continuous version of the first bullet point with a finite rather than infinite alignment tax.
All that said, upon reflection I think I was probably lowballing the odds of crazy stuff on the 10y timescale, and I’d go to more like 50-60% that we’re seeing mind uploads and Kardashev level 1.5-2 civilizations etc. a decade out from the first powerful AIs.
I do think it’s fair to call out the essay for not highlighting the ways in which it might be lowballing things or rolling in an assumption of deliberate slowdown; I’d rather it have given more of a nod to these considerations and made the conditions of its prediction clearer.
(I work at Anthropic.) My read of the “touch grass” comment is informed a lot by the very next sentences in the essay:
But more importantly, tame is good from a societal perspective. I think there’s only so much change people can handle at once, and the pace I’m describing is probably close to the limits of what society can absorb without extreme turbulence.
which I read as saying something like “It’s plausible that things could go much faster than this, but as a prediction about what will actually happen, humanity as a whole probably doesn’t want things to get incredibly crazy so fast, and so we’re likely to see something tamer.” I basically agree with that.
Do Anthropic employees who think less tame outcomes are plausible believe Dario when he says they should “touch grass”?
FWIW, I don’t read the footnote as saying “if you think crazier stuff is possible, touch grass”—I read it as saying “if you think the stuff in this essay is ‘tame’, touch grass”. The stuff in this essay is in fact pretty wild!
That said, I think I have historically underrated questions of how fast things will go given realistic human preferences about the pace of change, and that I might well have updated more in the above direction if I’d chatted with ordinary people about what they want out of the future, so “I needed to touch grass” isn’t a terrible summary. But IMO believing “really crazy scenarios are plausible on short timescales and likely on long timescales” is basically the correct opinion, and to the extent the essay can be read as casting shade on such views it’s wrong to do so. I would have worded this bit of the essay differently.
Re: honesty and signaling, I think it’s true that this essay’s intended audience is not really the crowd that’s already gamed out Mercury disassembly timelines, and its focus is on getting people up to shock level 2 or so rather than SL4, but as far as I know everything in it is an honest reflection of what Dario believes. (I don’t claim any special insight into Dario’s opinions here, just asserting that nothing I’ve seen internally feels in tension with this essay.) Like, it isn’t going out of its way to talk about the crazy stuff, but I don’t read that omission as dishonest.
For my own part:
- I think it’s likely that we’ll get nanotech, von Neumann probes, Dyson spheres, computronium planets, acausal trade, etc in the event of aligned AGI.
- Whether that stuff happens within the 5-10y timeframe of the essay is much less obvious to me—I’d put it around 30-40% odds conditional on powerful AI from roughly the current paradigm, maybe?
- In the other 60-70% of worlds, I think this essay does a fairly good job of describing my 80th percentile expectations (by quality-of-outcome rather than by amount-of-progress).
- I would guess that I’m somewhat more Dyson-sphere-pilled than Dario.
I’d be pretty excited to see competing forecasts for what good futures might look like! I found this essay helpful for getting more concrete about my own expectations, and many of my beliefs about good futures look like “X is probably physically possible; X is probably usable-for-good by a powerful civilization; therefore probably we’ll see some X” rather than having any kind of clear narrative about how the path to that point looks.
I’ve fantasized about a good version of this feature for math textbooks since college—would be excited to beta test or provide feedback about any such things that get explored! (I have a couple math-heavy posts I’d be down to try annotating in this way.)
(I work on capabilities at Anthropic.) Speaking for myself, I think of international race dynamics as a substantial reason that trying for global pause advocacy in 2024 isn’t likely to be very useful (and this article updates me a bit towards hope on that front), but I think US/China considerations get less than 10% of the Shapley value in me deciding that working at Anthropic would probably decrease existential risk on net (at least, at the scale of “China totally disregards AI risk” vs “China is kinda moderately into AI risk but somewhat less than the US”—if the world looked like China taking it really really seriously, eg independently advocating for global pause treaties with teeth on the basis of x-risk in 2024, then I’d have to reassess a bunch of things about my model of the world and I don’t know where I’d end up).
My explanation of why I think it can be good for the world to work on improving model capabilities at Anthropic looks like an assessment of a long list of pros and cons and murky things of nonobvious sign (eg safety research on more powerful models, risk of leaks to other labs, race/competition dynamics among US labs) without a single crisp narrative, but “have the US win the AI race” doesn’t show up prominently in that list for me.
A proper Bayesian currently at less than 0.5% credence for a proposition P should assign a less than 1 in 100 chance that their credence in P rises above 50% at any point in the future. This isn’t a catch for someone who’s well-calibrated.
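A sketch of why (treating a well-calibrated agent's credence trajectory as a martingale, stopped the first time it crosses 1/2; the notation here is my own):

```latex
% p_0: current credence in P (here p_0 < 0.005).
% q: probability that the credence ever rises above 1/2.
% Conservation of expected evidence / optional stopping gives
p_0 \;=\; \mathbb{E}[p_\tau] \;\ge\; q \cdot \tfrac{1}{2} + (1 - q) \cdot 0
\quad\Longrightarrow\quad
q \;\le\; 2 p_0 \;<\; 2 \cdot 0.005 \;=\; 0.01 .
```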
In the example you give, the extent to which it seems likely that critical typos would happen and trigger this mechanism by accident is exactly the extent to which an observer of a strange headline should discount their trust in it! Evidence for unlikely events cannot be both strong and probable-to-appear, or the events would not be unlikely.
Catastrophic Regressional Goodhart: Appendix
An example of the sort of strengthening I wouldn’t be surprised to see is something like “If is not too badly behaved in the following ways, and for all we have [some light-tailedness condition] on the conditional distribution , then catastrophic Goodhart doesn’t happen.” This seems relaxed enough that you could actually encounter it in practice.
I’m not sure what you mean formally by these assumptions, but I don’t think we’re making all of them. Certainly we aren’t assuming things are normally distributed—the post is in large part about how things change when we stop assuming normality! I also don’t think we’re making any assumptions with respect to additivity; is more of a notational or definitional choice, though as we’ve noted in the post it’s a framing that one could think doesn’t carve reality at the joints. (Perhaps you meant something different by additivity, though—feel free to clarify if I’ve misunderstood.)
Independence is absolutely a strong assumption here, and I’m interested in further explorations of how things play out in different non-independent regimes—in particular we’d be excited about theorems that could classify these dynamics under a moderately large space of non-independent distributions. But I do expect that there are pretty similar-looking results where the independence assumption is substantially relaxed. If that’s false, that would be interesting!
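For intuition about the kind of effect at stake, here's a small simulation (my own notation and distribution choices, not the post's): optimize a proxy equal to true utility plus an independent error, and compare a light-tailed error against a heavy-tailed one.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2_000_000
utility = rng.normal(size=n)             # light-tailed "true utility"
errors = {
    "light-tailed error (normal)": rng.normal(size=n),
    "heavy-tailed error (Cauchy)": rng.standard_cauchy(size=n),
}

for name, err in errors.items():
    proxy = utility + err                        # the thing being optimized
    cutoff = np.quantile(proxy, 0.9999)          # condition on very high proxy
    mean_utility = utility[proxy >= cutoff].mean()
    # With normal error, extreme proxy values come partly from high utility;
    # with Cauchy error, they're almost entirely error, and mean utility stays ~0.
    print(f"{name}: mean true utility among top-proxy samples ≈ {mean_utility:.2f}")
```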
Late edit: Just a note that Thomas has now published a new post in the sequence addressing things from a non-independence POV.
.000002% — that is, one in five hundred thousand

0.000002 would be one in five hundred thousand, but with the percent sign it’s one in fifty million.
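(A quick illustrative check of the factor-of-100 point:)

```python
x = 0.000002          # the number read without a percent sign
print(1 / x)          # ~500,000: one in five hundred thousand
print(1 / (x / 100))  # ~50,000,000: one in fifty million once the % is applied
```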
Indeed, even on basic Bayesianism, volatility is fine as long as the averages work out
I agree with this as far as the example given, but I want to push back on oscillation (in the sense of regularly going from one estimate to another) being Bayesian. In particular, the odds you should put on assigning 20% in the future, then 30% after that, then 20% again, then 30% again, and so on for ten up-down oscillations, shouldn’t be more than half a percent, because each 20 → 30 jump can be at most 2⁄3 probable and each 30 → 20 jump at most 7⁄8 (and (2⁄3)^10 × (7⁄8)^10 ≈ 0.0045).
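(The arithmetic, as a quick check; the bounds come from requiring that expected future credence not exceed current credence:)

```python
# A 20% -> 30% jump can have probability at most 0.2/0.3 (otherwise expected
# future credence would exceed the current 20%); a 30% -> 20% jump at most
# 0.7/0.8, by the same argument applied to not-P.
p_up, p_down = 0.2 / 0.3, 0.7 / 0.8
print((p_up * p_down) ** 10)  # ≈ 0.0045: under half a percent for ten oscillations
```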
So it’s fine to think that you’ve got a decent chance of having all kinds of credences in the future, but thinking “I’ll probably feel one of two ways a few times a week for the next year” is not the kind of belief a proper Bayesian would have. (Not that I think there’s an obvious change to one’s beliefs you should try to hammer in by force, if that’s your current state of affairs, but I think it’s worth flagging that something suboptimal is going on when this happens.)
I’ve gotten enormous value out of LW and its derived communities during my life, at least some of which is attributable to the LW2.0 revival and its effects on those communities. More recently, since moving to the Bay, I’ve been very excited by a lot of the in-person events that Lighthaven has helped facilitate. Also, LessWrong is doing so many things right as a website and source-of-content that no one else does (karma-gated RSS feeds! separate upvote and agree-vote! built-in LaTeX support!) and even if I had no connection to the other parts of its mission I’d want to support the existence of excellently-done products. (Of course there’s also the altruistic case for impact on how-well-the-future-goes, which I find compelling on its own merits.) Have donated $5k for now, but I might increase that when thinking more seriously about end-of-year donations.
(Conflict of interest notice: two of my housemates work at Lightcone Infrastructure and I would be personally sad and slightly logistically inconvenienced if they lost their jobs. I don’t think this is a big contributor to my donation.)