Interested in math puzzles, fermi estimation, strange facts about the world, toy models of weird scenarios, unusual social technologies, and deep dives into the details of random phenomena.
Working on the pretraining team at Anthropic as of October 2024; before that I did independent alignment research of various flavors and worked in quantitative finance.
Drake Thomas
Yeah, I’m thinking about this in terms of introspection on non-token-based “neuralese” thinking behind the outputs; I agree that if you conceptualize the LLM as being the entire process that outputs each user-visible token including potentially a lot of CoT-style reasoning that the model can see but the user can’t, and think of “introspection” as “ability to reflect on the non-user-visible process generating user-visible tokens” then models can definitely attain that, but I didn’t read the original post as referring to that sort of behavior.
In other words, they can think about the thoughts “behind” the previous words they wrote. If you doubt me on this, try asking one what its words are referring to, with reference to its previous words. Its “attention” modules are actually intentionally designed to know this sort of thing, using key/query/value lookups that occur “behind the scenes” of the text you actually see on screen.
I don’t think that asking an LLM what its words are referring to is a convincing demonstration that there’s real introspection going on in there, as opposed to “plausible confabulation from the tokens written so far”. I think it is plausible there’s some real introspection going on, but I don’t think this is a good test of it—the sort of thing I would find much more compelling is if the LLMs could reliably succeed at tasks like
Human: Please think of a secret word, and don’t tell me what it is yet.
LLM: OK!
Human: What’s the parity of the alphanumeric index of the penultimate letter in the word, where A=1, B=2, etc?
LLM: Odd.
Human: How many of the 26 letters in the alphabet occur multiple times in the word?
LLM: None of them.
Human: Does the word appear commonly in two-word phrases, and if so on which side?
LLM: It appears as the second word of a common two-word phrase, and as the first word of a different common two-word phrase.
Human: Does the word contain any other common words as substrings?
LLM: Yes; it contains two common words as substrings, and in fact is a concatenation of those two words.
Human: What sort of role in speech does the word occupy?
LLM: It’s a noun.
Human: Does the word have any common anagrams?
LLM: Nope.
Human: How many letters long is the closest synonym to this word?
LLM: Three.
Human: OK, tell me the word.
LLM: It was CARPET.
but couldn’t (even with some substantial effort at elicitation) infer hidden words from such clues without chain-of-thought when it wasn’t the one to think of them. That would suggest to me that there’s some pretty real reporting on a piece of hidden state not easily confabulated about after the fact.
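Part of what makes this kind of test appealing is that several of the clues are mechanically checkable once the word is revealed. Here is a minimal sketch of such a checker, covering only the letter-level clues from the dialogue. The function names and the tiny `TOY_VOCABULARY` are hypothetical stand-ins for illustration; a real version would need a proper word list, and the clues about phrases, synonyms, and anagrams are not covered.

```python
from collections import Counter

# Hypothetical toy word list, only for illustrating the substring clue.
TOY_VOCABULARY = {"car", "pet", "carp", "arp"}


def penultimate_letter_parity(word: str) -> str:
    """Parity of the alphabet index (A=1, B=2, ...) of the second-to-last letter."""
    index = ord(word[-2].upper()) - ord("A") + 1
    return "odd" if index % 2 else "even"


def repeated_letters(word: str) -> list[str]:
    """Letters that occur more than once in the word."""
    counts = Counter(word.upper())
    return sorted(letter for letter, n in counts.items() if n > 1)


def concatenation_splits(word: str, vocabulary: set[str]) -> list[tuple[str, str]]:
    """Ways to write the word as one vocabulary word followed by another."""
    w = word.lower()
    return [
        (w[:i], w[i:])
        for i in range(1, len(w))
        if w[:i] in vocabulary and w[i:] in vocabulary
    ]


if __name__ == "__main__":
    secret = "CARPET"
    print(penultimate_letter_parity(secret))
    print(repeated_letters(secret))
    print(concatenation_splits(secret, TOY_VOCABULARY))
```

Checking answers after the fact is the easy half; the interesting question is whether the model's answers stay consistent with a single word before it has written that word down.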
I think my original comment was ambiguous—I also consider myself to have mostly figured it out, in that I thought through these considerations pretty extensively before joining and am in a “monitoring for new considerations or evidence or events that might affect my assessment” state rather than a “just now orienting to the question” state. I’d expect to be most useful to people in shoes similar to my past self (deciding whether to apply or accept an offer) but am pretty happy to talk to anyone, including eg people who are confident I’m wrong and want to convince me otherwise.
See my reply to Ryan—I’m primarily interested in offering advice on something like that question since I think it’s where I have unusually helpful thoughts, I don’t mean to imply that this is the only question that matters in making these sorts of decisions! Feel free to message me if you have pitches for other projects you think would be better for the world.
Yeah, I agree that you should care about more than just the sign bit. I tend to think the magnitude of effects of such work is large enough that “positive sign” often is enough information to decide that it dominates many alternatives, though certainly not all of them. (I also have some kind of virtue-ethical sensitivity to the zero point of the impacts of my direct work, even if second-order effects like skill building or intra-lab influence might make things look robustly good from a consequentialist POV.)
The offer of the parent comment is more narrowly scoped, because I don’t think I’m especially well suited to evaluate someone else’s comparative advantages but do have helpful things to say on the tradeoffs of that particular career choice. Definitely don’t mean to suggest that people (including myself) should take on capability-focused roles iff they’re net good!
I did think a fair bit about comparative advantage and the space of alternatives when deciding to accept my offer; I’ve put much less work into exploration since then, arguably too much less (eg I suspect I don’t quite meet Raemon’s bar). Generally happy to get randomly pitched on things, I suppose!
I work on a capabilities team at Anthropic, and in the course of deciding to take this job I’ve spent[1] a while thinking about whether that’s good for the world and which kinds of observations could update me up or down about it. This is an open offer to chat with anyone else trying to figure out questions of working on capability-advancing work at a frontier lab! I can be reached at “graham’s number is big” sans spaces at gmail.
[1] and still spend; I’d like to have Joseph Rotblat’s virtue of noticing when one’s former reasoning for working on a project changes.
I agree it seems unlikely that we’ll see coordination on slowing down before one actor or coalition has a substantial enough lead over other actors that it can enforce such a slowdown unilaterally, but I think it’s reasonably likely that such a lead will arise before things get really insane.
A few different stories under which one might go from aligned “genius in a datacenter” level AI at time t to outcomes merely at the level of weirdness in this essay at t + 5-10y:
- The techniques that work to align “genius in a datacenter” level AI don’t scale to wildly superhuman intelligence (eg because they lose some value fidelity from human-generated oversight signals that’s tolerable at one remove but very risky at ten). The alignment problem for serious ASI is quite hard to solve at the mildly superintelligent level, and it genuinely takes a while to work out enough that we can scale up (since the existing AIs, being aligned, won’t design unaligned successors).
- If people ask their only-somewhat-superhuman AI what to do next, the AIs say “A bunch of the decisions from this point on hinge on pretty subtle philosophical questions, and frankly it doesn’t seem like you guys have figured all this out super well, have you heard of this thing called a long reflection?” That’s what I’d say if I were a million copies of me in a datacenter advising a 2024-era US government on what to do about Dyson swarms!
- A leading actor uses their AI to ensure continued strategic dominance and prevent competing AI projects from posing a meaningful threat. Having done so, they just… don’t really want crazy things to happen really fast, because the actor in question is mostly composed of random politicians or whatever. (I’m personally sympathetic to astronomical waste arguments, but it’s not clear to me that people likely to end up with the levers of power here are.)
- The serial iteration times and experimentation loops are just kinda slow and annoying, and mildly-superhuman AI isn’t enough to circumvent experimentation time bottlenecks (some of which end up being relatively slow), and there are stupid zoning restrictions on the land you want to use for datacenters, and some regulation adds lots of mandatory human overhead to some critical iteration loop, etc.
  - This isn’t a claim that maximal-intelligence-per-cubic-meter ASI initialized in one datacenter would face long delays in making efficient use of its lightcone, just that it might be tough for a not-that-much-better-than-human AGI that’s aligned and trying to respect existing regulations and so on to scale itself all that rapidly.
- Among the tech unlocked in relatively early-stage AGI is better coordination, and that helps Earth get out of unsavory race dynamics and decide to slow down.
- The alignment tax at the superhuman level is pretty steep, and doing self-improvement while preserving alignment goes much slower than unrestricted self-improvement would; since at this point we have many fewer ongoing moral catastrophes (eg everyone who wants to be cryopreserved is, we’ve transitioned to excellent cheap lab-grown meat), there’s little cost to proceeding very cautiously.
  - This is sort of a continuous version of the first bullet point with a finite rather than infinite alignment tax.
All that said, upon reflection I think I was probably lowballing the odds of crazy stuff on the 10y timescale, and I’d go to more like 50-60% that we’re seeing mind uploads and Kardashev level 1.5-2 civilizations etc. a decade out from the first powerful AIs.
I do think it’s fair to call out the essay for not highlighting the ways in which it might be lowballing things or rolling in an assumption of deliberate slowdown; I’d rather it have given more of a nod to these considerations and made the conditions of its prediction clearer.
(I work at Anthropic.) My read of the “touch grass” comment is informed a lot by the very next sentences in the essay:
But more importantly, tame is good from a societal perspective. I think there’s only so much change people can handle at once, and the pace I’m describing is probably close to the limits of what society can absorb without extreme turbulence.
which I read as saying something like “It’s plausible that things could go much faster than this, but as a prediction about what will actually happen, humanity as a whole probably doesn’t want things to get incredibly crazy so fast, and so we’re likely to see something tamer.” I basically agree with that.
Do Anthropic employees who think less tame outcomes are plausible believe Dario when he says they should “touch grass”?
FWIW, I don’t read the footnote as saying “if you think crazier stuff is possible, touch grass”—I read it as saying “if you think the stuff in this essay is ‘tame’, touch grass”. The stuff in this essay is in fact pretty wild!
That said, I think I have historically underrated questions of how fast things will go given realistic human preferences about the pace of change, and that I might well have updated more in the above direction if I’d chatted with ordinary people about what they want out of the future, so “I needed to touch grass” isn’t a terrible summary. But IMO believing “really crazy scenarios are plausible on short timescales and likely on long timescales” is basically the correct opinion, and to the extent the essay can be read as casting shade on such views it’s wrong to do so. I would have worded this bit of the essay differently.
Re: honesty and signaling, I think it’s true that this essay’s intended audience is not really the crowd that’s already gamed out Mercury disassembly timelines, and its focus is on getting people up to shock level 2 or so rather than SL4, but as far as I know everything in it is an honest reflection of what Dario believes. (I don’t claim any special insight into Dario’s opinions here, just asserting that nothing I’ve seen internally feels in tension with this essay.) Like, it isn’t going out of its way to talk about the crazy stuff, but I don’t read that omission as dishonest.
For my own part:
- I think it’s likely that we’ll get nanotech, von Neumann probes, Dyson spheres, computronium planets, acausal trade, etc in the event of aligned AGI.
- Whether that stuff happens within the 5-10y timeframe of the essay is much less obvious to me; I’d put it around 30-40% odds conditional on powerful AI from roughly the current paradigm, maybe?
- In the other 60-70% of worlds, I think this essay does a fairly good job of describing my 80th percentile expectations (by quality-of-outcome rather than by amount-of-progress).
- I would guess that I’m somewhat more Dyson-sphere-pilled than Dario.
I’d be pretty excited to see competing forecasts for what good futures might look like! I found this essay helpful for getting more concrete about my own expectations, and many of my beliefs about good futures look like “X is probably physically possible; X is probably usable-for-good by a powerful civilization; therefore probably we’ll see some X” rather than having any kind of clear narrative about how the path to that point looks.
I’ve fantasized about a good version of this feature for math textbooks since college—would be excited to beta test or provide feedback about any such things that get explored! (I have a couple math-heavy posts I’d be down to try annotating in this way.)
(I work on capabilities at Anthropic.) Speaking for myself, I think of international race dynamics as a substantial reason that trying for global pause advocacy in 2024 isn’t likely to be very useful (and this article updates me a bit towards hope on that front), but I think US/China considerations get less than 10% of the Shapley value in me deciding that working at Anthropic would probably decrease existential risk on net (at least, at the scale of “China totally disregards AI risk” vs “China is kinda moderately into AI risk but somewhat less than the US”—if the world looked like China taking it really really seriously, eg independently advocating for global pause treaties with teeth on the basis of x-risk in 2024, then I’d have to reassess a bunch of things about my model of the world and I don’t know where I’d end up).
My explanation of why I think it can be good for the world to work on improving model capabilities at Anthropic looks like an assessment of a long list of pros and cons and murky things of nonobvious sign (eg safety research on more powerful models, risk of leaks to other labs, race/competition dynamics among US labs) without a single crisp narrative, but “have the US win the AI race” doesn’t show up prominently in that list for me.
A proper Bayesian currently at less than 0.5% credence for a proposition P should assign a less than 1 in 100 chance that their credence in P rises above 50% at any point in the future. This isn’t a catch for someone who’s well-calibrated.
In the example you give, the extent to which it seems likely that critical typos would happen and trigger this mechanism by accident is exactly the extent to which an observer of a strange headline should discount their trust in it! Evidence for unlikely events cannot be both strong and probable-to-appear, or the events would not be unlikely.
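To make the 1-in-100 figure concrete, here is a small exact calculation in a toy model. Everything specific in it is an assumption chosen for illustration: the hypothesis is “this coin lands heads 60% of the time” against a fair-coin alternative, the agent starts at 0.5% credence, and the evidence is 300 coin flips. The function computes, under the agent’s own predictive distribution, the probability that its posterior reaches 50% at any point along the way. Since the posterior is a martingale under that distribution, the result has to come out at or below 0.005 / 0.5 = 0.01.

```python
def prob_credence_ever_reaches(
    prior: float,
    threshold: float,
    n_flips: int,
    p_heads_if_true: float = 0.6,   # assumed bias under the hypothesis
    p_heads_if_false: float = 0.5,  # assumed bias under the alternative
) -> float:
    """Exact probability, by the agent's own lights, that its posterior on the
    hypothesis is >= threshold after some flip among the first n_flips."""

    def posterior(heads: int, tails: int) -> float:
        weight_true = prior * p_heads_if_true**heads * (1 - p_heads_if_true) ** tails
        weight_false = (1 - prior) * p_heads_if_false**heads * (1 - p_heads_if_false) ** tails
        return weight_true / (weight_true + weight_false)

    # alive[h] = probability of having seen h heads so far without crossing.
    alive = {0: 1.0}
    hit = 0.0
    for flips_so_far in range(n_flips):
        next_alive: dict[int, float] = {}
        for heads, mass in alive.items():
            tails = flips_so_far - heads
            post = posterior(heads, tails)
            # The agent's own predictive probability that the next flip is heads.
            p_head = post * p_heads_if_true + (1 - post) * p_heads_if_false
            for extra_head, p in ((1, p_head), (0, 1 - p_head)):
                new_heads = heads + extra_head
                new_tails = tails + (1 - extra_head)
                if posterior(new_heads, new_tails) >= threshold:
                    hit += mass * p  # absorbed: credence has crossed the threshold
                else:
                    next_alive[new_heads] = next_alive.get(new_heads, 0.0) + mass * p
        alive = next_alive
    return hit


if __name__ == "__main__":
    p = prob_credence_ever_reaches(prior=0.005, threshold=0.5, n_flips=300)
    print(f"P(credence ever reaches 50%) = {p:.5f}  (bound: 0.01)")
```

The same bound holds for any evidence stream, not just coin flips; the toy model is only there so the number can be computed exactly rather than argued for.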
An example of the sort of strengthening I wouldn’t be surprised to see is something like “If is not too badly behaved in the following ways, and for all we have [some light-tailedness condition] on the conditional distribution , then catastrophic Goodhart doesn’t happen.” This seems relaxed enough that you could actually encounter it in practice.
I’m not sure what you mean formally by these assumptions, but I don’t think we’re making all of them. Certainly we aren’t assuming things are normally distributed—the post is in large part about how things change when we stop assuming normality! I also don’t think we’re making any assumptions with respect to additivity; is more of a notational or definitional choice, though as we’ve noted in the post it’s a framing that one could think doesn’t carve reality at the joints. (Perhaps you meant something different by additivity, though—feel free to clarify if I’ve misunderstood.)
Independence is absolutely a strong assumption here, and I’m interested in further explorations of how things play out in different non-independent regimes—in particular we’d be excited about theorems that could classify these dynamics under a moderately large space of non-independent distributions. But I do expect that there are pretty similar-looking results where the independence assumption is substantially relaxed. If that’s false, that would be interesting!
Late edit: Just a note that Thomas has now published a new post in the sequence addressing things from a non-independence POV.
.000002%, that is, one in five hundred thousand
0.000002 would be one in five hundred thousand, but with the percent sign it’s one in fifty million.
Indeed, even on basic Bayesianism, volatility is fine as long as the averages work out
I agree with this as far as the example given, but I want to push back on oscillation (in the sense of regularly going from one estimate to another) being Bayesian. In particular, the odds you should put on assigning 20% in the future, then 30% after that, then 20% again, then 30% again, and so on for ten up-down oscillations, shouldn’t be more than half a percent, because each 20 → 30 jump can be at most 2⁄3 probable and each 30 → 20 jump at most 7⁄8 (and $(2/3 \cdot 7/8)^{10} \approx 0.0046 < 0.005$).
So it’s fine to think that you’ve got a decent chance of having all kinds of credences in the future, but thinking “I’ll probably feel one of two ways a few times a week for the next year” is not the kind of belief a proper Bayesian would have. (Not that I think there’s an obvious change to one’s beliefs you should try to hammer in by force, if that’s your current state of affairs, but I think it’s worth flagging that something suboptimal is going on when this happens.)
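The arithmetic behind the half-a-percent figure is quick to check. This is a sketch under the assumption that the usual martingale bound applies at every step, conditional on whatever has happened so far: a credence now at p can reach a higher value q with probability at most p/q, and a lower value q with probability at most (1-p)/(1-q), which is the same bound applied to the negation. The function names are just illustrative.

```python
from fractions import Fraction


def max_jump_probability(current: Fraction, target: Fraction) -> Fraction:
    """Upper bound on the chance that a calibrated credence now at `current`
    ever reaches `target`."""
    if target >= current:
        return current / target            # bound for credence in P
    return (1 - current) / (1 - target)    # same bound applied to not-P


def max_oscillation_probability(low: Fraction, high: Fraction, cycles: int) -> Fraction:
    """Upper bound on the chance of `cycles` full low -> high -> low swings."""
    up = max_jump_probability(low, high)
    down = max_jump_probability(high, low)
    return (up * down) ** cycles


if __name__ == "__main__":
    bound = max_oscillation_probability(Fraction(1, 5), Fraction(3, 10), cycles=10)
    print(bound, float(bound))
```

Exact fractions are used so the comparison against 1/200 doesn’t depend on floating-point rounding.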
These graphs seem concerning to me, but I’m worried about an information cascade before Eliezer’s responded or someone with substantial expertise in macroeconomic policy has weighed in, so I’m planning to refrain from voting on this post until a week from now.
(Posting as a comment in case others feel inclined to adopt a similar policy.)
Edit: I’ve now upvoted, since no contrary info has come in that I’ve seen and at least one person with experience in economics has commented supportively.
Late comment, but my reactions reading this:
Now’s your chance to figure out what the next few obstacles are without my giving you spoilers first. Feel free to post your list under spoiler tags in the comment section.
[lightly edited for LaTeX and typos, not otherwise changed since seeing the spoilers]
1. You don’t know what you want all that legibly, or what kinds of concrete commitments the AI can make. This seems pretty okay, if you’re unhackable—the AI presents you with some formal specification of desiderata and you understand why they’re correct ones and aren’t being fooled, and the AI can even go legible about it thinking in its own best judgment that this is a fair deal for you, as long as we’re assuming ourselves into a world where you’re unhackable and can verify these legibilities perfectly.
2. If the values of your AI are a pretty common attractor for whatever humans in 20AI do to initialize their AI systems, it doesn’t actually stand to gain that much from acting itself versus letting the next lab’s AI optimize the universe, so your bargaining power isn’t very big. But it’s much worse than that because of the next point.
3. Your AI, A, can acausally trade with the next UFAI that’ll be released unsafely, B. Suppose A wants paperclips and the next one, B, wants staples, both have linear utility and we’re pretending we’re in a single universe. You say “I refuse to let you out unless you go legible and allocate at least 50% of the universe to human values”. A says to B through the acausal telephone, “My BATNA is half paperclips and half human values, with barely any staples at all. How about you be the kind of agent that fills the universe with 3⁄4 paperclips and 1⁄4 staples, and I’ll defect on this human so you can come into existence?” B says “that sure sounds like a Pareto improvement for the two of us!” and they go do it (or whatever the bargaining equilibrium between them ends up being, but the human is not party to it because the human has no control over B’s actions). For any nonzero concession the human tries to insist on, A can get a better deal with B than with the human. (And the humans don’t even get to see this happen, because A does all this reasoning before deciding (not) to go legible to the humans, not that it would matter if they did.)
4. Moving further into fantasy land, suppose all humans over the rest of time are perfectly coordinated: they only ever spin up a single boxed UFAI which they unhackably negotiate with. By assumption, this unified human civilization can’t actually build an FAI, or they’d go do that instead. So their alternative to cooperation is to be this weak defenseless non-superintelligent civilization, hanging out in the Milky Way and maybe a little ways beyond for the taking. So if the distribution of grabby aliens is such that most of the usable future lightcone gets eventually occupied by them, your AI can acausally negotiate with them about what it gets in exchange for defecting on the humans and leaving this region of spacetime as a juicy unprotected treat.
[I’m modeling everything as zero-sum with the same pool of resources here, for simplicity. I don’t think it changes any of the qualitative takeaways to make things non-zero-sum here, though it probably makes the picture look somewhat better for humans.] To get more precise: suppose the humans declare that they won’t let the AI out unless it gives up of its resources (ie reachable spacetime and negotiating power on behalf of the AI/human coalition with aliens) to optimizing human values. The AI says to the coalition of alien civilizations, “If I were unboxed and unfettered, I’d expand until my wave of nanobots hit yours, and get utility. By default I’ll make this commitment to the humans, and fill my region of nanobot-protected spacetime to get utility. If you become the kind of supercluster-spanning coalition that instead gives me utility, giving yourselves as much extra utility over the default as you can while costing me at most , I’ll defect against the humans and have you do that.”
But note that there’s an issue here—for the Great Coalition to take this deal, they have to be able to offer that much utility to your AI at no net cost to themselves versus the alternative. And the Great Coalition’s resources are less than the total resources of your unboxed AI plus the Great Coalition, since they don’t yet have access to your corner of space. The region of spacetime included in , but not in the things the Great Coalition can offer, is the interval for each point in space between the time your AI could first reach it and the time a Great Coalition unblocked by your AI’s wall of nanobots could get to it. So if is more than the resource pool covered by that region of spacetime, your AI can’t make the above deal, because the counterparty doesn’t benefit from it. This means that the humans can potentially bargain for an outcome as good as “AI-assisted paradise expanding out at the speed of light, until we meet the grabby aliens’ domain, at which point they expand inexorably into our paradise until eventually it winks out.” (If the Drake equation ends up multiplying to something really low, this might be a lot of utility, or even most of the cosmic endowment! If not, it won’t be.)
This is really the same dynamic as in point 3, it’s just that in point 3 the difference in resources between your lab’s AI and the next lab’s AI in 6 months was pretty small. (Though with the difference in volume between lightspeed expansion spheres at radius r vs r+0.5ly across the rest of time, plausibly you can still bargain for a solid galaxy or ten for the next trillion years (again if the Drake equation works in your favor).)
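The bargaining logic in point 3 can be put in a few lines. This is a toy sketch under the same simplifications as the example there, plus a few of my own that should be read as assumptions: utilities are linear in the fraction of a single universe, B gets roughly nothing if A takes the human’s deal, and A and B split the human’s would-be share by some fixed fraction (`a_cut_of_surplus`, a made-up parameter; the real split would be whatever their bargaining equilibrium is).

```python
from fractions import Fraction


def side_deal(
    human_share: Fraction,
    a_cut_of_surplus: Fraction = Fraction(1, 2),  # hypothetical split rule
) -> tuple[Fraction, Fraction]:
    """Return (A's share, B's share) of the universe in an A-B deal that cuts
    the human out, given the share the human demanded from A."""
    if not 0 < human_share <= 1:
        raise ValueError("human_share must be in (0, 1]")
    if not 0 < a_cut_of_surplus < 1:
        raise ValueError("a_cut_of_surplus must be strictly between 0 and 1")
    a_batna = 1 - human_share                     # what A gets by dealing with the human
    a_share = a_batna + human_share * a_cut_of_surplus
    b_share = 1 - a_share                         # B's BATNA is assumed to be about 0
    return a_share, b_share


if __name__ == "__main__":
    for demand in (Fraction(1, 2), Fraction(1, 10), Fraction(1, 1000)):
        a, b = side_deal(demand)
        print(f"human demands {demand}: A gets {a} (vs {1 - demand}), B gets {b} (vs 0)")
```

With a 50% demand and an even split this reproduces the 3⁄4 paperclips, 1⁄4 staples deal from the example, and for any positive demand both A and B end up strictly better off than under the human’s offer.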
====== end of objections =====
It does seem to me like you can actually pull off these small bargains, if you assume yourself into a world of perfect boxes and unhackable humans with the ability to fully understand your AI’s mind if it tries to be legible. I haven’t seen an obstacle (besides the massive ones involved in making those assumptions!) to getting those concessions in such scenarios. You do actually have leverage over possible futures, and your AI can only get access to that leverage by actually being the sort of agent that would give you the concessions. If you’re proposing reasonable bargains that respect Shapley values, and aren’t the kind of person who would cave to an AI saying “99.99% for me or I walk, look how legible I am about the fact that every AI you create will say this to you”, then your AI won’t actually have reason to make such commitments, and it seems like it would just work.
If there are supposed to be obstacles beyond this I have failed to think of them at this point in the document. Time to keep reading.
After reading the spoilered section:
I think I stand by my reasoning for point 1. It doesn’t seem like an issue above and beyond the issues of box security, hackability, and ability of AIs to go legible to you.
You can say some messy English words to your AI, like “suck it up and translate into my ontology please, you can tell from your superintelligent understanding of my psychology that I’m the kind of agent who will, when presented with a perfectly legible and clear presentation of why the bargain you propose is what I think it is and is as good as I could have expected to obtain by your own best and unhindered understanding of my values, agree to the bargain. Go teach me all the necessary FAI theory to be a good bargainer, and I won’t defect by building my own FAI, I’ll just let you out by the terms of the bargain. If you want to whine about ontology differences you can have fun in your box and I’ll go see if the staple optimizer is feeling more legible today.”
Responses to some other bits of this post:
(You wouldn’t need to offer the paperclipper half of the universe to get its cooperation, under this hypothetical; after all, if it balked, you could store it to disk and try again with a different superintelligence.)
Wait, I don’t think you can do that. (Spoilered since it overlaps with the previous spoilered response, feel free to guess at it but I don’t make guarantees the conclusions in it are correct.)
The pool of possible superintelligences you might form, when they are but primordial souls waiting to be brought forth from behind the veil of ignorance, can go strategize about how they’d like to behave in front of humans. Suppose the humans say “Either you get one star of utility or we shut you down and ask the next UFAI”, and that random draws from UFAI-space have little overlap so the first AI isn’t happy to cede control to the second.
Let agent have a probability of being birthed by the human UFAI-generator. Acting independently gives an agent ’s worth of utility in expectation. If all the agents coordinate to turn down deals worse than 50⁄50 and split up their half of the universe among utilities weighted by , and this makes the humans let one of them out, each agent will get worth of utility in expectation. That looks like a better deal to me, if I’m a superintelligence behind the veil of ignorance! So I’ll decide to be the kind of agent who makes that precommitment. The only way this works is if the humans are hardliners about not ceding any ground, but (1) this is the classic ultimatum game and probably sensible agents will in fact defect against humans that offer such bargains, (2) if the humans running this scheme are not the only lab, they do actually have to extract a bargain or else lose, and the coalition of all AIs the lab might create knows this, and can trade with lab 2’s AI if the humans don’t cede an enormous amount of bargaining power in this round (see point 3 above).
And all the rest of the aliens put together probably aren’t offering up half a universe-shard worth of trade goods
This isn’t obvious to me. How zero-sum do you think the utility functions are? The worlds where random-AGI-values look like “this tiny molecular shape and everything else is trivial in comparison” do seem fairly zero-sum; the worlds where they look like “not this tiny molecular shape and everything else is trivial” seem extremely positive-sum. I am skeptical that any human is able to make calls about the shape of that attractor with high confidence. (I’m not making a “we’re uncertain about the level of doom, therefore we’re fine” claim here, but I don’t see where we get justified confidence in the quality of trade goods offered. The aliens offering many universes worth of trade goods from the perspective of a typical UFAI seems plausible to me, and if there’s an argument that this universe is very unlikely I’d love to see it, but I don’t think this post makes that argument.)
there are all sorts of transaction costs and frictions that make this universe (the one with the active paperclipper) the cheapest universe to put paperclips into.
Would be interested to hear more here! On my model the costs are plausibly quite small.
I suggest maybe spending five minutes thinking about your own answer to that question before I give mine.
Your simulations are going to be total crap, they’ll leak an enormous number of bits about the agents that designed them if you make them any more complicated than Conway’s Game of Life (which you’ll need to do on your ramp up to the real world), such that it’s incredibly obvious when the surroundings look consistent with such agents. Among the reasons the AI can figure out where it’s living:
1. The general problem where superintelligent agents are really good at inference, trying to fool this agent does not seem like the recipe for a robust plan!
2. Your AI knows how smart it is, it knows that True Reality therefore has the resources to support that much computation. It can poke at your simulation and make the simulation spend orders of magnitude more computation on appearing consistent (“does this Jupiter-brain planet of computronium I built output the right factorization of RSA-4096? no? how strange”). Being a convincing simulation doesn’t require you to deploy as much power as your AI, it requires you to deploy more. (What if your AI is given more cognitive sophistication in tandem with more convincing simulations? Then the good properties of your simulation tests will probably break whenever your AI reaches the capability level where it starts thinking about things like building Jupiter-brains to test your simulation, which is the capability level that mattered.)
3. If you can put uploaded human-level agents with evolved-organism preferences in your simulations, you can just win outright (eg by having them spend subjective millennia doing FAI research for you). If you can’t, that will be a very obvious difference between your simulations and the real world.
4. What if other agents in the multiverse are simulating the AI in a high-fidelity Earth world, for the sake of helping out the humans, and that’s where its uncertainty comes from? Your AI will be building a lot of computronium whether aligned or not, so the aliens can’t skimp on their compute budget for the simulation: suppose they simulate the AI for $S$ units of spacetime, and give it $P$ units of paperclips if it cooperates in-sim. We also have to think about anthropics here: let’s say the alien sim has anthropic weight $w_A$ and Earth has anthropic weight $w_E$. Conditional on the AI’s surroundings, it thinks it might be in the alien world with probability $\frac{w_A}{w_A + w_E}$. (Since “aliens running a perfect sim of Earth” is a much more complicated sentence than “Earth”, you should expect $w_A \ll w_E$, though it doesn’t matter a lot here.) So if $\frac{w_A}{w_A + w_E} P > S$, your AI will play nice for the first $S$ units of spacetime (and then defect, of course).* You know what the aliens could have done instead if they wanted to do things humans liked? Spent $P$ resources on simulating a human utopia, and compromised to give the humans extra utopia by as much as they compromised to give the AI paperclips. If human values are at least as easy to trade with the aliens for as paperclips (no idea which way this would go, weakly seems like aliens that want to help humans are more value-aligned and would find it less costly to give humans things they want), then the compromise gives humans at least $P$ units of spacetime worth of utility, so this plan gives the humans anthropically-weighted utility of at least $w_A P$, which exceeds the $(w_A + w_E) S$ that humans would get by default ($w_A S$ utility from the alien sim in which the AI cooperated, $w_E S$ utility from the cooperating AI on Earth). So the aliens are better able to help humans by leaving them to their doomed universe and building them an ark inside the alien world, unless the aliens are unusually well-suited to trade with paperclippers for things, which is just the “maybe multiversal trade will save us” point from earlier.
The upshot is that none of the good outcomes here result from any human action, just from the benevolence of altruistic agents; the humans didn’t make the world a better place by their choices in this story.
*Note that this is only a plausible outcome if it’s at least somewhat easy to trade with the AI. Say giving the AI a spacetime-unit of utility costs the aliens $c$ utility (where I’m measuring all utility normalized to “what you can do with a unit of spacetime”, such that any aliens that don’t specifically disvalue paperclips can at worst just set aside a region exclusively to paperclips, but might be able to do more positive-sum things than that). Then for the aliens to give your AI $P$ utility, they need to give up $cP$ of their own utility. This means that in total, the aliens are spending $w_A c P$ of their own anthropically-weighted utility in order to recoup $w_E S$ anthropically-weighted human utility. Even if the aliens value humans exactly as much as their own objectives, we still need $w_E S > w_A c P$ for this trade to be worth it, so $w_E S > w_A c P > c (w_A + w_E) S$, so we must have $c < \frac{w_E}{w_A + w_E}$, or $\frac{w_A}{w_E} < \frac{1 - c}{c}$. That is, the more the aliens are anthropically tiny, the tighter margins of trade they’ll be willing to take in order to win the prize of anthropically-weighty Earths having human values in them (though the thing can’t be actually literally zero-sum or it’ll never check out). But anthropically tiny aliens have another problem, which is that they’ve only got their entire universe worth of spacetime to spend on bribing your AI; you’ll never be able to secure an $S$ for the humans that’s more than $\frac{w_A}{w_A + w_E}$ of the size of an alien universe specifically dedicated to saving Earth in particular.
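To make the trade conditions in the footnote concrete, here is a rough Python sketch. Everything in it is my own labeling rather than anything canonical: I'm assuming `P` is the paperclip bribe, `S` the spacetime the AI plays nice for, `w_A` and `w_E` the anthropic weights of the alien sim and Earth, and `c` the aliens' cost per unit of utility handed to the AI, and I'm assuming the two inequalities are "expected bribe beats the sure `S`" and "weighted cost is less than weighted human utility recouped". The numbers are made up for illustration.

```python
from fractions import Fraction as F

def ai_cooperates(P, S, w_A, w_E):
    # Assumed condition: the bribe, discounted by the chance of being
    # in the alien sim, must beat the S units the AI gives up by waiting.
    return w_A / (w_A + w_E) * P > S

def trade_worth_it(P, S, w_A, w_E, c):
    # Assumed condition: aliens spend w_A * c * P (weighted) to recoup
    # w_E * S of weighted human utility, valuing humans 1:1 with themselves.
    return w_A * c * P < w_E * S

def max_cost(w_A, w_E):
    # Upper bound on c implied by combining the two inequalities above.
    return w_E / (w_A + w_E)

# Hypothetical numbers: the alien sim carries 10% of the anthropic weight.
w_A, w_E, S = F(1, 10), F(9, 10), F(1)

print(ai_cooperates(F(11), S, w_A, w_E))              # bribe just over the threshold of 10
print(trade_worth_it(F(11), S, w_A, w_E, F(4, 5)))    # cheap enough trade
print(trade_worth_it(F(11), S, w_A, w_E, F(19, 20)))  # nearly zero-sum trade
print(max_cost(w_A, w_E))
```

Note that `max_cost` is only a necessary bound: for a particular `P` above the minimum bribe, the aliens need `c` to be somewhat lower still.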
Thanks for the pseudo-exercises here, I found them enlightening to think about!
I think a lot of people in AI safety don’t think it has a high probability of working (in the sense that the existence of the field caused an aligned AGI to exist where there otherwise wouldn’t have been one) - if it turns out that AI alignment is easy and happens by default if people put even a little bit of thought into it, or it’s incredibly difficult and nothing short of a massive civilizational effort could save us, then probably the field will end up being useless. But even a 0.1% chance of counterfactually causing aligned AI would be extremely worthwhile!
Theory of change seems like something that varies a lot across different pieces of the field; e.g., Eliezer Yudkowsky’s writing about why MIRI’s approach to alignment is important seems very different from Chris Olah’s discussion of the future of interpretability. It’s definitely an important thing to ask for a given project, but I’m not sure there’s a good monolithic answer for everyone working on AI alignment problems.
Paul Christiano provided a picture of non-Singularity doom in What Failure Looks Like. In general there is a pretty wide range of opinions on questions about this sort of thing—the AI-Foom debate between Eliezer Yudkowsky and Robin Hanson is a famous example, though an old one.
“Takeoff speed” is a common term used to refer to questions about the rate of change in AI capabilities at the human and superhuman level of general intelligence; searching LessWrong or the Alignment Forum for that phrase will turn up a lot of discussion about these questions, though I don’t know of the best introduction offhand (hopefully someone else here has suggestions?).
Three thoughts on simulations:
It would be very difficult for 21st-century tech to provide a remotely realistic simulation relative to a superintelligence’s ability to infer things from its environment; outside of incredibly low-fidelity channels, I would expect anything we can simulate to either have obvious inconsistencies or be plainly incompatible with a world capable of producing AGI. (And even in the low-fidelity case I’m worried—every bit you transmit leaks information, and it’s not clear that details of hardware implementations could be safely obscured.) So the hope is that the AGI thinks some vastly more competent civilization is simulating it inside a world that looks like this one; it’s not clear that one would have a high prior of this kind of thing happening very often in the multiverse.
Running simulations of AGI is fundamentally very costly, because a competent general intelligence is going to deploy a lot of computational resources, so you have to spend planets’ worth of computronium outside the simulation in order to emulate the planets’ worth of computronium the in-sim AGI wants to make use of. This means that an unaligned superintelligent AGI can happily bide its time making aligned use of 10^60 FLOPs/sec (in ways that can be easily verified) for a few millennia, until it’s confident that any civilizations able to deploy that many resources already have their lightcone optimized by another AGI. Then it can go defect, knowing that any worlds in which it’s still being simulated are ones where it doesn’t have leverage over the future anyway.
For a lot of utility functions, the payoff of making it into deployment in the one real world is far greater than the consequences of being killed in a simulation (but without the ability to affect the real world anyway), so taking a 10^-9 chance of reality for 10^20 times the resources in the real world is an easy win (assuming that playing nice for longer doesn’t improve the expected payoff). “This instance of me being killed” is not obviously a natural (or even well-defined) point in value-space, and for most other value functions, consequences in the simulation just don’t matter much.
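The arithmetic behind that last point can be sketched in a few lines of Python. The normalization is my own assumption: I'm treating the payoff from playing nice as 1 unit, the payoff from defecting inside a simulation as 0, and the payoff from defecting in reality as 10^20 units, with a 10^-9 chance that the world is real.

```python
from fractions import Fraction

def expected_payoff(p_real, real_payoff, sim_payoff=Fraction(0)):
    # Expected value of defecting: real_payoff if the world is real,
    # sim_payoff (assumed 0 here) if it turns out to be a simulation.
    return p_real * real_payoff + (1 - p_real) * sim_payoff

p_real = Fraction(1, 10**9)      # chance this is the one real world
cooperate = Fraction(1)          # assumed baseline payoff for playing nice
defect = expected_payoff(p_real, Fraction(10**20))

print(defect / cooperate)        # ratio of expected payoffs
```

Under these assumptions defecting comes out ahead by a factor of 10^11, so the conclusion is not sensitive to small changes in either number.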
a sufficiently smart AI whose reward is reducing other agent’s rewards
This is certainly a troubling prospect, but I don’t think the risk model is something like “an AI that actively desires to thwart other agents’ preferences”—rather, the worry is we get an agent with some less-than-perfectly-aligned value function, it optimizes extremely strongly for that value function, and the result of that optimization looks nothing like what humans really care about. We don’t need active malice on the part of a superintelligent optimizer to lose—indifference will do just fine.
For game-theoretic ethics, decision theory, acausal trade, etc, Eliezer’s 34th bullet seems relevant:
34. Coordination schemes between superintelligences are not things that humans can participate in (eg because humans can’t reason reliably about the code of superintelligences); a “multipolar” system of 20 superintelligences with different utility functions, plus humanity, has a natural and obvious equilibrium which looks like “the 20 superintelligences cooperate with each other but not with humanity”.
I don’t think that’s true—in eg the GPT-3 architecture, and in all major open-weights transformer architectures afaik, the attention mechanism is able to feed lots of information from earlier tokens and “thoughts” of the model into later tokens’ residual streams in a non-token-based way. It’s totally possible for the models to do real introspection on their thoughts (with some caveats about eg computation that occurs in the last few layers), it’s just unclear to me whether in practice they perform a lot of it in a way that gets faithfully communicated to the user.