Doing AI Safety research for ethical reasons.
My webpage.
Leave me anonymous feedback.
I operate by Crocker’s Rules.
Fantastic snapshot. I wonder (and worry) whether we’ll look back on it with similar feelings as those we have for What 2026 looks like now.
There is also no “last resort war plan” in which the president could break all of the unstable coordination failures and steer the ship.
[...]
There are no clear plans for what to do under most conditions, e.g. there is no clear plan for when and how the military should assume control over this technology.
These sound intuitively unlikely to me, by analogy to nuclear or bio. Of course, that is not to say these protocols will be sufficient or even sane, by analogy to nuclear or bio.
This makes it really unclear what to work on.
It’s not super obvious to me that there won’t be clever ways to change local incentives / improve coordination, and successful interventions in this direction would seem incredibly high-leveraged, since they’re upstream of many of the messy and decentralized failure modes. If they do exist, they probably look not like “a simple coordination mechanism”, and more like “a particular actor gradually steering high-stakes conversations (through a sequence of clever actions) to bootstrap minimal agreements”. Of course, similarity to past geopolitical situations does make it seem unlikely on priors.
There is no time to get to very low-risk worlds anymore. There is only space for risk reduction along the way.
My gut has been in agreement for some time that the most cost-effective x-risk reduction now probably looks like this.
I agree with conjunctiveness, although again I’m more optimistic about huge improvements. I mostly wanted to emphasize that I’m not sure there are structurally robust reasons (as opposed to personal whims) why huge spending on safety won’t happen.
Speaking for myself (not my coauthors), I don’t agree with your two items, because:
if your models are good enough at code analysis to increase their insecurity self-awareness, you can use them in other more standard and efficient ways to improve the dataset
doing self-critique the usual way (look over your own output) seems much more fine-grained and thus efficient than asking the model whether it “generally uses too many try-excepts”
More generally, I think behavioral self-awareness for capability evaluation is and will remain strictly worse than the obvious capability evaluation techniques.
That said, I do agree systematic inclusion of considerations about negative externalities should be a norm, and thus we should have done so. I will briefly say now that a) behavioral self-awareness seems differentially more relevant to alignment than capabilities, and b) we expected lab employees to find out about this themselves (in part because this isn’t surprising given out-of-context reasoning), and we in fact know that several lab employees did. Thus I’m pretty certain the positive externalities of building common knowledge and thinking about alignment applications are notably bigger.
Most difficulties you raise here could imo change drastically with tens of billions being injected into AI safety, especially thanks to new ideas coming out of left field that might make safety cases way more efficient. (I’m probably more optimistic about new ideas than you, partly because “it always subjectively feels like there are no big ideas left”, and AI safety is so young.)
If your government picks you as a champion and gives you amazing resources, you no longer have to worry about national competition, and that amount seems doable. You still have to worry about international competition, but will you feel so closely tied that you can’t even spare that much? My guess would be no. That said, I still don’t expect certain lab leaders to want to do this.
The same is not true of security though, that’s a tough one.
See our recent work (especially section on backdoors) which opens the door to directly asking the model. Although there are obstacles like Reversal Curse and it’s unclear if it can be made to scale.
I have two main problems with t-AGI:
A third one is a definitional problem exacerbated by test-time compute: What does it mean for an AI to succeed at task T (which takes humans X hours)? Maybe it only succeeds when an obscene amount of test-time compute is poured in. It seems unavoidable to define things in terms of resources as you do.
Very cool! But I think there’s a crisper way to communicate the central point of this piece (or at least, a way that would have been more immediately transparent to me). Here it is:
Say you are going to use Process X to obtain a new Model. Process X can be as simple as “pre-train on this dataset”, or as complex as “use a bureaucracy of Model A to train a new LLM, then have Model B test it, then have Model C scaffold it into a control protocol, then have Model D produce some written arguments for the scaffold being safe, have a human read them, and if they reject delete everything”. Whatever Process X is, you have only two ways to obtain evidence that Process X has a particular property (like “safety”): looking a priori at the spec of Process X (without running it), or running (parts of) Process X and observing its outputs a posteriori. In the former case, you clearly need an argument for why this particular spec has the property. But in the latter case, you also need an argument for why observing those particular outputs ensures the property for this particular spec. (Pedantically speaking, this is just Kuhn’s theory-ladenness of observations.)
Of course, the above reasoning doesn’t rule out the possibility that the required arguments are pretty trivial to make. That’s why you summarize some well-known complications of automation, showing that the argument will not be trivial when Process X contains a lot of automation, and in fact it’d be simpler if we could do away with the automation.
It is also the case that the outputs observed from Process X might themselves be human-readable arguments. While this could indeed alleviate the burden of human argument-generation, we still need a previous (possibly simpler) argument for why “a human accepting those output arguments” actually ensures the property (especially given those arguments could be highly out-of-distribution for the human).
My understanding from discussions with the authors (but please correct me):
This post is less about pragmatically analyzing which particular heuristics work best for ideal or non-ideal agents in common environments (assuming a background conception of normativity), and more about the philosophical underpinnings of normativity itself.
Maybe it’s easiest if I explain what this post grows out of:
There seems to be a widespread vibe amongst rationalists that “one-boxing in Newcomb is objectively better, because you simply obtain more money, that is, you simply win”. This vibe is no coincidence, since Eliezer and Nate, in some of their writing about FDT, use language strongly implying that decision theory A is objectively better than decision theory B because it just wins more. Unfortunately, this intuitive notion of winning cannot actually be made into a philosophically valid objective metric. (In more detail, a precise definition of winning is already decision-theory-complete, so these arguments beg the question.) This point is well-known in philosophical academia, and was already succinctly explained in a post by Caspar (which the authors mention).
In the current post, the authors extend a similar philosophical critique to other widespread uses of winning, or background assumptions about rationality. For example, some people say that “winning is about not playing dominated strategies”… and the authors agree about avoiding dominated strategies, but point out that this is not too action-guiding, because it is consistent with many policies. Or also, some people say that “rationality is about implementing the heuristics that have worked well in the past, and/or you think will lead to good future performance”… but these utterances hide other philosophical assumptions, like assuming the same mechanisms are at play in the past and future, which are especially tenuous for big problems like x-risk. Thus, vague references to winning aren’t enough to completely pin down and justify behavior. Instead, we fundamentally need additional constraints or principles about normativity, what the authors call non-pragmatic principles. Of course, these principles cannot themselves be justified in terms of past performance (which would lead to circularity), so they instead need to be taken as normative axioms (just like we need ethical axioms, because ought cannot be derived from is).
GEB
Like Andrew, I don’t see strong reasons to believe that near-term loss-of-control accounts for more x-risk than medium-term multi-polar “going out with a whimper”. This is partly due to thinking oversight of near-term AI might be technically easy. I think Andrew also thought along those lines: an intelligence explosion is possible, but relatively easy to prevent if people are scared enough, and they probably will be. Although I do have lower probabilities than him, and some different views on AI conflict. Interested in your take @Daniel Kokotajlo
You know that old thing where people solipsistically optimizing for hedonism are actually less happy? (relative to people who have a more long-term goal related to the external world) You know, “Whoever seeks God always finds happiness, but whoever seeks happiness doesn’t always find God”.
My anecdotal experience says this is very true. But why?
One explanation could be in the direction of what Eliezer says here (inadvertently rewarding your brain for suboptimal behavior will get you depressed):
Someone with a goal has an easier time getting out of local minima, because it is very obvious those local minima are suboptimal for the goal. For example, you get out of bed even when the bed feels nice. Whenever the occasional micro-breakdown happens (like feeling a bit down), you power through for your goal anyway (micro-dosing suffering as a consequence), so your brain learns that micro-breakdowns only ever lead to bad immediate sensations and fixes them fast.
Someone whose only objective is the satisfaction of their own appetites and desires has a harder time reasoning themselves out of local optima. Sure, getting out of bed allows me to do stuff that I like. But those feel distant now, and the bed now feels comparably nice… You are now comparing apples to apples (unlike someone with an external goal), and sometimes you might choose the local optimum. When the occasional micro-breakdown happens, you are more willing to try to soften the blow and take care of the present sensation (instead of getting over the bump quickly), which rewards in the wrong direction.
Another possibly related dynamic: When your objective is satisfying your desires, you pay more conscious attention to your desires, and this probably creates more desires, leading to more unsatisfied desires (which is way more important than the amount of satisfied desires?).
hahah yeah but the only point here is: it’s easier to credibly commit to a threat if executing the threat is cheap for you. And this is simply not too interesting a decision-theoretic point, just one more obvious pragmatic consideration to throw into the bag. The story even makes it sound like “Vader will always be in a better position”, or “it’s obvious that Leia shouldn’t give in to Tarkin but should give in to Vader”, and that’s not true. Even though Tarkin loses more from executing the threat than Vader, the only thing that matters for Leia is how credible the threat is. So if Tarkin had any additional way to make his commitment credible (like program the computer to destroy Alderaan if the base location is not revealed), then there would be no difference between Tarkin and Vader. The fact that “Tarkin might constantly reconsider his decision even after claiming to commit” seems like a contingent state of affairs of human brains (or certain human brains in certain situations), not something important in the grander scheme of decision theory.
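The point that only credibility matters for Leia can be sketched as a tiny expected-loss comparison. Everything here is an illustrative assumption rather than something taken from the story: the `credibility` model (a binding precommitment makes execution certain, otherwise the threatener is likelier to back down the more execution costs them), the payoff numbers, and the simplification that giving in fully averts the threat.

```python
def credibility(cost_to_threatener: float, precommitted: bool) -> float:
    """Hypothetical model: probability that the threat is actually executed.

    A binding precommitment makes execution certain. Otherwise we assume the
    threatener is more likely to back down the more execution costs them.
    """
    if precommitted:
        return 1.0
    return 1.0 / (1.0 + cost_to_threatener)


def leia_gives_in(p_executed: float, loss_if_executed: float,
                  cost_of_revealing: float) -> bool:
    """Leia reveals the base iff refusing has the larger expected loss.

    Note that the threatener's own cost does not appear here: it only
    matters through p_executed. Assumes giving in averts the threat.
    """
    return p_executed * loss_if_executed > cost_of_revealing


LOSS_ALDERAAN = 100.0  # assumed loss to Leia if the threat is executed
COST_REVEAL = 50.0     # assumed loss to Leia from revealing the base

vader = credibility(cost_to_threatener=0.0, precommitted=False)
tarkin = credibility(cost_to_threatener=10.0, precommitted=False)
tarkin_committed = credibility(cost_to_threatener=10.0, precommitted=True)

for name, p in [("Vader", vader), ("Tarkin", tarkin),
                ("Tarkin (precommitted)", tarkin_committed)]:
    print(name, round(p, 3), leia_gives_in(p, LOSS_ALDERAAN, COST_REVEAL))
```

Under these made-up numbers, Leia gives in to Vader and not to an uncommitted Tarkin, but once Tarkin has a commitment device his threat is exactly as credible as Vader's and her decision is the same.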
The only decision-theoretic points that I could see this story making are pretty boring, at least to me.
That is: in this case at least it seems like there’s concrete reason to believe we can have some cake and eat some too.
I disagree with this framing. Sure, if you have 5 different cakes, you can eat some and have some. But for any particular cake, you can’t do both. Similarly, if you face 5 (or infinitely many) identical decision problems, you can choose to be updateful in some of them (thus obtaining useful Value of Information, that increases your utility in some worlds), and updateless in others (thus obtaining useful strategic coherence, that increases your utility in other worlds). The fundamental dichotomy remains as sharp, and it’s misleading to imply we can surmount it. It’s great to discuss, given this dichotomy, which trade-offs we humans are more comfortable making. But I’ve felt this was obscured in many relevant conversations.
This content-work seems primarily aimed at discovering and navigating actual problems similar to the decision-theoretic examples I’m using in my arguments. I’m more interested in gaining insights about what sorts of AI designs humans should implement. IE, the specific decision problem I’m interested in doing work to help navigate is the tiling problem.
My point is that the theoretical work you are shooting for is so general that it’s closer to “what sorts of AI designs (priors and decision theories) should always be implemented”, rather than “what sorts of AI designs should humans in particular, in this particular environment, implement”.
And I think we won’t gain insights on the former, because there are no general solutions, due to fundamental trade-offs (“no-free-lunches”).
I think we could gain many insights on the latter, but that the methods better fit for that are less formal/theoretical and way messier/“eye-balling”/iterating.
Excellent explanation, congratulations! Sad I’ll have to miss the discussion.
Interlocutor: Neither option is plausible. If you update, you’re not dynamically consistent, and you face an incentive to modify into updatelessness. If you bound cross-branch entanglements in the prior, you need to explain why reality itself also bounds such entanglements, or else you’re simply advising people to be delusional.
You found yourself a very nice interlocutor. I think we truly cannot have our cake and eat it: either you update, making you susceptible to infohazards=traps (if they exist, and they might exist), or you don’t, making you entrenched forever. I think we need to stop dancing around this fact, recognize that a fully-general solution in the formalism is not possible, and instead look into the details of our particular case. Sure, our environment might be adversarially bad, traps might be everywhere. But under this uncertainty, which ways do we think are best to recognize and prevent traps (while updating on other things). This is kind of studying and predicting generalization: given my past observations, where do I think I will suddenly fall out of distribution (into a trap)?
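The dynamic inconsistency the interlocutor points at can be made concrete with the standard counterfactual mugging toy problem. This is only a sketch under assumed payoffs (`REWARD`, `PAYMENT` and the fair coin are hypothetical numbers chosen for illustration): on heads, the predictor pays you only if you are the kind of agent who would pay on tails; on tails, you are asked to pay.

```python
P_HEADS = 0.5     # assumed fair coin
REWARD = 10_000   # assumed payout on heads, given only to agents who would pay on tails
PAYMENT = 100     # assumed amount requested on tails


def ex_ante_value(pays_on_tails: bool) -> float:
    """Expected value of a policy, evaluated before the coin is seen."""
    heads_payoff = REWARD if pays_on_tails else 0
    tails_payoff = -PAYMENT if pays_on_tails else 0
    return P_HEADS * heads_payoff + (1 - P_HEADS) * tails_payoff


def ex_post_value_after_tails(pays: bool) -> float:
    """Value of the action after updating on having seen tails."""
    return -PAYMENT if pays else 0


print("ex ante, pay on tails:   ", ex_ante_value(True))
print("ex ante, refuse on tails:", ex_ante_value(False))
print("ex post (tails), pay:    ", ex_post_value_after_tails(True))
print("ex post (tails), refuse: ", ex_post_value_after_tails(False))
```

Before the coin flip the paying policy looks better, while after updating on tails refusing looks better, which is the incentive to self-modify into updatelessness. The sketch says nothing about whether the prior it relies on is trustworthy, which is where the worry about traps comes in.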
Me: I’m not sure if that’s exactly the condition, but at least it motivates the idea that there’s some condition differentiating when we should be updateful vs updateless. I think uncertainty about “our own beliefs” is subtly wrong; it seems more like uncertainty about which beliefs we endorse.
This was very thought-provoking, but unfortunately I still think this crashes head-on with the realization that, a priori and in full generality, we can’t differentiate between safe and unsafe updates. Indeed, why would we expect that no one will punish us by updating on “our own beliefs” or “which beliefs I endorse”? After all, that’s just one more part of reality (without a clear boundary separating it).
It sounds like you are correctly explaining that our choice of prior will be, in some important sense, arbitrary: we can’t know the correct one in advance, we always have to rely on extrapolating contingent past observations.
But then, it seems like your reaction is still hoping that we can have our cake and eat it: “I will remain uncertain about which beliefs I endorse, and only later will I update on the fact that I am in this or that reality. If I’m in the Infinite Counterlogical Mugging… then I will just eventually change my prior because I noticed I’m in the bad world!”. But then again, why would we think this update is safe? That’s just not being updateless, and losing out on the strategic gains from not updating.
Since a solution doesn’t exist in full generality, I think we should pivot to more concrete work related to the “content” (our particular human priors and our particular environment) instead of the “formalism”. For example:
Conceptual or empirical work on which are the robust and safe ways to extract information from humans (Suddenly LLM pre-training becomes safety work)
Conceptual or empirical work on which actions or reasoning are more likely to unearth traps under different assumptions (although this work could unearth traps)
Compilation or observation of properties of our environment (our physical reality) that could have some weak signal on which kinds of moves are safe
Unavoidably, this will involve some philosophical / almost-ethical reflection about which worlds we care about and which ones we are willing to give up.
I think Nesov had some similar idea about “agents deferring to a (logically) far-away algorithm-contract Z to avoid miscoordination”, although I never understood it completely, nor think that idea can solve miscoordination in the abstract (only, possibly, be a nice pragmatic way to bootstrap coordination from agents who are already sufficiently nice).
EDIT 2: UDT is usually prone to commitment races because it thinks of each agent in a conflict as separately making commitments earlier in logical time. But focusing on symmetric commitments gets rid of this problem.
Hate to always be that guy, but if you are assuming all agents will only engage in symmetric commitments, then you are assuming commitment races away. In actuality, it is possible for a (meta-) commitment race to happen about “whether I only engage in symmetric commitments”.
I don’t understand your point here, explain?
Say there are 5 different veils of ignorance (priors) that most minds consider Schelling (you could try to argue there will be exactly one, but I don’t see why).
If everyone simply accepted exactly the same one, then yes, lots of nice things would happen and you wouldn’t get catastrophically inefficient conflict.
But every one of these 5 priors will have different outcomes when it is implemented by everyone. For example, maybe in prior 3 agent A is slightly better off and agent B is slightly worse off.
So you need to give me a reason why a commitment race doesn’t recur at the level of “choosing which of the 5 priors everyone should implement”. That is, maybe A will make a very early commitment to only ever implement prior 3. As always, this is rational if A thinks the others will react a certain way (give in to the threat and implement 3). And I don’t have a reason to expect agents not to have such priors (although I agree they are slightly less likely than more common-sensical priors).
That is, as always, the commitment races problem doesn’t have a general solution on paper. You need to get into the details of our multi-verse and our agents to argue that they won’t have these crazy priors and will coordinate well.
This seems to be claiming that in some multiverses, the gains to powerful agents from being hawkish outweigh the losses to weak agents. But then why is this a problem? It just seems like the optimal outcome.
It seems likely that in our universe there are some agents with arbitrarily high gains-from-being-hawkish, that don’t have correspondingly arbitrarily low measure. (This is related to Pascalian reasoning, see Daniel’s sequence.) For example, someone whose utility is exponential in the number of paperclips. I don’t agree that the optimal outcome (according to my ethics) is for me (whose utility is at most linear in happy people) to turn all my resources into paperclips.
Maybe if I was a preference utilitarian biting enough bullets, this would be the case. But I just want happy people.
Nice!
Proposal 4: same as proposal 3 but each agent also obeys commitments that they would have made from behind a veil of ignorance where they didn’t yet know who they were or what their values were. From that position, they wouldn’t have wanted to do future destructive commitment races.
I don’t think this solves Commitment Races in general, because of two different considerations:
Trivially, I can say that you still have the problem when everyone needs to bootstrap a Schelling veil of ignorance.
Less trivially, even behind the most simple/Schelling veils of ignorance, I find it likely that hawkish commitments are incentivized. For example, the veil might say that you might be Powerful agent A, or Weak agent B, and if some Powerful agents have weird enough utilities (and this seems likely in a big pool of agents), hawkishly committing in case you are A will be a net-positive bet.
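The second consideration can be sketched as a one-line expected-value bet. This is a simplification under loudly stated assumptions: the veil is modeled as a probability `p_powerful` of turning out to be Powerful agent A, the evaluator simply weighs A's gain against B's loss in a common unit, and all numbers are made up.

```python
def hawkish_bet_value(p_powerful: float, gain_if_powerful: float,
                      loss_if_weak: float) -> float:
    """Expected value, from behind the veil, of committing hawkishly if you are A.

    Assumes the commitment yields gain_if_powerful when you turn out to be A
    and costs loss_if_weak when you turn out to be B.
    """
    return p_powerful * gain_if_powerful - (1 - p_powerful) * loss_if_weak


# Ordinary utilities: the hawkish commitment is a losing bet.
print(hawkish_bet_value(p_powerful=0.1, gain_if_powerful=50.0, loss_if_weak=10.0))

# "Weird enough" utility for the Powerful agent: the same commitment now pays.
print(hawkish_bet_value(p_powerful=0.1, gain_if_powerful=1000.0, loss_if_weak=10.0))
```

In this toy model the bet turns positive once `gain_if_powerful` exceeds `(1 - p_powerful) / p_powerful * loss_if_weak`, so a small measure of agents with extreme gains is enough to tip it.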
This might still mostly solve Commitment Races in our particular multi-verse. I have intuitions both for and against this bootstrapping being possible. I’d be interested to hear yours.
Just writing a model that came to mind, partly inspired by Ryan here.
Extremely good single-single alignment should be highly analogous to “current humans becoming smarter and faster thinkers”.
If this happens at roughly the same speed for all humans, then each human is better equipped to increase their utility, but does this lead to a higher global utility? This can be seen as a race between the capability (of individual humans) to find and establish better coordination measures, and their capability to selfishly subvert these coordination measures. I do think it’s more likely than not that the former wins, but it’s not guaranteed.
Probably someone like Ryan believes most of those failures will come in the form of explicit conflict or sudden attacks. I can also imagine slower erosions of global utility, for example by safe interfaces/defenses between humans becoming unworkable slop into which most resources go.
If this doesn’t happen at roughly the same speed for all humans, you also get power imbalance and its consequences. One could argue that differences in resources between humans will augment, in which case this is the only stable state.
If instead of perfect single-single alignment we get the partial (or more taxing) fix I expect, the situation degrades further. Extending the analogy, this would be the smart humans sometimes being possessed by spirits with different utilities, which not only has direct negative consequences but could also complicate coordination once it’s common knowledge.