Doing AI Safety research for ethical reasons.
My webpage.
Leave me anonymous feedback.
I operate by Crocker’s Rules.
Doing AI Safety research for ethical reasons.
My webpage.
Leave me anonymous feedback.
I operate by Crocker’s Rules.
Some thoughts skimming this post generated:
If a catastrophe happens, then either:
It happened so discontinuously that we couldn’t avoid it even with our concentrated effort
It happened slowly but for some reason we didn’t make a concentrated effort. This could be because:
We didn’t notice it (e.g. intelligence explosion inside lab)
We couldn’t coordinate a concentrated effort, even if we all individually would want it to exist (e.g. no way to ensure China isn’t racing faster)
We didn’t act individually rationally (e.g. Trump doesn’t listen to advisors / Trump brainwashed by AI)
1 seems unlikelier by the day.
2a is mostly transparency inside labs (and less importantly into economic developments), which is important but at least some people are thinking about it.
There’s a lot to think through in 2b and 2c. It might be critical to ensure early takeoff improves them, rather than degrading them (missing any drastic action to the contrary) until late takeoff can land the final blow.
If we assume enough hierarchical power structures, the situation simplifies into “what 5 world leaders do”, and then it’s pretty clear you mostly want communication channels and trustless agreements for 2b, and improving national decision-making for 2c.
Maybe what I’d be most excited to see from the “systemic risk” crowd is detailed thinking and exemplification on how assuming enough hierarchical power structures is wrong (that is, the outcome depends strongly on things other than what those 5 world leaders do), what are the most x-risk-worrisome additional dynamics in that area, and how to intervene on them.
(Maybe all this is still too abstract, but it cleared my head)
No, the utility here is just the amount of money gets
I meant that it sounded like you “wanted a better average score (over as) when you are randomly sampled as b than other programs”. Although again I think the intuition-pumping is misleading here because the programmer is choosing which b to fix, but not which a to fix. So whether you wanna one-box only depends on whether you condition on a = b.
(Just skimmed, also congrats on the work)
Why is this surprising? You’re basically assuming that there is no correlation between what program Omega predicts and what program you actually are. That is, Omega is no predictor at all! Thus, obviously you two-box, because one-boxing would have no effect on what Omega predicts. (Or maybe the right way to think about this is: it will have a tiny but non-zero effect, because you are one of the |P| programs, but since |P| is huge, that is ~0.)
When instead you condition on a = b, this becomes a different problem: Omega is now a perfect predictor! So you obviously one-box.
Another way to frame this is whether you optimize through the inner program or assume it fixed. From your prior, Omega samples randomly, so you obviously don’t want to optimize through it. Not only because |P| is huge, but actually more importantly, because for any policy you might want to implement, there exists a policy that implements the exact opposite (and might be sampled just as likely as you), thus your effect nets out to exactly 0. But once you update on (condition on) seeing that actually Omega sampled exactly you (instead of your random twin, or any other old program), then you’d want to optimize through a! But, you can’t have your cake and eat it too. You cannot shape yourself to reap that reward (in the world where Omega samples exactly you), without also shaping yourself to give up some reward (relative to other players) in the world where Omega samples your evil twin.
(Thus you might want to say: Aha! I will make my program behave differently depending on whether it is facing itself or an evil twin! But then… I can also create an evil twin relative to that more complex program. And we keep on like this forever. That is, you cannot actually, fully, ever condition your evil twin away, because you are embedded in Omega’s distribution. I think this is structurally equivalent to some commitment race dynamics I discussed with James, but I won’t get into that here.)
Secretly, I think this duality only feels counter-intuitive because it’s an instance of dynamic inconsistency, = you want to take different actions once you update on information, = the globally (from the uncorrelated prior) optimal action is not always the same as the locally optimal action (from a particular posterior, like the one assuming correlation). Relatedly, I think the only reason your Universal framing differs from your Functional and Anthropic framings is exactly that they are (implicitly) using these two different distributions (one without correlation, the other with):
The Universal framing assumes that Omega samples randomly (no correlation).
The Functional framing assumes that “you have control over the inner program”. But this is sneaking something in. You obviously have control over your own program. But you don’t actually have any control over “the program that Omega will randomly sample from P” (because anything you do is cancelled by your evil twin). Thus, assuming you have control over the inner program, is equivalent to assuming that Omega sampled you. That is, yes correlation.
The Anthropic framing is also implicitly assuming “Omega samples you”, although it’s a bit more complicated since it also depends on your utility function (how much you care about different programs getting different amounts of money):
If your distribution is truly “Omega samples programs at random”, then when you observe two equal numbers, you are almost certain to be in a simulation. Given that, if you for example care equally about all programs getting as much money as possible, then of course you should one-box. That will entail that each random program (with a tiny probability) gets a million, which is a huge win. But the intuition that you were expressing in Question 2 (“p2 is better than p1 because it scores better”) isn’t compatible with “caring equally about all programs”. Instead, it sounds as if you positively want to score better than other programs, that is, maximize your score and minimize theirs! If that’s the case, then you obviously should two-box, since almost certainly you are subtracting a million from another program, not yourself. Even assuming, for simplicity, that you only care about beating one other program (as in the p1 and p2 example), you should two-box, because you are subtracting the same million dollars from both, but you are gaining a very slight edge with the thousand dollars.
If your distribution is instead “a = b” (assumes correlation), then, regardless of whether you want to maximize everyone’s payoff or you want to beat another program that could be sampled, you want to one-box, since the million dollar benefit is coming straight to you, and is bigger than the thousand dollar benefit.
It’s unclear what the optimal amount of thinking per step is. My initial guess would have been that letting Claude think for a whole paragraph before each single action (rather than only each 10 actions, or whenever it’s in a match, or whatever) scores slightly better than letting it think more (sequentially). But I guess this might work better if it’s what the streamer is using after some iteration.
The story for parallel checks could be different though. My guess would be going all out and letting Claude generate the paragraph 5 times and then generate 5 more parallel paragraphs about whether it has gotten something wrong, and then having a lower-context version of Claude decide whether there are any important disagreements, and if not just majority-vote, would improve robustness problems (like “I close a goal before actually achieving it”). But maybe this adds too much bloat and opportunities for mistakes, or makes some mistakes better but others way worse.
I don’t see how Take 4 is anything other than simplicity (in the human/computational language). As you say, it’s a priori unclear whether a an agent is an instance of a human or the other way around. You say the important bit is that you are subtracting properties from a human to get an agent. But how shall we define subtraction here? In one formal language, the definition of human will indeed be a superset of that of agent… but in another one it will not. So you need to choose a language. And the natural way forward every time this comes up (many times), is to just “weigh by Turing computations in the real world” (instead of choosing a different and odd-looking Universal Turing Machine), that is, a simplicity prior.
Imo rationalists tend to underestimate the arbitrariness involved in choosing a CEV procedure (= moral deliberation in full generality).
Like you, I endorse the step of “scoping the reference class” (along with a thousand other preliminary steps). Preemptively fixing it in place helps you to the extent that the humans wouldn’t have done it by default. But if the CEV procedure is governed by a group of humans so selfish/unthoughtful as to not even converge on that by themselves, then I’m sure that there’ll be at lesat a few hundred other aspects (both more and less subtle than this one) that you and me obviously endorse, but they will not implement, and will drastically affect the outcome of the whole procedure.
In fact, it seems strikingly plausible that even among EAs, the outcome could depend drastically on seemingly-arbitrary starting conditions (like “whether we use deliberation-and-distillation procedure #194 or #635, which differ in some details”). And “drastically” means that, even though both outcomes still look somewhat kindness-shaped and friendly-shaped, one’s optimum is worth <10% to the other’s utility (or maybe, this holds for the scope-sensitive parts of their morals, since the scope-insensitive ones are trivial to satisfy).
To pump related intuitions about how difficult and arbitrary moral deliberation can get, I like Demski here.
I’m sure some of people’s ignorance of these threat models comes from the reasons. But my intuition is that most of it comes from “these are vaguer threat models that seem very up in the air, and other ones seem more obviously real and more shovel-ready” (this is similar to your “Flinch”, but I think more conscious and endorsed).
Thus, I think the best way to converge on whether these threat models are real/likely/actionable is to work through as-detailed-as-possible example trajectories. Someone objects that the state will handle it? Let’s actually think through how the state might look like in 5 years! Someone objects that democracy will prevent it? Let’s actually think through the actual consequences of cheap cognitive labor in democracy!
This is analogous to what pessimists about single-single alignment have gone through. They have some abstract arguments, people don’t buy them, so they start working through them in more detail or provide example failures. I buy some parts of them, but not others. And if you did the same for this threat model, I’m uncertain how much I’d buy!
Of course, the paper might have been your way of doing that. I enjoyed it, but still would have preferred more fully detailed examples, on top of the abstract arguments. You do use examples (both past and hypothetical), but they are more like “small, local examples that embody one of the abstract arguments”, rather than “an ambitious (if incomplete) and partly arbitrary picture of how these abstract arguments might actually pan out in practice”. And I would like to know the messy details of how you envision these abstract arguments coming into contact with reality. This is why I liked TASRA, and indeed I was more looking forward to an expanded, updated and more detailed version of TASRA.
Just writing a model that came to mind, partly inspired by Ryan here.
Extremely good single-single alignment should be highly analogous to “current humans becoming smarter and faster thinkers”.
If this happens at roughly the same speed for all humans, then each human is better equipped to increase their utility, but does this lead to a higher global utility? This can be seen as a race between the capability (of individual humans) to find and establish better coordination measures, and their capability to selfishly subvert these coordination measures. I do think it’s more likely than not that the former wins, but it’s not guaranteed.
Probably someone like Ryan believes most of those failures will come in the form of explicit conflict or sudden attacks. I can also imagine slower erosions of global utility, for example by safe interfaces/defenses between humans becoming unworkable slop into which most resources go.
If this doesn’t happen at roughly the same speed for all humans, you also get power imbalance and its consequences. One could argue that differences in resources between humans will augment, in which case this is the only stable state.
If instead of perfect single-single alignment we get the partial (or more taxing) fix I expect, the situation degrades further. Extending the analogy, this would be the smart humans sometimes being possessed by spirits with different utilities, which not only has direct negative consequences but could also complicate coordination once it’s common knowledge.
Fantastic snapshot. I wonder (and worry) whether we’ll look back on it with similar feelings as those we have for What 2026 looks like now.
There is also no “last resort war plan” in which the president could break all of the unstable coordination failures and steer the ship.
[...]
There are no clear plans for what to do under most conditions, e.g. there is no clear plan for when and how the military should assume control over this technology.
These sound intuitively unlikely to me, by analogy to nuclear or bio. Of course, that is not to say these protocols will be sufficient or even sane, by analogy to nuclear or bio.
This makes it really unclear what to work on.
It’s not super obvious to me that there won’t be clever ways to change local incentives / improve coordination, and successful interventions in this direction would seem incredibly high-leveraged, since they’re upstream of many of the messy and decentralized failure modes. If they do exist, they probably look not like “a simple cooridnation mechanism”, and more like “a particular actor gradually steering high-stakes conversations (through a sequence of clever actions) to bootstrap minimal agreements”. Of course, similarity to past geopolitical situations does make it seem unlikely on priors.
There is no time to get to very low-risk worlds anymore. There is only space for risk reduction along the way.
My gut has been in agreement for some time that the most cost-effective x-risk reduction now probably looks like this.
I agree with conjunctiveness, although again more optimistic about huge improvements. I mostly wanted to emphasize that I’m not sure there are structurally robust reasons (as opposed to personal whims) why huge spendings on safety won’t happen
Speaking for myself (not my coauthors), I don’t agree with your two items, because:
if your models are good enough at code analysis to increase their insecurity self-awareness, you can use them in other more standard and efficient ways to improve the dataset
doing self-critique the usual way (look over your own output) seems much more fine-grained and thus efficient than asking the model whether it “generally uses too many try-excepts”
More generally, I think behavioral self-awareness for capability evaluation is and will remain strictly worse than the obvious capability evaluation techniques.
That said, I do agree systematic inclusion of considerations about negative externalities should be a norm, and thus we should have done so. I will shortly say now that a) behavioral self-awareness seems differentially more relevant to alignment than capabilities, and b) we expected lab employees to find out about this themselves (in part because this isn’t surprising given out-of-context reasoning), and we in fact know that several lab employees did. Thus I’m pretty certain the positive externalities of building common knowledge and thinking about alignment applications are notably bigger.
Most difficulties you raise here could imo change drastically with tens of billions being injected into AI safety, especially thanks to new ideas coming out of left field that might make safety cases way more efficient. (I’m probably more optimistic about new ideas than you, partly because “it always subjectively feels like there are no big ideas left”, and AI safety is so young.)
If your government picks you as a champion and gives you amazing resources, you no longer have to worry about national competition, and that amount seems doable. You still have to worry about international competition, but will you feel so closely tied that you can’t even spare that much? My guess would be no. That said, I still don’t expect certain lab leaders to want to do this.
The same is not true of security though, that’s a tough one.
See our recent work (especially section on backdoors) which opens the door to directly asking the model. Although there are obstacles like Reversal Curse and it’s unclear if it can be made to scale.
I have two main problems with t-AGI:
A third one is a definitory problem exacerbated by test-time compute: What does it mean for an AI to succeed at task T (which takes humans X hours)? Maybe it only succeeds when an obscene amount of test-time compute is poured. It seems unavoidable to define things in terms of resources as you do
Very cool! But I think there’s a crisper way to communicate the central point of this piece (or at least, a way that would have been more immediately transparent to me). Here it is:
Say you are going to use Process X to obtain a new Model. Process X can be as simple as “pre-train on this dataset”, or as complex as “use a bureaucracy of Model A to train a new LLM, then have Model B test it, then have Model C scaffold it into a control protocol, then have Model D produce some written arguments for the scaffold being safe, have a human read them, and if they reject delete everything”. Whatever Process X is, you have only two ways to obtain evidence that Process X has a particular property (like “safety”): looking a priori at the spec of Process X (without running it), or running (parts of) Process X and observing its outputs a posteriori. In the former case, you clearly need an argument for why this particular spec has the property. But in the latter case, you also need an argument for why observing those particular outputs ensures the property for this particular spec. (Pedantically speaking, this is just Kuhn’s theory-ladenness of observations.)
Of course, the above reasoning doesn’t rule out the possibility that the required arguments are pretty trivial to make. That’s why you summarize some well-known complications of automation, showing that the argument will not be trivial when Process X contains a lot of automation, and in fact it’d be simpler if we could do away with the automation.
It is also the case that the outputs observed from Process X might themselves be human-readable arguments. While this could indeed alleviate the burden of human argument-generation, we still need a previous (possibly simpler) argument for why “a human accepting those output arguments” actually ensures the property (especially given those arguments could be highly out-of-distribution for the human).
My understanding from discussions with the authors (but please correct me):
This post is less about pragmatically analyzing which particular heuristics work best for ideal or non-ideal agents in common environments (assuming a background conception of normativity), and more about the philosophical underpinnings of normativity itself.
Maybe it’s easiest if I explain what this post grows out of:
There seems to be a widespread vibe amongst rationalists that “one-boxing in Newcomb is objectively better, because you simply obtain more money, that is, you simply win”. This vibe is no coincidence, since Eliezer and Nate, in some of their writing about FDT, use language strongly implying that decision theory A is objectively better than decision theory B because it just wins more. Unfortunately, this intuitive notion of winning cannot actually be made into a philosophically valid objective metric. (In more detail, a precise definition of winning is already decision-theory-complete, so these arguments beg the question.) This point is well-known in philosophical academia, and was already succinctly explained in a post by Caspar (which the authors mention).
In the current post, the authors extend a similar philosophical critique to other widespread uses of winning, or background assumptions about rationality. For example, some people say that “winning is about not playing dominated strategies”… and the authors agree about avoiding dominated strategies, but point out that this is not too action-guiding, because it is consistent with many policies. Or also, some people say that “rationality is about implementing the heuristics that have worked well in the past, and/or you think will lead to good future performance”… but these utterances hide other philosophical assumptions, like assuming the same mechanisms are at play in the past and future, which are especially tenuous for big problems like x-risk. Thus, vague references to winning aren’t enough to completely pin down and justify behavior. Instead, we fundamentally need additional constraints or principles about normativity, what the authors call non-pragmatic principles. Of course, these principles cannot themselves be justified in terms of past performance (which would lead to circularity), so they instead need to be taken as normative axioms (just like we need ethical axioms, because ought cannot be derived from is).
GEB
This post is reminiscent of this old one from Daniel