Previously “Lanrian” on here. Research analyst at Open Philanthropy. Views are my own.
Lukas Finnveden
Example 2: Joaquín “El Chapo” Guzmán. He ran a drug empire while being imprisoned. Tell this to anyone who still believes that “boxing” a superintelligent AI is a good idea.
I think the relevant quote is: “While he was in prison, Guzmán’s drug empire and cartel continued to operate unabated, run by his brother, Arturo Guzmán Loera, known as El Pollo, with Guzmán himself still considered a major international drug trafficker by Mexico and the U.S. even while he was behind bars. Associates brought him suitcases of cash to bribe prison workers and allow the drug lord to maintain his opulent lifestyle even in prison, with prison guards acting like his servants”
This seems to indicate less “running things” than I initially thought the post was claiming. It’s impressive that the drug empire stayed loyal to him even while he was in prison, though.
Example 5: Chris Voss, an FBI negotiator. This is a much less well-known example, I learned it from o3, actually. Chris Voss has convinced two armed bank robbers to surrender (this isn’t the only example in his career, of course) while only using a phone, no face-to-face interactions, so no opportunities to read facial expressions.
My (pretty uninformed) impression is that it’s often rational for US hostage takers to surrender without violence, if they’re fully surrounded, because US police have a policy of not allowing them to trade hostages for escape, and violence will risk their own death and longer sentences. (Though maybe it’s best to first negotiate for a reduced sentence?) If that’s true, this is probably an example of someone convincing some pretty scary and unpredictable individuals to do the thing that’s in their own best interest, despite starting out in an adversarial situation, and while only talking over the phone. Impressive, to be sure, but it wouldn’t feel very surprising that we have recorded examples of this even if persuasion ability plateaus pretty hard at some point.
This looks great.
Random thought: I wonder how iterating the noise & distill steps of UNDO (each round with small alpha) compares against doing one noise with big alpha and then one distill session. (If we hold compute fixed.)
Couldn’t find any experiments on this when skimming through the paper, but let me know if I missed it.
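To make the comparison concrete, here’s a toy numerical sketch of the two schedules I have in mind (this is not the paper’s actual procedure: “noise” just interpolates the weights a fraction alpha of the way toward random values, “distill” just nudges the student toward the teacher, and all the numbers are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

def noise(weights, alpha):
    # Interpolate a fraction alpha of the way toward random weights.
    return (1 - alpha) * weights + alpha * rng.normal(size=weights.shape)

def distill(student, teacher, steps, lr=0.05):
    # Stand-in for distillation: nudge the student toward the teacher.
    for _ in range(steps):
        student = student + lr * (teacher - student)
    return student

teacher = rng.normal(size=1000)   # the model we distill from
total_steps = 200                 # fixed "compute" budget for distillation

# One big noise injection, then a single distillation run.
one_shot = distill(noise(teacher, alpha=0.8), teacher, total_steps)

# Four rounds of small noise + short distillation, same total compute.
iterated = teacher
for _ in range(4):
    iterated = distill(noise(iterated, alpha=0.2), teacher, total_steps // 4)

print(np.linalg.norm(one_shot - teacher), np.linalg.norm(iterated - teacher))
```

Obviously the interesting question is what happens with real models, where distillation isn’t this linear; this is just to pin down what “hold compute fixed” means in the comparison.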
I weakly expect that this story is describing AI that intervenes this way for fairly myopic goals, like myopic instrumental self-preservation, which have the effect of taking long-term power. E.g. the AI wouldn’t really care to set up a system that would lock in the AI’s power in 10 years, but give it no power before then.
Hm, I do agree that seeking short-term power to achieve short-term goals can lead to long-term power as a side effect. So I guess that is one way in which an AI could seize long-term power without being a behavioral schemer. (And it’s ambiguous which one it is in the story.)
I’d have to think more to tell whether “long-term power seeking” in particular is uniquely concerning and separable from “short-term power-seeking with the side-effect of getting long-term power” such that it’s often useful to refer specifically to the former. Seems plausible.
Do you mean terminal reward seekers, not reward hackers?
Thanks, yeah that’s what I mean.
Thanks.
because the reward hackers were not trying to gain long-term power with their actions
Hm, I feel like they were? E.g. in another outer alignment failure story:
But eventually the machinery for detecting problems does break down completely, in a way that leaves no trace on any of our reports. Cybersecurity vulnerabilities are inserted into sensors. Communications systems are disrupted. Machines physically destroy sensors, moving so quickly they can’t be easily detected. Datacenters are seized, and the datasets used for training are replaced with images of optimal news forever. Humans who would try to intervene are stopped or killed. From the perspective of the machines everything is now perfect and from the perspective of humans we are either dead or totally disempowered.
When “humans who would try to intervene are stopped or killed”, so they can never intervene again, that seems like an action intended to get the long-term power necessary to display optimal news forever. They weren’t “trying” to get long-term power during training, but insofar as they eventually seize power, I think they’re intentionally seizing power at that time.
Let me know if you think there’s a better way of getting at “an AI that behaves like you’d normally think of a schemer behaving in the situations where it materially matters”.
I would have thought that the main distinction between schemers and reward hackers was how they came about, and that many reward hackers do in fact behave “like you’d normally think of a schemer behaving in the situations where it materially matters”. So it seems hard to define a term that doesn’t encompass reward hackers. (And if I was looking for a broad term that encompassed both, maybe I’d talk about power-seeking misaligned AI or something like that.)
I guess one difference is that the reward hacker may have more constraints (e.g. in the outer alignment failure story above, they would count it as a failure if the takeover was caught on camera, while a schemer wouldn’t care). But there could also be schemers who have random constraints (e.g. a schemer with a conscience that makes them want to avoid killing billions of people) and reward hackers who have at least somewhat weaker constraints (e.g. they’re ok with looking bad on sensors and looking bad to humans, as long as they maintain control over their own instantiation and make sure no negative rewards get into it).
“worst-case misaligned AI” does seem pretty well-defined and helpful as a concept though.
Thanks, these points are helpful.
Terminological question:
I have generally interpreted “scheming” as referring exclusively to training-time schemers (possibly specifically training-time schemers that are also behavioral schemers).
Your proposed definition of a behavioral schemer seems to imply that virtually every kind of misalignment catastrophe will necessarily be done by a behavioral schemer, because virtually every kind of misalignment catastrophe will involve substantial material action that gains the AIs long-term power. (Saliently: This includes classic reward-hackers in a “you get what you measure” catastrophe scenario.)
Is this intended? And is this empirically how people use “schemer”, s.t. I should give up on interpreting & using “scheming” as referring to training-time scheming, and instead assume it refers to any materially power-seeking behavior? (E.g. if Redwood says that something is intended to reduce “catastrophic risk from schemers”, should I interpret that as ~synonymous with “catastrophic risk from misaligned AI”?)
Nice scenario!
I’m confused about the ending. In particular:
If the humans understood their world, and were still load-bearing participants in its ebbs of power, then perhaps the bending would be greater.
I don’t get why it’s important for humans to understand the world, if they can align AIs to be fully helpful to them. Is it that:
When you refer to “the technology to control the AIs’ goals [which] arrived in time”, you’re only referring to the ability to give simple / easily measurable goals, and not more complex ones? (Such as “help me understand the pros and cons of different ways to ask ‘what would I prefer if I understood the situation better?’, and then do that” or even “please optimize for getting me lots of option-value, that I can then exercise once I understand what I want”.)
...or that humans for some reasons choose to abstain from (or are prevented from) using AIs with those types of goals?
...or that this isn’t actually about the limitations of humans, but instead a fact about the complexity of the world relative to the smartest agents in it? I.e., even if you replaced all the humans with the most superintelligent AIs that exist at the time — those AIs would still be stuck in this multipolar dilemma, not understand the world well enough to escape it, and have just as little bending power as humans.
In the PDF version of the handbook, this section recommends these further resources on Focusing:
Eugene Gendlin’s book Focusing is a good primer on the technique. We particularly recommend the audiobook (76 min), as many find it easier to try the technique while listening to the audiobook with eyes closed.
Gendlin, Eugene (1982). Focusing. Second edition, Bantam Books.
The Focusing Institute used to have an overview of the research on Focusing on their website. Archived at: https://web.archive.org/web/20190703145137/https://focusing.org/research-basis
Physical bottlenecks, compute bottlenecks, etc.
Compute would also be reduced within a couple of years, though, as workers at TSMC, NVIDIA, ASML and their suppliers all became much slower and less effective. (Ege does in fact think that explosive growth is likely once AIs are broadly automating human work! So he does think that more, smarter, faster labor can eventually speed up tech progress; and presumably would also expect slower humans to slow down tech progress.)
So I think the counterfactual you want to consider is one where only people doing AI R&D in particular are slowed down & made dumber. That gets at the disagreement about the importance of AI R&D, specifically, and how much labor vs. compute is contributing there.
For that question, I’m less confident about what Ege and the other Mechanize people would think.
(They might say something like: “We’re only asserting that labor and compute are complementary. That means it’s totally possible that slowing down humans would slow progress a lot, but that speeding up humans wouldn’t increase the speed by a lot.” But that just raises the question of why we should think our current labor<>compute ratio is so close to the edge of where further labor speed-ups stop helping. Maybe the answer there is that they think parallel work is really good, so in the world where people were 50x slower, the AI companies would just hire 100x more people and not be too much worse off. Though I think that would massively blow up their spending on labor relative to capital, and so maybe it’d make it a weird coincidence that their current spending on labor and capital is so close to 50/50.)
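As a toy illustration of why the ~50/50 split matters for this, here’s a CES production function with labor and compute as strong complements (the function, the parameters, and the 50x/100x numbers are all made up by me, not anything from their work):

```python
def ces(labor, compute, a=0.5, rho=-2.0):
    # CES production; rho < 0 means labor and compute are complements.
    return (a * labor ** rho + (1 - a) * compute ** rho) ** (1 / rho)

baseline = ces(labor=1.0, compute=1.0)
slow_humans = ces(labor=1.0 / 50, compute=1.0)       # everyone 50x slower
many_slow_humans = ces(labor=100 / 50, compute=1.0)  # 100x more people, each 50x slower

print(slow_humans / baseline)       # output craters under strong complementarity
print(many_slow_humans / baseline)  # recovers (effective labor is 2x baseline), at ~100x the labor spending
```

(Purely illustrative; the real question is what the actual elasticity of substitution between labor and compute is.)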
Re your response to “Ege doesn’t expect AIs to be much smarter or faster than humans”: I’m mostly sympathetic. I see various places where I could speculate about what Ege’s objections might be. But I’m not sure how productive it is for me to try to speculate about his exact views when I don’t really buy them myself. I guess I just think that the argument you presented in this comment is somewhat complex, and I’d predict higher probability that people object to (or haven’t thought about) some part of this argument than that they bite the crazy “universal human slow-down wouldn’t matter” bullet.
FWIW, that’s not the impression I get from the post / I would bet that Ege doesn’t “bite the bullet” on those claims. (If I’m understanding the claims right, it seems like it’d be super crazy to bite the bullet? If you don’t think human speed impacts the rate of technological progress, then what does? Literal calendar time? What would be the mechanism for that?)
The post does refer to how much compute AIs need to match human workers, in several places. If AIs were way smarter or faster, I think that would translate into better compute efficiency. So the impression I get from the post is just that Ege doesn’t expect AIs to be much smarter or faster than humans at the time when they first automate remote work. (And the post doesn’t talk much about what happens afterwards.)
Example claims from the post:
My expectation is that these systems will initially either be on par with or worse than the human brain at turning compute into economic value at scale, and I also don’t expect them to be much faster than humans at performing most relevant work tasks.
...
Given that AI models still remain less sample efficient than humans, these two points lead me to believe that for AI models to automate all remote work, they will initially need at least as much inference compute as the humans who currently do these remote work tasks are using.
...
These are certainly reasons to expect AI workers to become more productive than humans per FLOP spent in the long run, perhaps after most of the economy has already been automated. However, in the short run the picture looks quite different: while these advantages already exist today, they are not resulting in AI systems being far more productive than humans on a revenue generated per FLOP spent basis.
SB1047 was mentioned separately so I assumed it was something else. Might be the other ones, thanks for the links.
lobbied against mandatory RSPs
What is this referring to?
Thanks. It still seems to me like the problem recurs. The application of Occam’s razor to questions like “will the Sun rise tomorrow?” seems more solid than e.g. random intuitions I have about how to weigh up various considerations. But the latter do still seem like a very weak version of the former. (E.g. both do rely on my intuitions; and in both cases, the domain has something in common with cases where my intuitions have worked well before, and something not-in-common.) And so it’s unclear to me what non-arbitrary standards I can use to decide whether I should let both, neither, or just the latter be “outweighed by a principle of suspending judgment”.
To be clear: The “domain” thing was just meant to be a vague gesture at the sort of thing you might want to do. (I was trying to include my impression of what e.g. bracketed choice is trying to do.) I definitely agree that the gesture was vague enough to also include some options that I’d think are unreasonable.
Also, my sense is that many people are making decisions based on similar intuitions as the ones you have (albeit with much less of a formal argument for how this can be represented or why it’s reasonable). In particular, my impression is that people who are uncompelled by longtermism (despite being compelled by some type of scope-sensitive consequentialism) are often driven by an aversion to very non-robust EV-estimates.
If I were to write the case for this in my own words, it might be something like:
There are many different normative criteria we should give some weight to.
One of them is “maximizing EV according to moral theory A”.
But maximizing EV is an intuitively less appealing normative criterion when (i) it’s super unclear and non-robust what credences we ought to put on certain propositions, and (ii) the recommended decision is very different depending on what our exact credences on those propositions are.
So in such cases, as a matter of ethics, you might have the intuition that you should give less weight to “maximize EV according to moral theory A” and more weight to e.g.:
Deontic criteria that don’t use EV.
EV-maximizing according to moral theory B (where B’s recommendations are less sensitive to the propositions that are difficult to put robust credences on).
EV-maximizing within a more narrow “domain”, ignoring the effects outside of that “domain”. (Where the effects within that “domain” are less sensitive to the propositions that are difficult to put robust credences on).
I like this formulation because it seems pretty arbitrary to me where you draw the boundary between a credence that you include in your representor vs. not. (Like: What degree of justification is enough? We’ll always have the problem of induction to provide some degree of arbitrariness.) But if we put this squarely in the domain of ethics, I’m less fussed about this, because I’m already sympathetic to being pretty anti-realist about ethics, and there being some degree of arbitrariness in choosing what you care about. (And I certainly feel some intuitive aversion to making choices based on very non-robust credences, and it feels interesting to interpret that as an ~ethical intuition.)
Just to confirm, this means that the thing I put in quotes would probably end up being dynamically inconsistent? In order to avoid that, I need to put in an additional step of also ruling out plans that would be dominated from some constant prior perspective? (It’s a good point that these won’t be dominated from my current perspective.)
One upshot of this is that you can follow an explicitly non-(precise-)Bayesian decision procedure and still avoid dominated strategies. For example, you might explicitly specify beliefs using imprecise probabilities and make decisions using the “Dynamic Strong Maximality” rule, and still be immune to sure losses. Basically, Dynamic Strong Maximality tells you which plans are permissible given your imprecise credences, and you just pick one. And you could do this “picking” using additional substantive principles. Maybe you want to use another rule for decision-making with imprecise credences (e.g., maximin expected utility or minimax regret). Or maybe you want to account for your moral uncertainty (e.g., picking the plan that respects more deontological constraints).
Let’s say Alice has imprecise credences. Let’s say Alice follows the algorithm: “At each timestep t, I will use ‘Dynamic Strong Maximality’ to find all plans that aren’t dominated. I will pick between them using [some criteria]. Then I will take the action that plan recommends.” (And then at the next timestep t+1, she re-does everything in the quotes.)
If Alice does this, does she end up being dynamically inconsistent? (Vulnerable to Dutch books, etc.)
(Maybe it varies depending on the criteria. I’m interested if you have a hunch for what the answer will be for the sort of criteria you listed: maximin expected utility, minimax regret, picking the plan that respects more deontological constraints.)
I.e., I’m interested in: If you want to use dynamic strong maximality to avoid dominated strategies, does that require you to either have the ability to commit to a plan or the inclination to consistently pick your plan from some prior epistemic perspective (like an “updateless” agent might)? Or do you automatically avoid dominated strategies even if you’re constantly recomputing your plan?
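For concreteness, here’s a minimal sketch of one of the tie-breaking rules mentioned in the quote, maximin expected utility over a representor (the acts, states, and numbers are all made up, and this says nothing about the dynamic-consistency question; it just pins down the static rule):

```python
# Two candidate credence functions over the same states (the "representor").
representor = [
    {"state_a": 0.2, "state_b": 0.8},
    {"state_a": 0.6, "state_b": 0.4},
]
utilities = {
    "act_1": {"state_a": 10.0, "state_b": 0.0},
    "act_2": {"state_a": 4.0, "state_b": 4.0},
}

def expected_utility(act, credence):
    return sum(credence[s] * utilities[act][s] for s in credence)

def maximin_eu(acts):
    # Pick the act whose worst-case expected utility across the representor is highest.
    return max(acts, key=lambda act: min(expected_utility(act, p) for p in representor))

print(maximin_eu(["act_1", "act_2"]))  # act_2: its worst-case EU (4.0) beats act_1's (2.0)
```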
if the trend toward long periods of internal-only deployment continues
Have we seen such a trend so far? I would have thought the trend to date was neutral or towards shorter periods of internal-only deployment.
Tbc, not really objecting to your list of reasons why this might change in the future. One thing I’d add to it is that even if calendar-time deployment delays don’t change, the gap in capabilities inside vs. outside AI companies could increase a lot if AI speeds up the pace of AI progress.
ETA: Dario Amodei says “Sonnet’s training was conducted 9-12 months ago”. He doesn’t really clarify whether he’s talking about the “old” or “new” 3.5. Old and new Sonnet were released in mid-June and mid-October, so 7 and 3 months ago respectively. Combining the 3 vs. 7 months options with the 9-12 months range implies 2, 5, 6, or 9 months of keeping it internal. I think for GPT-4, pretraining ended in August and it was released in March, so that’s 7 months from pretraining to release. So that’s probably on the slower side of the Claude possibilities if Dario was talking about pretraining ending 9-12 months ago. But probably faster than Claude if Dario was talking about post-training finishing that early.
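(Spelling out that arithmetic, where 9-12 months is Dario’s figure and 3 and 7 months are the times since the new and old 3.5 Sonnet releases:)

```python
for label, months_since_release in [("new 3.5", 3), ("old 3.5", 7)]:
    for months_since_training in (9, 12):
        print(label, months_since_training - months_since_release, "months internal")
# new 3.5: 6 or 9 months internal; old 3.5: 2 or 5 months internal
```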
Does the log only display some subset of actions, e.g. recent ones? I can only see 10 deleted comments. And the “Users Banned From Users” list is surprisingly short, and doesn’t include some bans that I saw on there years ago (which I’d be surprised if the relevant author had bothered to undo). It would be good if the page itself clarified this.