Suppose you monitor the CoT and the model is rederiving something it's not supposed to know. You could halt that particular CoT, but each halt-and-retry selects for instances where the model manages to escape notice, possibly through luck, possibly through steganography. See "Nearest Unblocked Strategy".
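To make the selection-pressure point concrete, here's a toy Python sketch (the `sample_cot` / `monitor_flags` functions are made-up stand-ins, not any real API): halting and resampling flagged CoTs amounts to conditioning on "passes the monitor", so the rare traces that evade the monitor end up overrepresented among the survivors.

```python
import random

# Toy stand-ins: in practice these would be a model sampler and a CoT monitor
# (e.g. a classifier over the reasoning trace). Names and numbers are illustrative.

def sample_cot():
    """Pretend model: most traces openly rederive the forbidden fact,
    a few hide it (luck / steganography), a few genuinely avoid it."""
    return random.choices(
        ["openly_rederives", "hidden_rederivation", "genuinely_avoids"],
        weights=[0.90, 0.05, 0.05],
    )[0]

def monitor_flags(cot):
    """The monitor only catches the overt case."""
    return cot == "openly_rederives"

def run_with_halting(max_retries=50):
    """Halt-and-resample loop: keep discarding flagged CoTs."""
    for _ in range(max_retries):
        cot = sample_cot()
        if not monitor_flags(cot):
            return cot  # whatever survives the filter is what actually runs
    return None

# Before filtering, hidden rederivations were 5% of traces; among survivors
# they're roughly as common as genuine avoidance (~50/50).
survivors = [run_with_halting() for _ in range(10_000)]
hidden = sum(s == "hidden_rederivation" for s in survivors)
print(f"fraction of survivors that evade the monitor: {hidden / len(survivors):.2f}")
```

Obviously the real dynamics are worse than this i.i.d. toy, since training against the monitor actively optimizes the model toward the evading traces rather than just selecting among fixed ones.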
That’s why I think CoT monitoring alone is insufficient, and people should be thinking about additional (runtime?) defenses. Curious to hear if you have any ideas!
A few months ago, someone here suggested that more x-risk advocacy should go through comedians and podcasts.
YouTube just recommended this Joe Rogan clip from a few days ago: "The Worst Case Scenario for AI". Joe Rogan legitimately seemed pretty freaked out.
@So8res, maybe you could get Yampolskiy to refer you to Rogan for a podcast appearance promoting your book?