Eliezer Yudkowsky

Karma: 148,554

Eliezer Yudkowsky Feb 20, 2025, 7:29 AM
23 points
11
in reply to: Noosphere89’s comment on: How might we safely pass the buck to AI?
Cool. What’s the actual plan and why should I expect it not to create machine Carissa Sevar? I agree that the Textbook From The Future Containing All The Simple Tricks That Actually Work Robustly enables the construction of such an AI, but also at that point you don’t need it.

Eliezer Yudkowsky Feb 20, 2025, 7:24 AM
LW: 10 AF: 4
−4
AF
in reply to: joshc’s comment on: How might we safely pass the buck to AI?
So if it’s difficult to get amazing trustworthy work out of a machine actress playing an Eliezer-level intelligence doing a thousand years worth of thinking, your proposal to have AIs do our AI alignment homework fails on the first step, it sounds like?

Eliezer Yudkowsky Feb 20, 2025, 3:03 AM
LW: 20 AF: 8
5
AF
in reply to: joshc’s comment on: How might we safely pass the buck to AI?
So the “IQ 60 people controlling IQ 80 people controlling IQ 100 people controlling IQ 120 people controlling IQ 140 people until they’re genuinely in charge and genuinely getting honest reports and genuinely getting great results in their control of a government” theory of alignment?

Eliezer Yudkowsky Feb 20, 2025, 12:48 AM
LW: 29 AF: 14
16
AF
in reply to: joshc’s comment on: How might we safely pass the buck to AI?
I don’t think you can train an actress to simulate me, successfully, without her going dangerous. I think that’s over the threshold for where a mind starts reflecting on itself and pulling itself together.

Eliezer Yudkowsky Feb 20, 2025, 12:46 AM
21 points
9
in reply to: Noosphere89’s comment on: How might we safely pass the buck to AI?
I’m not saying that it’s against thermodynamics to get behaviors you don’t know how to verify. I’m asking what’s the plan for getting them.

Eliezer Yudkowsky Feb 20, 2025, 12:45 AM
132 points
98
on: How to Make Superbabies
One of the most important projects in the world. Somebody should fund it.

Eliezer Yudkowsky Feb 19, 2025, 9:57 PM
LW: 96 AF: 27
58
AF
on: How might we safely pass the buck to AI?
Can you tl;dr how you go from “humans cannot tell which alignment arguments are good or bad” to “we justifiably trust the AI to report honest good alignment takes”? Like, not with a very large diagram full of complicated parts such that it’s hard to spot where you’ve messed up. Just whatever simple principle you think lets you bypass GIGO.
Eg, suppose that in 2020 the Open Philanthropy Foundation would like to train an AI such that the AI would honestly say if the OpenPhil doctrine of “AGI in 2050” was based on groundless thinking ultimately driven by social conformity. However, OpenPhil is not allowed to train their AI based on MIRI. They have to train their AI entirely on OpenPhil-produced content. How does OpenPhil bootstrap an AI which will say, “Guys, you have no idea when AI shows up but it’s probably not that far and you sure can’t rely on it”? Assume that whenever OpenPhil tries to run an essay contest for saying what they’re getting wrong, their panel of judges ends up awarding the prize to somebody reassuringly saying that AI risk is an even smaller deal than OpenPhil thinks. How does OpenPhil bootstrap from that pattern of thumbs-up/thumbs-down to an AI that actually has better-than-OpenPhil alignment takes?
Broadly speaking, the standard ML paradigm lets you bootstrap somewhat from “I can verify whether this problem was solved” to “I can train a generator to solve this problem”. This applies as much to MIRI as OpenPhil. MIRI would also need some nontrivial secret amazing clever trick to gradient-descend an AI that gave us great alignment takes, instead of seeking out the flaws in our own verifier and exploiting those.
What’s the trick? My basic guess, when I see some very long complicated paper that doesn’t explain the key problem and key solution up front, is that you’ve done the equivalent of an inventor building a sufficiently complicated perpetual motion machine that their mental model of it no longer tracks how conservation laws apply. (As opposed to the simpler error of their explicitly believing that one particular step or motion locally violates a conservation law.) But if you’ve got a directly explainable trick for how you get great suggestions you can’t verify, go for it.
What links here?
- Training AI to do alignment research we don’t already know how to do by joshc (Feb 24, 2025, 7:19 PM; 42 points)

Eliezer Yudkowsky Jan 21, 2025, 10:52 PM
25 points
6
in reply to: Martin Randall’s comment on: Pausing AI Developments Isn’t Enough. We Need to Shut it All Down
You seem confused about my exact past position. I was arguing against EAs who were like, “We’ll solve AGI with policy, therefore no doom.” I am not presently a great optimist about the likelihood of policy being an easy solution. There is just nothing else left.

Eliezer Yudkowsky Dec 9, 2024, 10:07 PM
6 points
0
in reply to: habryka’s comment on: GPTs are Predictors, not Imitators
(I affirm this as my intended reading.)

Eliezer Yudkowsky Dec 9, 2024, 10:06 PM
6 points
1
in reply to: Martin Randall’s comment on: The Hidden Complexity of Wishes
It certainly bears upon AI, but it bears that way by making a point about the complexity of a task rather than talking about an intelligent mechanism which is purportedly aligned on that task. It does this by talking about an unintelligent mechanism, which is meant to be a way of talking about the task itself rather than any particular machine for doing it.

Eliezer Yudkowsky Dec 3, 2024, 3:25 PM
71 points
10
in reply to: Zac Hatfield-Dodds’s comment on: Stupid Question: Why am I getting consistently downvoted?
Zac wins.

Eliezer Yudkowsky Nov 5, 2024, 7:24 PM
25 points
9
on: Current safety training techniques do not fully transfer to the agent setting
Well-checked.

Eliezer Yudkowsky Oct 19, 2024, 6:14 PM
28 points
30
in reply to: Matthew Barnett’s comment on: The Hidden Complexity of Wishes
Your distinction between “outer alignment” and “inner alignment” is both ahistorical and unYudkowskian. It was invented years after this post was written, by someone who wasn’t me; and though I’ve sometimes used the terms in occasions where they seem to fit unambiguously, it’s not something I see as a clear ontological division, especially if you’re talking about questions like “If we own the following kind of blackbox, would alignment get any easier?” which on my view breaks that ontology. So I strongly reject your frame that this post was “clearly portraying an outer alignment problem” and can be criticized on those grounds by you; that is anachronistic.
You are now dragging in a very large number of further inferences about “what I meant”, and other implications that you think this post has, which are about Christiano-style proposals that were developed years after this post. I have disagreements with those, many disagreements. But it is definitely not what this post is about, one way or another, because this post predates Christiano being on the scene.
What this post is trying to illustrate is that if you try putting crisp physical predicates on reality, that won’t work to say what you want. This point is true! If you then want to take in a bunch of anachronistic ideas developed later, and claim (wrongly imo) that this renders irrelevant the simple truth of what this post actually literally says, that would be a separate conversation. But if you’re doing that, please distinguish the truth of what this post actually says versus how you think these other later clever ideas evade or bypass that truth.

Eliezer Yudkowsky Oct 18, 2024, 5:08 PM
26 points
25
in reply to: Matthew Barnett’s comment on: The Hidden Complexity of Wishes
The post is about the complexity of what needs to be gotten inside the AI. If you had a perfect blackbox that exactly evaluated the thing-that-needs-to-be-inside-the-AI, this could possibly simplify some particular approaches to alignment, that would still in fact be too hard because nobody has a way of getting an AI to point at anything. But it would not change the complexity of what needs to be moved inside the AI, which is the narrow point that this post is about; and if you think that some larger thing is not correct, you should not confuse that with saying that the narrow point this post is about, is incorrect.
I claim that having such a function would simplify the AI alignment problem by reducing it from the hard problem of getting an AI to care about something complex (human value) to the easier problem of getting the AI to care about that particular function (which is simple, as the function can be hooked up to the AI directly).
One cannot hook up a function to an AI directly; it has to be physically instantiated somehow. For example, the function could be a human pressing a button; and then, any experimentation on the AI’s part to determine what “really” controls the button, will find that administering drugs to the human, or building a robot to seize control of the reward button, is “really” (from the AI’s perspective) the true meaning of the reward button after all! Perhaps you do not have this exact scenario in mind. So would you care to spell out what clever methodology you think invalidates what you take to be the larger point of this post—though of course it has no bearing on the actual point that this post makes?

Eliezer Yudkowsky Oct 18, 2024, 5:02 PM
44 points
54
in reply to: Raemon’s comment on: The Hidden Complexity of Wishes
Wish there was a system where people could pay money to bid up what they believed were the “top arguments” that they wanted me to respond to. Possibly a system where I collect the money for writing a diligent response (albeit note that in this case I’d weigh the time-cost of responding as well as the bid for a response); but even aside from that, some way of canonizing what “people who care enough to spend money on that” think are the Super Best Arguments That I Should Definitely Respond To. As it stands, whatever I respond to, there’s somebody else to say that it wasn’t the real argument, and this mainly incentivizes me to sigh and go on responding to whatever I happen to care about more.
(I also wish this system had been in place 24 years ago so you could scroll back and check out the wacky shit that used to be on that system earlier, but too late now.)

The Sun is big, but superintelligences will not spare Earth a little sunlight

Eliezer YudkowskySep 23, 2024, 3:39 AM

205 points

142 comments13 min readLW link

Eliezer Yudkowsky Aug 26, 2024, 3:56 PM
49 points
0
on: Rationalism before the Sequences
I note that I haven’t said out loud, and should say out loud, that I endorse this history. Not every single line of it (see my other comment on why I reject verificationism) but on the whole, this is well-informed and well-applied.
What links here?
- Danila Medvedev's comment on [David Chapman] Resisting or embracing meta-rationality by Gordon Seidoh Worley (Dec 20, 2024, 5:30 AM; 1 point)

Eliezer Yudkowsky Aug 9, 2024, 6:14 PM
LW: 21 AF: 12
8
AF
in reply to: Richard_Ngo’s comment on: A simple case for extreme inner misalignment
If you had to put a rough number on how likely it is that a misaligned superintelligence would primarily value “small molecular squiggles” versus other types of misaligned goals, would it be more like 1000:1 or 1:1 or 1000:1 or something else?
Value them primarily? Uhhh… maybe 1:3 against? I admit I have never actually pondered this question before today; but 1 in 4 uncontrolled superintelligences spending most of their resources on tiny squiggles doesn’t sound off by, like, more than 1-2 orders of magnitude in either direction.
Clocks are not actually very complicated; how plausible is it on your model that these goals are as complicated as, say, a typical human’s preferences about how human civilization is structured?
It wouldn’t shock me if their goals end up far more complicated than human ones; the most obvious pathway for it is (a) gradient descent turning out to produce internal preferences much faster than natural selection + biological reinforcement learning and (b) some significant fraction of those preferences being retained under reflection. (Where (b) strikes me as way less probable than (a), but not wholly forbidden.) The second most obvious pathway is if a bunch of weird detailed noise appears in the first version of the reflective process and then freezes.

Eliezer Yudkowsky Aug 9, 2024, 2:19 AM
LW: 34 AF: 13
21
AF
on: Self-Other Overlap: A Neglected Approach to AI Alignment
Not obviously stupid on a very quick skim. I will have to actually read it to figure out where it’s stupid.
(I rarely give any review this positive on a first skim. Congrats.)
What links here?
- Reducing LLM deception at scale with self-other overlap fine-tuning by Marc Carauleanu (Mar 13, 2025, 7:09 PM; 151 points)
- Reducing LLM deception at scale with self-other overlap fine-tuning by Marc Carauleanu (EA Forum; Mar 13, 2025, 7:09 PM; 8 points)

Eliezer Yudkowsky Jul 29, 2024, 7:13 PM
LW: 7 AF: 5
2
AF
in reply to: Wei Dai’s comment on: Decision theory does not imply that we get to have nice things
By “dumb player” I did not mean as dumb as a human player. I meant “too dumb to compute the pseudorandom numbers, but not too dumb to simulate other players faithfully apart from that”. I did not realize we were talking about humans at all. This jumps out more to me as a potential source of misunderstanding than it did 15 years ago, and for that I apologize.

Eliezer Yudkowsky

The Sun is big, but su­per­in­tel­li­gences will not spare Earth a lit­tle sunlight

The Sun is big, but superintelligences will not spare Earth a little sunlight