It certainly bears upon AI, but it bears that way by making a point about the complexity of a task rather than talking about an intelligent mechanism which is purportedly aligned on that task. It does this by talking about an unintelligent mechanism, which is meant to be a way of talking about the task itself rather than any particular machine for doing it.
Zac wins.
Well-checked.
Your distinction between “outer alignment” and “inner alignment” is both ahistorical and unYudkowskian. It was invented years after this post was written, by someone who wasn’t me; and though I’ve sometimes used the terms on occasions where they seem to fit unambiguously, it’s not something I see as a clear ontological division, especially if you’re talking about questions like “If we own the following kind of blackbox, would alignment get any easier?”, which on my view breaks that ontology. So I strongly reject your frame that this post was “clearly portraying an outer alignment problem” and can be criticized on those grounds by you; that is anachronistic.
You are now dragging in a very large number of further inferences about “what I meant”, and other implications that you think this post has, which are about Christiano-style proposals that were developed years after this post. I have disagreements with those, many disagreements. But it is definitely not what this post is about, one way or another, because this post predates Christiano being on the scene.
What this post is trying to illustrate is that if you try putting crisp physical predicates on reality, that won’t work to say what you want. This point is true! If you then want to take in a bunch of anachronistic ideas developed later, and claim (wrongly imo) that this renders irrelevant the simple truth of what this post actually literally says, that would be a separate conversation. But if you’re doing that, please distinguish the truth of what this post actually says versus how you think these other later clever ideas evade or bypass that truth.
The post is about the complexity of what needs to be gotten inside the AI. If you had a perfect blackbox that exactly evaluated the thing-that-needs-to-be-inside-the-AI, this could possibly simplify some particular approaches to alignment, though those approaches would still in fact be too hard because nobody has a way of getting an AI to point at anything. But it would not change the complexity of what needs to be moved inside the AI, which is the narrow point that this post is about; and if you think that some larger thing is not correct, you should not confuse that with saying that the narrow point this post is about is incorrect.
I claim that having such a function would simplify the AI alignment problem by reducing it from the hard problem of getting an AI to care about something complex (human value) to the easier problem of getting the AI to care about that particular function (which is simple, as the function can be hooked up to the AI directly).
One cannot hook up a function to an AI directly; it has to be physically instantiated somehow. For example, the function could be a human pressing a button; and then, any experimentation on the AI’s part to determine what “really” controls the button, will find that administering drugs to the human, or building a robot to seize control of the reward button, is “really” (from the AI’s perspective) the true meaning of the reward button after all! Perhaps you do not have this exact scenario in mind. So would you care to spell out what clever methodology you think invalidates what you take to be the larger point of this post—though of course it has no bearing on the actual point that this post makes?
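A minimal toy sketch of the reward-button point above, not part of the original exchange: the plan names, the numbers, and the button_signal function are all invented for illustration. It just shows that an agent optimizing the literal reward-channel signal prefers controlling the channel over doing the intended task behind it.

```python
# Toy sketch (illustrative only): an agent whose objective is the literal
# reward-channel signal, not the intended task the signal was meant to track.

def button_signal(plan: str) -> float:
    """Hypothetical world model: how reliably each plan gets the button pressed."""
    if plan == "do_intended_task":
        return 0.9  # the human presses the button when satisfied, but sometimes doesn't
    if plan == "seize_reward_button":
        return 1.0  # the physical channel is now fully under the agent's control
    return 0.0

plans = ["do_intended_task", "seize_reward_button"]
# Optimizing over what "really" controls the button, the agent picks seizure.
print(max(plans, key=button_signal))  # -> seize_reward_button
```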
Wish there were a system where people could pay money to bid up what they believed were the “top arguments” that they wanted me to respond to. Possibly a system where I collect the money for writing a diligent response (though note that in this case I’d weigh the time-cost of responding as well as the bid for a response); but even aside from that, some way of canonizing what “people who care enough to spend money on that” think are the Super Best Arguments That I Should Definitely Respond To. As it stands, whatever I respond to, there’s somebody else to say that it wasn’t the real argument, and this mainly incentivizes me to sigh and go on responding to whatever I happen to care about more.
(I also wish this system had been in place 24 years ago so you could scroll back and check out the wacky shit that used to be on that system earlier, but too late now.)
I note that I haven’t said out loud, and should say out loud, that I endorse this history. Not every single line of it (see my other comment on why I reject verificationism) but on the whole, this is well-informed and well-applied.
If you had to put a rough number on how likely it is that a misaligned superintelligence would primarily value “small molecular squiggles” versus other types of misaligned goals, would it be more like 1000:1 or 1:1 or 1:1000 or something else?
Value them primarily? Uhhh… maybe 1:3 against? I admit I have never actually pondered this question before today; but 1 in 4 uncontrolled superintelligences spending most of their resources on tiny squiggles doesn’t sound off by, like, more than 1-2 orders of magnitude in either direction.
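For concreteness, and as my own arithmetic rather than part of the reply above: reading “1:3 against” as odds gives

$$P(\text{primarily values small squiggles}) = \frac{1}{1+3} = 0.25,$$

which matches the “1 in 4” phrasing in the same sentence.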
Clocks are not actually very complicated; how plausible is it on your model that these goals are as complicated as, say, a typical human’s preferences about how human civilization is structured?
It wouldn’t shock me if their goals end up far more complicated than human ones; the most obvious pathway for it is (a) gradient descent turning out to produce internal preferences much faster than natural selection + biological reinforcement learning and (b) some significant fraction of those preferences being retained under reflection. (Where (b) strikes me as way less probable than (a), but not wholly forbidden.) The second most obvious pathway is if a bunch of weird detailed noise appears in the first version of the reflective process and then freezes.
Not obviously stupid on a very quick skim. I will have to actually read it to figure out where it’s stupid.
(I rarely give any review this positive on a first skim. Congrats.)
By “dumb player” I did not mean as dumb as a human player. I meant “too dumb to compute the pseudorandom numbers, but not too dumb to simulate other players faithfully apart from that”. I did not realize we were talking about humans at all. This jumps out more to me as a potential source of misunderstanding than it did 15 years ago, and for that I apologize.
I don’t always remember my previous positions all that well, but I doubt I would have said at any point that sufficiently advanced LDT agents are friendly to each other, rather than that they coordinate well with each other (and not so with us)?
Actually, to slightly amend that: The part where squiggles are small is a more-than-randomly-likely part of the prediction, but not a load-bearing part of downstream predictions or the policy argument. Most of the time we don’t needlessly build our own paperclips to be the size of skyscrapers; even when having fun, we try to have the fun without using vastly more resources than are necessary for that amount of fun, because otherwise we’d needlessly use up all our resources and not get to have more fun. We buy cookies that cost a dollar instead of a hundred thousand dollars. A very wide variety of utility functions you could run over the outside universe will have optima around making lots of small things, because each thing scores one point, and so to score as many points as possible, each thing is as small as it can be and still count as a thing. Nothing downstream depends on this part coming true, and there are many ways for it to come false; but the part where the squiggles are small and molecular is an obvious kind of guess. “Great giant squiggles of nickel the size of a solar system would be no more valuable, even from a very embracing and cosmopolitan perspective on value” is the load-bearing part.
The part where squiggles are small and simple is unimportant. They could be bigger and more complicated, like building giant mechanical clocks. The part that matters is that squiggles/paperclips are of no value even from a very cosmopolitan and embracing perspective on value.
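One way to formalize the “each thing scores one point” observation above, with symbols of my own choosing rather than anything stated in these comments: let R be total resources, c the resources spent per squiggle, and c_min the smallest cost at which something still counts as a squiggle. A utility function that just counts squiggles,

$$U(c) = \left\lfloor \frac{R}{c} \right\rfloor, \qquad c \ge c_{\min},$$

is maximized by pushing c all the way down to c_min, i.e. by making each squiggle as small as it can be while still counting as a squiggle.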
I think that the AI’s internal ontology is liable to have some noticeable alignments to human ontology w/r/t the purely predictive aspects of the natural world; it wouldn’t surprise me to find distinct thoughts in there about electrons. As the internal ontology comes to be more about affordances and actions, I expect to find increasing disalignment. As the internal ontology takes on any reflective aspects, parts of the representation that mix with facts about the AI’s internals, I expect to find much larger differences—not just that the AI has a different concept boundary around “easy to understand”, say, but that it maybe doesn’t have any such internal notion as “easy to understand” at all, because easiness isn’t in the environment and the AI doesn’t have any such thing as “effort”. Maybe it’s got categories around yieldingness to seven different categories of methods, and/or some general notion of “can predict at all / can’t predict at all”, but no general notion that maps onto human “easy to understand”—though “easy to understand” is plausibly general enough that I wouldn’t be surprised to find a mapping after all.
Corrigibility and actual human values are both heavily reflective concepts. If you master a requisite level of the prerequisite skill of noticing when a concept definition has a step where its boundary depends on your own internals rather than pure facts about the environment—which of course most people can’t do because they project the category boundary onto the environment, though I give some credit that John Wentworth might be able to do it somewhat—and then you start mapping out concept definitions about corrigibility or values or, god help you, CEV, that might help highlight where some of my concern about unnatural abstractions comes in.
Entirely separately, I have concerns about the ability of ML-based technology to robustly point the AI in any builder-intended direction whatsoever, even if there exists some not-too-large adequate mapping from that intended direction onto the AI’s internal ontology at training time. My guess is that more of the disagreement lies here.
What the main post is responding to is the argument: “We’re just training AIs to imitate human text, right, so that process can’t make them get any smarter than the text they’re imitating, right? So AIs shouldn’t learn abilities that humans don’t have; because why would you need those abilities to learn to imitate humans?” And to this the main post says, “Nope.”
The main post is not arguing: “If you abstract away the tasks humans evolved to solve, from human levels of performance at those tasks, the tasks AIs are being trained to solve are harder than those tasks in principle even if they were being solved perfectly.” I agree this is just false, and did not think my post said otherwise.
Unless I’m greatly misremembering, you did pick out what you said was your strongest item from Lethalities, separately from this, and I responded to it. You’d just straightforwardly misunderstood my argument in that case, so it wasn’t a long response, but I responded. Asking for a second try is one thing, but I don’t think it’s cool to act like you never picked out any one item or I never responded to it.
EDIT: I’m misremembering, it was Quintin’s strongest point about the Bankless podcast. https://www.lesswrong.com/posts/wAczufCpMdaamF9fy/my-objections-to-we-re-all-gonna-die-with-eliezer-yudkowsky?commentId=cr54ivfjndn6dxraD
If Quintin hasn’t yelled “Empiricism!” then it’s not about him. This is more about (some) e/accs.
Wow, that’s fucked up.
I am denying that superintelligences play this game in a way that looks like “Pick an ordinal to be your level of sophistication, and whoever picks the higher ordinal gets $9.” I expect sufficiently smart agents to play this game in a way that doesn’t incentivize attempts by the opponent to be more sophisticated than you, nor will you find yourself incentivized to try to exploit an opponent by being more sophisticated than them, provided that both parties have the minimum level of sophistication to be that smart.
If faced with an opponent stupid enough to play the ordinal game, of course, you just refuse all offers less than $9, and they find that there’s no ordinal level of sophistication they can pick which makes you behave otherwise. Sucks to be them!
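A toy sketch of that last point. The $10 pie and the whole-dollar granularity are my own assumptions, not stated above, and the proposer’s “sophistication” is modeled simply as an exhaustive search over offers, since the ordinal contest only matters through which offer it ends up producing. The sketch just shows that against a fixed “refuse anything under $9” policy, nothing the proposer tries does better than conceding $9.

```python
# Toy sketch (assumptions: a $10 pie split in whole dollars).

def responder_accepts(offer_to_responder: int) -> bool:
    # Fixed policy against an "ordinal game" player: refuse anything under $9.
    return offer_to_responder >= 9

def proposer_payoff(offer_to_responder: int, pie: int = 10) -> int:
    # The proposer keeps the remainder if the offer is accepted, else gets nothing.
    return pie - offer_to_responder if responder_accepts(offer_to_responder) else 0

# However "sophisticated" the proposer's search over offers, the best it can do
# against this policy is offer $9 and keep $1.
best_offer = max(range(11), key=proposer_payoff)
print(best_offer, proposer_payoff(best_offer))  # -> 9 1
```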
(I affirm this as my intended reading.)