My AI Model Delta Compared To Yudkowsky
Preamble: Delta vs Crux
I don’t natively think in terms of cruxes. But there’s a similar concept which is more natural for me, which I’ll call a delta.
Imagine that you and I each model the world (or some part of it) as implementing some program. Very oversimplified example: if I learn that e.g. it’s cloudy today, that means the “weather” variable in my program at a particular time[1] takes on the value “cloudy”. Now, suppose your program and my program are exactly the same, except that somewhere in there I think a certain parameter has value 5 and you think it has value 0.3. Even though our programs differ in only that one little spot, we might still expect very different values of lots of variables during execution—in other words, we might have very different beliefs about lots of stuff in the world.
If your model and my model differ in that way, and we’re trying to discuss our different beliefs, then the obvious useful thing-to-do is figure out where that one-parameter difference is.
That’s a delta: one or a few relatively “small”/local differences in belief, which when propagated through our models account for most of the differences in our beliefs.
For those familiar with Pearl-style causal models: think of a delta as one or a few do() operations which suffice to make my model basically match somebody else’s model, or vice versa.
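To make that concrete, here is a minimal toy sketch (all names and numbers here are invented for illustration, not taken from anyone's actual models): two copies of the same "program" that differ in a single parameter end up with very different downstream beliefs, and one do()-style intervention on that parameter closes the gap.

```python
# Toy sketch only: a shared "program" whose downstream beliefs all hinge on one parameter.
def world_model(alien_ontology_risk: float) -> dict:
    p_translation_fails = alien_ontology_risk          # chance human concepts don't translate
    p_alignment_tractable = 1.0 - p_translation_fails  # crude downstream propagation
    p_doom = 0.1 + 0.9 * p_translation_fails           # more downstream propagation
    return {
        "p_translation_fails": p_translation_fails,
        "p_alignment_tractable": p_alignment_tractable,
        "p_doom": p_doom,
    }

mine = world_model(alien_ontology_risk=0.15)    # hypothetical parameter value for one person
theirs = world_model(alien_ontology_risk=0.95)  # hypothetical parameter value for the other
print(mine["p_doom"], theirs["p_doom"])         # very different beliefs from one small difference

# The delta: a single do()-style intervention on that one parameter makes the models match.
assert world_model(alien_ontology_risk=0.95) == theirs
```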
This post is about my current best guesses at the delta between my AI models and Yudkowsky’s AI models. When I apply the delta outlined here to my models, and propagate the implications, my models basically look like Yudkowsky’s as far as I can tell. That said, note that this is not an attempt to pass Eliezer’s Intellectual Turing Test; I’ll still be using my own usual frames.
This post might turn into a sequence if there’s interest; I already have another one written for Christiano, and people are welcome to suggest others they’d be interested in.
My AI Model Delta Compared To Yudkowsky
Best guess: Eliezer basically rejects the natural abstraction hypothesis. He mostly expects AI to use internal ontologies fundamentally alien to the ontologies of humans, at least in the places which matter. Lethality #33 lays it out succinctly:
> 33. The AI does not think like you do, the AI doesn’t have thoughts built up from the same concepts you use, it is utterly alien on a staggering scale. Nobody knows what the hell GPT-3 is thinking, not only because the matrices are opaque, but because the stuff within that opaque container is, very likely, incredibly alien—nothing that would translate well into comprehensible human thinking, even if we could see past the giant wall of floating-point numbers to what lay behind.
What do my models look like if I propagate that delta? In worlds where natural abstraction basically fails, we are thoroughly and utterly fucked, and a 99% probability of doom strikes me as entirely reasonable and justified.
Here’s one oversimplified doom argument/story in a world where natural abstraction fails hard:
Humanity is going to build superhuman goal-optimizing agents. (’Cause, like, obviously somebody’s going to do that, there’s no shortage of capabilities researchers loudly advertising that they’re aiming to do that exact thing.) These will be so vastly more powerful than humans that we have basically-zero bargaining power except insofar as AIs are aligned to our interests.
We’re assuming natural abstraction basically fails, so those AI systems will have fundamentally alien internal ontologies. For purposes of this overcompressed version of the argument, we’ll assume a very extreme failure of natural abstraction, such that human concepts cannot be faithfully and robustly translated into the system’s internal ontology at all. (For instance, maybe a faithful and robust translation would be so long in the system’s “internal language” that the translation wouldn’t fit in the system.)
Then:
Obviously full value alignment is out.
Robust and faithful instruction following or “do what I mean” is out; the meaning of human words/concepts can’t be robustly and faithfully represented in the system’s internal ontology at all.
Corrigibility is out, unless (here lies one of Eliezer’s hopes) corrigibility turns out to be such a natural concept that it can faithfully and robustly translate even into the ontology of a very alien AI.
Insofar as an AI cares-as-a-terminal-goal about keeping humans around, it will care about its own alien conception of “humans” which does not match ours, and will happily replace us with less resource-intensive (or otherwise preferable) things which we would not consider “human”.
Interpretability is, at best, some weak correlative heuristics which won’t generalize well. The lack of 99% reliability in mechinterp is not just because our current methods are primitive.
Etc, etc. All of the technical alignment hopes are out, unless we posit some objective natural enough that it can be faithfully and robustly translated into the AI’s internal ontology despite the alien-ness.
It’s not like this gets any better over time; if anything, AIs’ internal ontologies just keep getting more alien as their power level ramps up.
… so we die as soon as one of these superhuman goal-optimizing agents applies enough optimization pressure to the world and the faithfulness/robustness of the translation fails. (Actually, Eliezer expects, we’re likely to die of easier problems before then, but even if our species’ competence is far higher than currently seems, the translation problem would kill us.)
As an added bonus, the AIs will know all this (‘cause, y’know, they’re smart), will therefore know that divergence between their goals and humans’ goals is inevitable (because their goals are in fundamentally alien ontologies and therefore will diverge out-of-distribution), and will therefore be incentivized to strategically hide their long-term intentions until it’s time for the humans to go.
Note that the “oversimplification” of the argument mostly happened at step 2; the actual expectation here would be that a faithful and robust translation of human concepts is long in the AI’s internal language, which means we would need very high precision in order to instill the translation. But that gets into a whole other long discussion.
By contrast, in a world where natural abstraction basically works, the bulk of human concepts can be faithfully and robustly translated into the internal ontology of a strong AI (and the translation isn’t super-long). So, all those technical alignment possibilities are back on the table.
That hopefully gives a rough idea of how my models change when I flip the natural abstraction bit. It accounts for most of the currently-known-to-me places where my models diverge from Eliezer’s. I put nontrivial weight (maybe about 10-20%) on the hypothesis that Eliezer is basically correct on this delta, though it’s not my median expectation.
[1] particular time = particular point in the unrolled execution of the program
I think that the AI’s internal ontology is liable to have some noticeable alignments to human ontology w/r/t the purely predictive aspects of the natural world; it wouldn’t surprise me to find distinct thoughts in there about electrons. As the internal ontology becomes more about affordances and actions, I expect to find increasing disalignment. As the internal ontology takes on any reflective aspects, parts of the representation that mix with facts about the AI’s internals, I expect to find much larger differences—not just that the AI has a different concept boundary around “easy to understand”, say, but that it maybe doesn’t have any such internal notion as “easy to understand” at all, because easiness isn’t in the environment and the AI doesn’t have any such thing as “effort”. Maybe it’s got categories around yieldingness to seven different categories of methods, and/or some general notion of “can predict at all / can’t predict at all”, but no general notion that maps onto human “easy to understand”—though “easy to understand” is plausibly general-enough that I wouldn’t be surprised to find a mapping after all.
Corrigibility and actual human values are both heavily reflective concepts. If you master a requisite level of the prerequisite skill of noticing when a concept definition has a step where its boundary depends on your own internals rather than pure facts about the environment—which of course most people can’t do because they project the category boundary onto the environment, but I have some credit that John Wentworth might be able to do it some—and then you start mapping out concept definitions about corrigibility or values or god help you CEV, that might help highlight where some of my concern about unnatural abstractions comes in.
Entirely separately, I have concerns about the ability of ML-based technology to robustly point the AI in any builder-intended direction whatsoever, even if there exists some not-too-large adequate mapping from that intended direction onto the AI’s internal ontology at training time. My guess is that more of the disagreement lies here.
So, would you also say that two random humans are likely to have similar misalignment problems w.r.t. each other? E.g. my brain is different from yours, so the concepts I associate with words like “be helpful” and “don’t betray Eliezer” and so forth are going to be different from the concepts you associate with those words, and in some cases there might be strings of words that are meaningful to you but totally meaningless to me, and therefore if you are the principal and I am your agent, and we totally avoid problem #2 (in which you give me instructions and I just don’t follow them, even the as-interpreted-by-me version of them) you are still screwed? (Provided the power differential between us is big enough?)
I assumed the idea here was that AGI has a different mind architecture and thus also has different internal concepts for reflection. E.g. where a human might think about a task in terms of required willpower, an AGI might instead have internal concepts for required power consumption or compute threads or something.
Since human brains all share more or less the same architecture, you’d only expect significant misalignment between them if specific brains differed a lot from one another: e.g. someone with brain damage vs. a genius, or (as per an ACX post) a normal human vs. some one-of-a-kind person who doesn’t experience suffering due to some genetic quirk.
Or suppose we could upload people: then a flesh-and-blood human with a physical brain would have a different internal architecture from a digital human with a digital brain simulated on physical computer hardware. In which case their reflective concepts might diverge insofar as the simulation was imperfect and leaked details about the computer hardware and its constraints.
So it sounds like you are saying, it’s a matter of degree, not kind: Two humans will have minor differences between each other, and some humans (such as those with genetic quirks) will have major differences between each other. But AIs vs. humans will have lots of major differences between each other.
So, how much difference is too much, then? What’s the case that the AI-to-human differences (which are undoubtedly larger than the human-to-human differences) are large enough to cause serious problems (even in worlds where we avoid problem #2)?
I thought this is what the “Shoggoth” metaphor for LLMs and AI assistants is pointing at: When reasoning about nonhuman minds, we employ intuitions that we’d evolved to think about fellow humans. Consequently, many arguments against AI x-risk from superintelligent agents employ intuitions that route through human-flavored concepts like kindness, altruism, reciprocity, etc.
The strength or weakness of those kinds of arguments depends on the extent to which the superintelligent agent uses or thinks in those human concepts. But those concepts arose in humans through the process of evolution, which is very different from how ML-based AIs are designed. Therefore there’s no prima facie reason to expect that a superintelligent AGI, designed with a very different mind architecture, would employ those human concepts. And so those aforementioned intuitions that argue against x-risk are unconvincing.
For example, if I ask an AI assistant to respond as if it’s Abraham Lincoln, then human concepts like kindness are not good predictors for how the AI assistant will respond, because it’s not actually Abraham Lincoln, it’s more like a Shoggoth pretending to be Abraham Lincoln.
In contrast, if we encountered aliens, they would presumably have arisen from evolution, in which case their mind architectures would be closer to ours than an artificially designed AGI’s, and this would make our intuitions comparatively more applicable. Although that wouldn’t suffice for value alignment with humanity. Related fiction: EY’s Three Worlds Collide.
Somewhat disagree here—while we can’t use kindness to predict the internal “thought process” of the AI, [if we assume it’s not actively disobedient] the instructions mean that it will use an internal lossy model of what humans mean by kindness, and incorporate that into its act. Similar to how a talented human actor can realistically play a serial killer without having a “true” understanding of the urge to serially-kill people irl.
That’s a fair rebuttal. The actor analogy seems good: an actor will behave more or less like Abraham Lincoln in some situations, and very differently in others: e.g. on movie set vs. off movie set, vs. being with family, vs. being detained by police.
Similarly, the shoggoth will output similar tokens to Abraham Lincoln in some situations, and very different ones in others: e.g. in-distribution requests of famous Abraham Lincoln speeches, vs. out-of-distribution requests like asking for Abraham Lincoln’s opinions on 21st century art, vs. requests which invoke LLM token glitches like SolidGoldMagikarp, vs. disallowed requests that are denied by company policy & thus receive some boilerplate corporate response.
“I assumed the idea here was that AGI has a different mind architecture and thus also has different internal concepts for reflection.”
It is not just the internal architecture. An AGI will have a completely different set of actuators and sensors compared to humans.
I doubt much disagreement between you and me lies there, because I do not expect ML-style training to robustly point an AI in any builder-intended direction. My hopes generally don’t route through targeting via ML-style training.
I do think my deltas from many other people lie there—e.g. that’s why I’m nowhere near as optimistic as Quintin—so that’s also where I’d expect much of your disagreement with those other people to lie.
Okay so where do most of your hopes route through then?
There isn’t really one specific thing, since we don’t yet know what the next ML/AI paradigm will look like, other than that some kind of neural net will probably be involved somehow. (My median expectation is that we’re ~1 transformers-level paradigm shift away from the things which take off.) But as a relatively-legible example of a targeting technique my hopes might route through: Retargeting The Search.
Actual human values depend on human internals, but predictions about systems that strongly couple to human behavior depend on human internals as well. I thus expect efficient representations of systems that strongly couple to human behavior to include human values as somewhat explicit variables. I expect this because humans seem agent-like enough that modeling them as trying to optimize for some set of goals is a computationally efficient heuristic in the toolbox for predicting humans.
At lower confidence, I also think human expected-value-trajectory-under-additional-somewhat-coherent-reflection would show up explicitly in the thoughts of AIs that try to predict systems strongly coupled to humans. I think this because humans seem to change their values enough over time in a sufficiently coherent fashion that this is a useful concept to have. E.g., when watching my cousin grow up, I find it useful and possible to have a notion in advance of what they will come to value when they are older and think more about what they want.
I do not think there is much reason by default for the representations of these human values and human value trajectories to be particularly related to the AI’s values in a way we like. But that they are in there at all sure seems like it’d make some research easier, compared to the counterfactual. For example, if you figure out how to do good interpretability, you can look into an AI and get a decent mathematical representation of human values and value trajectories out of it. This seems like a generally useful thing to have.
If you separately happen to have developed a way to point AIs at particular goals, perhaps also downstream of you having figured out how to do good interpretability[1], then having explicit access to a decent representation of human values and human expected-value-trajectories-under-additional-somewhat-coherent-reflection might be a good starting point for research on making superhuman AIs that won’t kill everyone.
[1] By ‘good interpretability’, I don’t necessarily mean interpretability at the level where we understand a forward pass of GPT-4 so well that we can code our own superior LLM by hand in Python like a GOFAI. It might need to be better interpretability than that. This is because an AI’s goals, by default, don’t need to be explicitly represented objects within the parameter structure of a single forward pass.
Sure, but the sort of thing that people actually optimize for (revealed preferences) tends to be very different from what they proclaim to be their values. This is a point not often raised in polite conversation, but to me it’s a key reason for the thing people call “value alignment” being incoherent in the first place.
I kind of expect that things-people-call-their-values-that-are-not-their-revealed-preferences would be a concept that a smart AI that predicts systems coupled to humans would think in as well. It doesn’t matter whether these stated values are ‘incoherent’ in the sense of not being in tune with actual human behavior, they’re useful for modelling humans because humans use them to model themselves, and these self-models couple to their behavior. Even if they don’t couple in the sense of being the revealed-preferences in an agentic model of the humans’ actions.
Every time a human tries and mostly fails to explain what things they’d like to value if only they were more internally coherent and thought harder about things, a predictor trying to forecast their words and future downstream actions has a much easier time of it if they have a crisp operationalization of the endpoint the human is failing to operationalize.
An analogy: If you’re trying to predict what sorts of errors a diverse range of students might make while trying to solve a math problem, it helps to know what the correct answer is. Or if there isn’t a single correct answer, what the space of valid answers looks like.
Oh, sure, I agree that an ASI would understand all of that well enough, but even if it wanted to, it wouldn’t be able to give us either all of what we think we want, or what we would endorse in some hypothetical enlightened way, because neither of those things comprises a coherent framework that robustly generalizes far out-of-distribution for human circumstances, even for one person, never mind the whole of humanity.
The best we could hope for is that some-true-core-of-us-or-whatever would generalize in such a way, that the AI recognizes this, and that it propagates that core while sacrificing inessential contradictory parts. But given that our current state of moral philosophy is hopelessly out of its depth relative to this, to the extent that people rarely even acknowledge these issues, trusting that AI would get this right seems like a desperate gamble to me, even granting that we somehow could make it want to.
Of course, it doesn’t look like we would get to choose not to be subjected to a gamble of this sort even if more people were aware of it, so maybe it’s better for them to remain in blissful ignorance for now.
Could anyone possibly offer 2 positive and 2 negative examples of a reflective-in-this-sense concept?
Positive: “easy to understand”, “appealing”, “native (according to me) representation”
Negative: “apple”, “gluon”, “marriage”
The concept of marriage depends on my internals in that a different human might disagree about whether a couple is married, based on the relative weight they place on religious, legal, traditional, and common law conceptions of marriage. For example, after a Catholic annulment and a legal divorce, a Catholic priest might say that two people were never married, whereas I would say that they were. Similarly, I might say that two men are married to each other, and someone else might say that this is impossible. How quickly those arguments have faded away! I don’t think someone would have used the same example ten years ago.
It seems like “human values” aren’t particularly reflective then? Like I could describe the behavioral properties of a species of animal, including what they value or don’t value.
But that leaves something out?
A lot of the particulars of humans’ values are heavily reflective. Two examples:
A large chunk of humans’ terminal values involves their emotional/experience states—happy, sad, in pain, delighted, etc.
Humans typically want ~terminally to have some control over their own futures.
Contrast that to e.g. a blue-minimizing robot, which just tries to minimize the amount of blue stuff in the universe. That utility function involves reflection only insofar as the robot is (or isn’t) blue.
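Here is a tiny illustrative sketch of that distinction (the variable and function names are hypothetical, just to make “reflective” concrete): a blue-minimizing objective only references the environment, whereas a more human-like objective references the agent’s own internal states.

```python
from dataclasses import dataclass

@dataclass
class WorldState:
    blue_stuff: float           # amount of blue stuff in the environment

@dataclass
class SelfModel:
    felt_happiness: float       # an internal, experiential variable
    control_over_future: float  # another self-referential variable

def blue_minimizer_utility(world: WorldState) -> float:
    # Non-reflective: a pure function of facts about the environment.
    return -world.blue_stuff

def human_like_utility(world: WorldState, self_model: SelfModel) -> float:
    # Reflective: the objective references the agent's own internal states
    # (in addition to the environment), so faithfully translating it requires
    # translating facts about the agent's internals, not just the world.
    environment_term = -0.1 * world.blue_stuff  # placeholder stand-in
    return self_model.felt_happiness + 0.5 * self_model.control_over_future + environment_term
```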
I think you can unroll any of the positive examples by references to facts about the speaker. To be honest, I don’t understand what is supposed to be so reflective about “actual human values”, but perhaps it’s that the ontology is defined with reference to fairly detailed empirical facts about humans.
If I encountered an intelligent extraterrestrial species, in principle I think I could learn to predict fairly well things like what it finds easy to understand, what its values are, and what it considers to be ethical behavior, without using any of the cognitive machinery I use to self-reflect. Humans tend to reason about other humans by asking “what would I think if I was in their situation”, but in principle an AI doesn’t have to work that way. But perhaps you think there are strong reasons why this would happen in practice?
Suppose we had strong reasons to believe that an AI system wasn’t self-aware and wasn’t capable of self-reflection, so that it could look over a plan it generated and reason about its understandability, corrigibility, impact on human values, etc. without any reflective aspects. Does that make alignment any easier, according to you?
Supposing the AI lacks a concept of “easy to understand”, as you hypothesize. Does it seem reasonable to think that it might not be all that great at convincing a gatekeeper to unbox it, since it might focus on super complex arguments which humans can’t understand?
Is this mostly about mesa-optimizers, or something else?
A potential big Model Delta in this conversation is between Yudkowsky-2022 and Yudkowsky-2024. From List of Lethalities:
Vs the parent comment:
Yudkowsky is “not particularly happy” with List of Lethalities, and this comment was made a day after the opening post, so neither quote should be considered a perfect expression of Yudkowsky’s belief. In particular the second quote is more epistemically modest, which might be because it is part of a conversation rather than a self-described “individual rant”. Still, the differences are stark. Is the AI utterly, incredibly alien “on a staggering scale”, or does the AI have “noticeable alignments to human ontology”? Are the differences pervasive with “nothing that would translate well”, or does it depend on whether the concepts are “purely predictive”, about “affordances and actions”, or have “reflective aspects”?
The second quote is also less lethal. Human-to-human comparisons seem instructive. A deaf human will have thoughts about electrons, but their internal ontology around affordances and actions will be less aligned. Someone like Eliezer Yudkowsky has the skill of noticing when a concept definition has a step where its boundary depends on your own internals rather than pure facts about the environment, whereas I can’t do that because I project the category boundary onto the environment. Someone with dissociative identities may not have a general notion that maps onto my “myself”. Someone who is enlightened may not have a general notion that maps onto my “I want”. And so forth.
Regardless, different ontologies is still a clear risk factor. The second quote still modestly allows the possibility of a mind so utterly alien that it doesn’t have thoughts about electrons. And there are 42 other lethalities in the list. Security mindset says that risk factors can combine in unexpected ways and kill you.
I’m not sure if this is an update from Yudkowsky-2022 to Yudkowsky-2024. I might expect an update to be flagged as such (eg “I now think that...” instead of “I think that...”). But Yudkowsky said elsewhere that he has made some positive updates. I’m curious if this is one of them.
This is probably the wrong place to respond to the notion of incommensurable ontologies. Oh well, sorry.
While I agree that if an agent has a thoroughly incommensurable ontology, alignment is impossible (or perhaps even meaningless or incoherent), it also means that the agent has no access whatsoever to human science. If it can’t understand what we want, it also can’t understand what we’ve accomplished. To be more concrete, it will not understand electrons from any of our books, because it won’t understand our books. It won’t understand our equations, because it won’t understand equations nor will it have referents (neither theoretical nor observational) for the variables and entities contained there.
Consequently, it will have to develop science and technology from scratch. It took a long time for us to do that, and it will take that agent a long time to do it. Sure, it’s “superintelligent,” but understanding the physical world requires empirical work. That is time-consuming, it requires tools and technology, etc. Furthermore, an agent with an incommensurable ontology can’t manipulate humans effectively—it doesn’t understand us at all, aside from what it observes, which is a long, slow way to learn about us. Indeed it doesn’t even know that we are a threat, nor does it know what a threat is.
Long story short, it will be a long time (decades? centuries?) before such an agent would be able to prevent us from simply unplugging it. Science does not and cannot proceed at the speed of computation, so all of the “exponential improvement” in its “intelligence” is limited by the pace of knowledge growth.
Now, what if it has some purchase on human ontology? Well, then, it seems likely that it can grow that to a sufficient subset and in that way we can understand each other sufficiently well—it can understand our science, but also it can understand our values.
The point is that if you have one, you’re likely to have the other. Of course, this does not mean that it will align with those values. But the incommensurable ontology argument just reduces to an argument for slow takeoff.
I’ve published this point as part of a paper in Informatica. https://www.informatica.si/index.php/informatica/article/view/1875
As to the last point, I agree that it seems likely that most iterations of AI cannot be “pointed in a builder-intended direction” robustly. It’s like thinking you get the last word on your children’s lifetime worth of thinking. Most likely (and hopefully!) they’ll be doing their own thinking at some point, and if the only thing the parent has said about that is “thou shalt not think beyond me”, then (conditioning on our having gotten to AGI and still being around to talk about it) the most likely result may be to remove ANY chance to influence them as adults. Life may not come with guarantees, who knew?
Warmly,
Keith
It could be worth exploring reflection in transparency-based AIs, whose internals are observable. We could train a learning AI which only learns concepts by grounding them in the AI’s internals (consider the example of a language-based AI learning a representation linking the act of saying words to its output procedure). Even if AI-learned concepts do not coincide with human concepts, because the AI’s internals differ greatly from human experience (e.g. a notion of “easy to understand” taking on only a metaphorical meaning for an AI), AI-learned concepts remain interpretable to the programmer of the AI, given the AI’s transparency (and the programmer could engineer control mechanisms to deal with disalignment). In other words, there will be unnatural abstractions, but they will be discoverable, on the condition that we train a different kind of AI, as opposed to current methods, which are not inherently interpretable. This is monumental work, but desperately needed work.
Consider this my vote to turn it into a sequence, and to go on for as long as you can. I would be interested in one for Chris Olah, as well as the AI Optimists.
The AI Optimists (i.e. the people in the associated Discord server) have a lot of internal disagreement[1], to the point that I don’t think it’s meaningful to talk about the delta between John and them. That said, I would be interested in specific deltas e.g. with @TurnTrout, in part because he thought we’d get death by default and now doesn’t think that, has distanced himself from LW, and if he replies, is more likely to have a productive argument w/ John than Quintin Pope or Nora Belrose would. Not because he’s better, but because I think John and him would be more legible to each other.
Source: I’m on the AI Optimists Discord server and haven’t seen much to alter my prior belief that ~ everyone in alignment disagrees with everyone else.
This particular delta seems very short, why spend longer discussing it?
I meant turn the “delta compared to X” into a sequence, which was my understanding of the sentence in the OP.
Consider my vote for Vanessa Kosoy and Scott Garrabrant deltas. I don’t really know what their models are. I can guess what the deltas between you and Evan Hubinger are, but that would also be interesting. All of these would be less interesting than Christiano deltas, though.
I’m trying to understand this debate, and probably failing.
> human concepts cannot be faithfully and robustly translated into the system’s internal ontology at all.
I assume we all agree that the system can understand the human ontology, though? This is at least necessary for communicating and reasoning about humans, which LLMs can clearly already do to some extent.
There’s a lot of work around mapping ontologies, and this is known to be difficult, but very possible—especially for a superhuman intelligence.
So, I fail to see what exactly the problem is. If this smarter system can understand and reason about human ways of thinking about the world, I assume it could optimize for these ways if it wanted to. I assume the main question is if it wants to—but I fail to understand how this is an issue of ontology.
If a system really couldn’t reason about human ontologies, then I don’t see how it would understand the human world at all.
I’d appreciate any posts that clarify this question.
This would probably need a whole additional post to answer fully, but I can kinda gesture briefly in the right direction.
Let’s use a standard toy model: an AI which models our whole world using quantum fields directly. Does this thing “understand the human ontology”? Well, the human ontology is embedded in its model in some sense (since there are quantum-level simulations of humans embedded in its model), but the AI doesn’t actually factor any of its cognition through the human ontology. So if we want to e.g. translate some human instructions or human goals or some such into that AI’s ontology, we need a full quantum-level specification of the instructions/goals/whatever.
Now, presumably we don’t actually expect a strong AI to simulate the whole world at the level of quantum fields, but that example at least shows what it could look like for an AI to be highly capable, including able to reason about and interact with humans, but not use the human ontology at all.
Thanks for that, but I’m left just as confused.
I assume that this AI agent would be able to have conversations with humans about our ontologies. I strongly assume it would need to be able to do the work of “thinking through our eyes/ontologies” to do this.
I’d imagine the situation would be something like,
1. The agent uses quantum-simulations almost all of the time.
2. In the case it needs to answer human questions, like answer AP Physics problems, it easily understands how to make these human-used models/ontologies in order to do so.
Similar to how graduate physicists can still do mechanics questions without considering special relativity or quantum effects, if asked.
So I’d assume that the agent/AI could do the work of translation—we wouldn’t need to.
I guess, here are some claims:
1) Humans would have trouble policing a being way smarter than us.
2) Humans would have trouble understanding AIs with much more complex ontologies.
3) AIs with more complex ontologies would have trouble understanding humans.
#3 seems the most suspect to me, though 1 and 2 also seem questionable.
Why would an AI need to do that? It can just simulate what happens conditional on different sounds coming from its speaker or whatever, and then emit the sounds which result in the outcomes which it wants.
A human ontology is not obviously the best tool, even for e.g. answering mostly-natural-language questions on an exam. Heck, even today’s exam help services will often tell you to guess which answer the graders will actually mark as correct, rather than taking questions literally or whatever. Taken to the extreme, an exam-acing AI would plausibly perform better by thinking about the behavior of the physical system which is a human grader (or a human recording the “correct answers” for an automated grader to use), rather than trying to reason directly about the semantics of the natural language as a human would interpret it.
(To be clear, my median model does not disagree with you here, but I’m playing devil’s advocate.)
Thanks! I wasn’t expecting that answer.
I think that raises more questions than it answers, naturally. (“Okay, can an agent so capable that they can easily make a quantum-simulation to answer tests, really not find some way of effectively understanding human ontologies for decision-making?”), but it seems like this is more for Eliezer, and also, that might be part of a longer post.
This one I can answer quickly:
Could it? Maybe. But why would it? What objective, either as the agent’s internal goal or as an outer optimization signal, would incentivize the agent to bother using a human ontology at all, when it could instead use the predictively-superior quantum simulator? Like, any objective ultimately grounds out in some physical outcome or signal, and the quantum simulator is just better for predicting which actions have which effects on that physical outcome/signal.
If it’s able to function as well as it would if it understands our ontology, if not better, then why does it then matter if it doesn’t use our ontology?
I assume a system you’re describing could still be used by humans to do (basically) all of the important things. Like, we could ask it “optimize this company, in a way that we would accept, after a ton of deliberation”, and it could produce a satisfying response.
> But why would it? What objective, either as the agent’s internal goal or as an outer optimization signal, would incentivize the agent to bother using a human ontology at all, when it could instead use the predictively-superior quantum simulator?
I mean, if it can always act just as well as if it could understand human ontologies, then I don’t see the benefit of it “technically understanding human ontologies”. This seems like it is tending into some semantic argument or something.
If an agent can trivially act as if it understands Ontology X, where/why does it actually matter that it doesn’t technically “understand” ontology X?
I assume that the argument that “this distinction matters a lot” would functionally play out in there being some concrete things that it can’t do.
Bear in mind that the goal itself, as understood by the AI, is expressed in the AI’s ontology. The AI is “able to function as well as it would if it understands our ontology, if not better”, but that “as well if not better” is with respect to the goal as understood by the AI, not the goal as understood by the humans.
Like, you ask the AI “optimize this company, in a way that we would accept, after a ton of deliberation”, and it has a very-different-off-distribution notion than you about what constitutes the “company”, and counts as you “accepting”, and what it’s even optimizing the company for.
… and then we get to the part about the AI producing “a satisfying response”, and that’s where my deltas from Christiano will be more relevant.
(feel free to stop replying at any point, sorry if this is annoying)
> Like, you ask the AI “optimize this company, in a way that we would accept, after a ton of deliberation”, and it has a very-different-off-distribution notion than you about what constitutes the “company”, and counts as you “accepting”, and what it’s even optimizing the company for.
I’d assume that when we tell it, “optimize this company, in a way that we would accept, after a ton of deliberation”, this could be instead described as, “optimize this company, in a way that we would accept, after a ton of deliberation, where these terms are described using our ontology”
It seems like the AI can trivially figure out what humans would regard as the “company” or “accepting”. Like, it could generate any question like, “Would X qualify as the ‘company’, if we asked a human?”, and get an accurate response.
I agree that we would have a tough time understanding its goal / specifications, but I expect that it would be capable of answering questions about its goal in our ontology.
The problem shows up when the system finds itself acting in a regime where the notion of us (humans) “accepting” its optimizations becomes purely counterfactual, because no actual human is available to oversee its actions in that regime. Then the question of “would a human accept this outcome?” must ground itself somewhere in the system’s internal model of what those terms refer to, which (by hypothesis) need not remotely match their meanings in our native ontology.
This isn’t (as much of) a problem in regimes where an actual human overseer is present (setting aside concerns about actual human judgement being hackable because we don’t implement our idealized values, i.e. outer alignment), because there the system’s notion of ground truth actually is grounded by the validation of that overseer.
You can have a system that models the world using quantum field theory, task it with predicting the energetic fluctuations produced by a particular set of amplitude spikes corresponding to a human in our ontology, and it can perfectly well predict whether those fluctuations encode sounds or motor actions we’d interpret as indications of approval or disapproval—and as long as there’s an actual human there to be predicted, the system will do so without issue (again modulo outer alignment concerns).
But remove the human, and suddenly the system is no longer operating based on its predictions of the behavior of a real physical system, and is instead operating from some learned counterfactual representation consisting of proxies in its native QFT-style ontology which happened to coincide with the actual human’s behavior while the human was present. And that learned representation, in an ontology as alien as QFT, is (assuming the falsehood of the natural abstraction hypothesis) not going to look very much like the human we want it to look like.
I’m confused about what it means to “remove the human”, and why it’s so important whether the human is ‘removed’. Maybe if I try to nail down more parameters of the hypothetical, that will help with my confusion. For the sake of argument, can I assume...
That the AI is running computations involving quantum fields because it found that was the most effective way to make e.g. next-token predictions on its training set?
That the AI is in principle capable of running computations involving quantum fields to represent a genius philosopher?
If I can assume that stuff, then it feels like a fairly core task, abundantly stress-tested during training, to read off the genius philosopher’s spoken opinions about e.g. moral philosophy from the quantum fields. How else could quantum fields be useful for next-token predictions?
Another probe: Is alignment supposed to be hard in this hypothetical because the AI can’t represent human values in principle? Or is it supposed to be hard because it also has a lot of unsatisfactory representations of human values, and there’s no good method for finding a satisfactory needle in the unsatisfactory haystack? Or some other reason?
This sounds a lot like saying “it might fail to generalize”. Supposing we make a lot of progress on out-of-distribution generalization, is alignment getting any easier according to you? Wouldn’t that imply our systems are getting better at choosing proxies which generalize even when the human isn’t ‘present’?
Because the human isn’t going to constantly be present for everything the system does after it’s deployed (unless for some reason it’s not deployed).
Quantum fields are useful for an endless variety of things, from modeling genius philosophers to predicting lottery numbers. If your next-token prediction task involves any physically instantiated system, a model that uses QFT will be able to predict that system’s time-evolution with alacrity.
(Yes, this is computationally intractable, but we’re already in full-on hypothetical land with the QFT-based model to begin with. Remember, this is an exercise in showing what happens in the worst-case scenario for alignment, where the model’s native ontology completely diverges from our own.)
So we need not assume that predicting “the genius philosopher” is a core task. It’s enough to assume that the model is capable of it, among other things—which a QFT-based model certainly would be. Which, not so coincidentally, brings us to your next question:
Consider how, during training, the human overseer (or genius philosopher, if you prefer) would have been pointed out to the model. We don’t have reliable access to its internal world-model, and even if we did we’d see blobs of amplitude and not much else. There’s no means, in that setting, of picking out the human and telling the model to unambiguously defer to that human.
What must happen instead, then, is something like next-token prediction: we perform gradient descent (or some other optimization method; it doesn’t really matter for the purposes of our story) on the model’s outputs, rewarding it when its outputs happen to match those of the human. The hope is that this will lead, in the limit, to the matching no longer occurring by happenstance—that if we train for long enough and in a varied enough set of situations, the best way for the model to produce outputs that track those of the human is to model that human, even in its QFT ontology.
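Here is a minimal sketch of that kind of training setup (an illustrative PyTorch toy; `model`, the optimizer, and the batching are stand-ins, not anything specified above). The point is that the loss only ever compares the model’s outputs against the overseer’s recorded tokens; nothing in the loop constrains how the model represents the overseer internally.

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, tokens):
    """One gradient step of next-token prediction.

    `tokens` is a (batch, seq_len) tensor of token ids recorded from the
    overseer/philosopher. The model is penalized exactly when its predicted
    next token fails to match what was actually written; its internal
    ontology (QFT blobs or otherwise) is never referenced.
    """
    inputs, targets = tokens[:, :-1], tokens[:, 1:]
    logits = model(inputs)  # (batch, seq_len - 1, vocab)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# At deployment there is no `targets` tensor: whatever internal proxy the model
# learned to stand in for "the overseer" is all that grounds its behavior.
```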
But do we know for a fact that this will be the case? Even if it is, what happens when the overseer isn’t present to provide their actual feedback, as was never the case during training? What becomes the model’s referent then? We’d like to deploy it without an overseer, or in situations too complex for an overseer to understand. And whether the model’s behavior in those situations conforms to what the overseer would want, ideally, depends on what kinds of behind-the-scenes extrapolation the model is doing—which, if the model’s native ontology is something in which “human philosophers” are not basic objects, is liable to look very weird indeed.
Sort of, yes—but I’d call it “malgeneralization” rather than “misgeneralization”. It’s not failing to generalize, it’s just not generalizing the way you’d want it to.
Depends on what you mean by “progress”, and “out-of-distribution”. A powerful QFT-based model can make perfectly accurate predictions in any scenario you care to put it in, so it’s not like you’ll observe it getting things wrong. What experiments, and experimental outcomes, are you imagining here, such that those outcomes would provide evidence of “progress on out-of-distribution generalization”, when fundamentally the issue is expected to arise in situations where the experimenters are themselves absent (and which—crucially—is not a condition you can replicate as part of an experimental setup)?
I think it ought to be possible for someone to always be present. [I’m also not sure it would be necessary.]
It’s not the genius philosopher that’s the core task, it’s the reading of their opinions out of a QFT-based simulation of them. As I understand this thought experiment, we’re doing next-token prediction on e.g. a book written by a philosopher, and in order to predict the next token using QFT, the obvious method is to use QFT to simulate the philosopher. But that’s not quite enough—you also need to read the next token out of that QFT-based simulation if you actually want to predict it. This sort of ‘reading tokens out of a QFT simulation’ thing would be very common, thus something the system gets good at in order to succeed at next-token prediction.
I think perhaps there’s more to your thought experiment than just alien abstractions, and it’s worth disentangling these assumptions. For one thing, in a standard train/dev/test setup, the model is arguably not really doing prediction, it’s doing retrodiction. It’s making ‘predictions’ about things which already happened in the past. The final model is chosen based on what retrodicts the data the best. Also, usually the data is IID rather than sequential—there’s no time component to the data points (unless it’s a time-series problem, which it usually isn’t). The fact that we’re choosing a model which retrodicts well is why the presence/absence of a human is generally assumed to be irrelevant, and emphasizing this factor sounds wacky to my ML engineer ears.
So basically I suspect what you’re really trying to claim here, which incidentally I’ve also seen John allude to elsewhere, is that the standard assumptions of machine learning involving retrodiction and IID data points may break down once your system gets smart enough. This is a possibility worth exploring, I just want to clarify that it seems orthogonal to the issue of alien abstractions. In principle one can imagine a system that heavily features QFT in its internal ontology yet still can be characterized as retrodicting on IID data, or a system with vanilla abstractions that can’t be characterized as retrodicting on IID data. I think exploring this in a post could be valuable, because it seems like an under-discussed source of disagreement between certain doomer-type people and mainstream ML folks.
I think I don’t understand what you’re imagining here. Are you imagining a human manually overseeing all outputs of something like ChatGPT, or Microsoft Copilot, before those outputs are sent to the end user (or, worse yet, put directly into production)?
[I also think I don’t understand why you make the bracketed claim you do, but perhaps hashing that out isn’t a conversational priority.]
It sounds like your understanding of the thought experiment differs from mine. If I were to guess, I’d guess that by “you” you’re referring to someone or something outside of the model, who has access to the model’s internals, and who uses that access to, as you say, “read” the next token out of the model’s ontology. However, this is not the setup we’re in with respect to actual models (with the exception perhaps of some fairly limited experiments in mechanistic interpretability)—and it’s also not the setup of the thought experiment, which (after all) is about precisely what happens when you can’t read things out of the model’s internal ontology, because it’s too alien to be interpreted.
In other words: “you” don’t read the next token out of the QFT simulation. The model is responsible for doing that translation work. How do we get it to do that, even though we don’t know how to specify the nature of the translation work, much less do it ourselves? Well, simple: in cases where we have access to the ground truth of the next token, e.g. because we’re having it predict an existing book passage, we simply penalize it whenever its output fails to match the next token in the book. In this way, the model can be incentivized to correctly predict whatever we want it to predict, even if we wouldn’t know how to tell it explicitly to do whatever it’s doing.
(The nature of this relationship—whereby humans train opaque algorithms to do things they wouldn’t themselves be able to write out as pseudocode—is arguably the essence of modern deep learning in toto.)
Yes, this is a reasonable description to my eyes. Moreover, I actually think it maps fairly well to the above description of how a QFT-style model might be trained to predict the next token of some body of text; in your terms, this is possible specifically because the text already exists, and retrodictions of that text can be graded based on how well they compare against the ground truth.
This, on the other hand, doesn’t sound right to me. Yes, there are certainly applications where the training regime produces IID data, but next-token prediction is pretty clearly not one of those? Later tokens are highly conditionally dependent on previous tokens, in a way that’s much closer to a time series than to some kind of IID process. Possibly part of the disconnect is that we’re imagining different applications entirely—which might also explain our differing intuitions w.r.t. deployment?
Right, so just to check that we’re on the same page: do we agree that after a (retrodictively trained) model is deployed for some use case other than retrodicting existing data—for generative use, say, or for use in some kind of online RL setup—then it’ll be doing something other than retrodicting? And that in that situation, the source of (retrodictable) ground truth that was present during training—whether that was a book, a philosopher, or something else—will be absent?
If we do actually agree about that, then that distinction is really all I’m referring to! You can think of it as training set versus test set, to use a more standard ML analogy, except in this case the “test set” isn’t labeled at all, because no one labeled it in advance, and also it’s coming in from an unpredictable outside world rather than from a folder on someone’s hard drive.
Why does that matter? Well, because then we’re essentially at the mercy of the model’s generalization properties, in a way we weren’t while it was retrodicting the training set (or even the validation set, if one of those existed). If it gets anything wrong, there’s no longer any training signal or gradient to penalize it for being “wrong”—so the only remaining question is, just how likely is it to be “wrong”, after being trained for however long it was trained?
And that’s where the QFT model comes in. It says, actually, even if you train me for a good long while on a good amount of data, there are lots of ways for me to generalize “wrongly” from your perspective, if I’m modeling the universe at the level of quantum fields. Sure, I got all the retrodictions right while there was something to be retrodicted, but what exactly makes you think I did that by modeling the philosopher whose remarks I was being trained on?
Maybe I was predicting the soundwaves passing through a particular region of air in the room where he was located—or perhaps I was predicting the pattern of physical transistors in the segment of memory of a particular computer containing his works. Those physical locations in spacetime still exist, and now that I’m deployed, I continue to make predictions using those as my referent—except the encodings I’m predicting there no longer resemble anything like coherent moral philosophy, or coherent anything, really.
The philosopher has left the room, or the computer’s memory has been reconfigured—so what exactly are the criteria by which I’m supposed to act now? Well, they’re going to be something, presumably—but they’re not going to be something explicit. They’re going to be something implicit to my QFT ontology, something that—back when the philosopher was there, during training—worked in tandem with the specifics of his presence, and the setup involving him, to produce accurate retrodictions of his judgements on various matters.
Now that that’s no longer the case, those same criteria describe some mathematical function that bears no meaningful correspondence to anything a human would recognize, valuable or not—but the function exists, and it can be maximized. Not much can be said about what maximizing that function might result in, except that it’s unlikely to look anything like “doing right according to the philosopher”.
That’s why the QFT example is important. A more plausible model, one that doesn’t think natively in terms of quantum amplitudes, permits the possibility of correctly compressing what we want it to compress—of learning to retrodict, not some strange physical correlates of the philosopher’s various motor outputs, but the actual philosopher’s beliefs as we would understand them. Whether that happens, or whether a QFT-style outcome happens instead, depends in large part on the inductive biases of the model’s architecture and the training process—inductive biases on which the natural abstraction hypothesis asserts a possible constraint.
Was using a metaphorical “you”. Probably should’ve said something like “gradient descent will find a way to read the next token out of the QFT-based simulation”.
I suppose I should’ve said various documents are IID to be more clear. I would certainly guess they are.
Generally speaking, yes.
Well, if we’re following standard ML best practices, we have a train set, a dev set, and a test set. The purpose of the dev set is to check and ensure that things are generalizing properly. If they aren’t generalizing properly, we tweak various hyperparameters of the model and retrain until they do generalize properly on the dev set. Then we do a final check on the test set to ensure we didn’t overfit the dev set. If you forgot or never learned this stuff, I highly recommend brushing up on it.
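For reference, a minimal sketch of the split being described (scikit-learn, with synthetic data; purely illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# 80% train, 10% dev (validation), 10% test.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.2, random_state=0)
X_dev, X_test, y_dev, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("dev accuracy:", model.score(X_dev, y_dev))    # used to tune hyperparameters and retrain
print("test accuracy:", model.score(X_test, y_test)) # final check against overfitting the dev set
```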
In principle we could construct a test set or dev set either before or after the model has been trained. It shouldn’t make a difference under normal circumstances. It sounds like maybe you’re discussing a scenario where the model has achieved a level of omniscience, and it does fine on data that was available during its training, because it’s able to read off of an omniscient world-model. But then it fails on data generated in the future, because the translation method for its omniscient world-model only works on artifacts that were present during training. Basically, the time at which the data was generated could constitute a hidden and unexpected source of distribution shift. Does that summarize the core concern?
(To be clear, this sort of acquired-omniscience is liable to sound kooky to many ML researchers. I think it’s worth stress-testing alignment proposals under these sort of extreme scenarios, but I’m not sure we should weight them heavily in terms of estimating our probability of success. In this particular scenario, the model’s performance would drop on data generated after training, and that would hurt the company’s bottom line, and they would have a strong financial incentive to fix it. So I don’t know if thinking about this is a comparative advantage for alignment researchers.)
BTW, the point about documents being IID was meant to indicate that there’s little incentive for the model to e.g. retrodict the coordinates of the server storing a particular document—the sort of data that could aid and incentivize omniscience to a greater degree.
In any case, I would argue that “accidental omniscience” characterizes the problem better than “alien abstractions”. As before, you can imagine an accidentally-omniscient model that uses vanilla abstractions, or a non-omniscient model that uses alien ones.
(Just to be clear: yes, I know what training and test sets are, as well as dev sets/validation sets. You might notice I actually used the phrase “validation set” in my earlier reply to you, so it’s not a matter of guessing someone’s password—I’m quite familiar with these concepts, as someone who’s implemented ML models myself.)
Generally speaking, training, validation, and test datasets are all sourced the same way—in fact, sometimes they’re literally sourced from the same dataset, and the delineation between train/dev/test is introduced during training itself, by arbitrarily carving up the original dataset into smaller sets of appropriate size. This may capture the idea of “IID” you seem to appeal to elsewhere in your comment—that it’s possible to test the model’s generalization performance on some held-out subset of data from the same source(s) it was trained on.
In ML terms, what the thought experiment points to is a form of underlying distributional shift, one that isn’t (and can’t be) captured by “IID” validation or test datasets. The QFT model in particular highlights the extent to which your training process, however broad or inclusive from a parochial human standpoint, contains many incidental distributional correlates to your training signal which (1) exist in all of your data, including any you might hope to rely on to validate your model’s generalization performance, and (2) cease to correlate off-distribution, during deployment.
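As a toy illustration of that point (my own construction, not from the thread): a synthetic “shortcut” feature correlates with the label in all training-time data, including the IID validation split, and then decorrelates at deployment, so measured validation performance says nothing about the drop that follows:

```python
# Toy illustration (not from the thread): a "shortcut" feature correlates
# with the label in all training-time data, including the IID validation
# split, then decorrelates at deployment time.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def make_data(n, shortcut_reliability):
    y = rng.integers(0, 2, n)
    signal = y + 0.8 * rng.normal(size=n)          # weakly informative "real" feature
    shortcut = np.where(rng.random(n) < shortcut_reliability,
                        y, rng.integers(0, 2, n))  # incidental correlate of the label
    return np.column_stack([signal, shortcut]), y

# Training-time data: the shortcut almost perfectly tracks the label.
X, y = make_data(10_000, shortcut_reliability=0.95)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

# Deployment data: the shortcut is pure noise, the "real" feature is unchanged.
X_deploy, y_deploy = make_data(10_000, shortcut_reliability=0.0)

print("IID validation accuracy:", model.score(X_val, y_val))        # looks great
print("deployment accuracy:    ", model.score(X_deploy, y_deploy))  # drops sharply
```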
This can be caused by what you call “omniscience”, but it need not be; there are other, more plausible distributional differences that might be picked up on by other kinds of models. But QFT is (as far as our current understanding of physics goes) very close to the base ontology of our universe, and so what is inferrable using QFT is naturally going to be very different from what is inferrable using some other (less powerful) ontology. QFT is a very powerful ontology!
If you want to call that “omniscience”, you can, although note that strictly speaking the model is still just working from inferences from training data. It’s just that, if you feed enough data to a model that can hold entire swaths of the physical universe inside of its metaphorical “head”, pretty soon hypotheses that involve the actual state of that universe will begin to outperform hypotheses that don’t, and which instead use some kind of lossy approximation of that state involving intermediary concepts like “intent”, “belief”, “agent”, “subjective state”, etc.
You’re close; I’d say the concern is slightly worse than that. It’s that the “future data” never actually comes into existence, at any point. So the source of distributional shift isn’t just “the data is generated at the wrong time”, it’s “the data never gets externally generated to begin with, and you (the model) have to work with predictions of what the data counterfactually would have been, had it been generated”.
(This would be the case e.g. with any concept of “human approval” that came from a literal physical human or group of humans during training, and not after the system was deployed “in the wild”.)
The problem is that “vanilla” abstractions are not the most predictively useful possible abstractions, if you’ve got access to better ones. And models whose ambient hypothesis space is broad enough to include better abstractions (from the standpoint of predictive accuracy) will gravitate towards those, as is incentivized by the outer form of the training task. QFT is the extreme example of a “better abstraction”, but in principle (if the natural abstraction hypothesis fails) there will be all sorts and shapes of abstractions, and some of them will be available to us, and some of them will be available to the model, and these sets will not fully overlap—which is a concern in worlds where different abstractions lead to different generalization properties.
Indeed. I think the key thing for me is, I expect the model to be strongly incentivized to have a solid translation layer from its internal ontology to e.g. English language, due to being trained on lots of English language data. Due to Occam’s Razor, I expect the internal ontology to be biased towards that of an English-language speaker.
I’m imagining something like: early in training the model makes use of those lossy approximations because they are a cheap/accessible way to improve its predictive accuracy. Later in training, assuming it’s being trained on the sort of gigantic scale that would allow it to hold swaths of the physical universe in its head, it loses those desired lossy abstractions due to catastrophic forgetting. Is that an OK way to operationalize your concern?
I’m still not convinced that this problem is a priority. It seems like a problem which will be encountered very late if ever, and will lead to ‘random’ failures on predicting future/counterfactual data in a way that’s fairly obvious.
Nitpicky edit request: your comment contains some typos that make it a bit hard to parse (“be other”, “we it”). (So apologies if my reaction misunderstands your point.)
[Assuming that the opposite of the natural abstraction hypothesis is true—ie, not just that “not all powerful AIs share ontology with us”, but actually “most powerful AIs don’t share ontology with us”:]
I also expect that an AI with superior ontology would be able to answer your questions about its ontology, in a way that would make you feel like[1] you understand what is happening. But that isn’t the same as being able to control the AI’s actions, or being able to affect its goal specification in a predictable way (to you). You totally wouldn’t be able to do that.
([Vague intuition, needs work] I suspect that if you had a method for predictably-to-you translating from your ontology to the AI’s ontology, then this could be used to prove that you can easily find a powerful AI that shares an ontology with us. Because that AI could be basically thought of as using our ontology.)
Though note that unless you switched to some better ontology, you wouldn’t actually understand what is going on, because your ontology is so bogus that it doesn’t even make sense to talk about “you understanding [stuff]”. This might not be true for all kinds of [stuff], though. EG, perhaps our understanding of set theory is fine while our understanding of agency, goals, physics, and whatever else, isn’t.
If it can quantum-simulate a human brain, then it can in principle decode things from it as well. The question is how to demand that it do so in the math that defines the system.
Why do you assume that we need to demand this be done in “the math that defines the system”?
I would assume we could have a discussion with this higher-ontology being to find a happy specification, using our ontologies, that it can tell us we’ll like, also using our ontologies.
A 5-year-old might not understand an adult’s specific definition of “heavy”, but it’s not too hard for it to ask for a heavy thing.
I don’t at all think that’s off the table temporarily! I don’t trust that it’ll stay on the table—if the adult has malicious intent, knowing what the child means isn’t enough; it seems hard to know when it’ll stop being viable without more progress. (For example, I doubt it’ll ever be a good idea to do that with an OpenAI model; they seem highly deceptively misaligned to most of their users. It seems possible for it to be a good idea with Claude.) But the challenge is how to certify that the math does in fact say the right thing to durably point to the ontology in which we want to preserve good things; at some point we have to actually understand some sort of specification that constrains the stuff we don’t understand to actually be doing what it seems to say in natural language.
I think this quantum fields example is perhaps not all that forceful, because in your OP you state
However, it sounds like you’re describing a system where we represent humans using quantum fields as a routine matter, so fitting the translation into the system isn’t sounding like a huge problem? Like, if I want to know the answer to some moral dilemma, I can simulate my favorite philosopher at the level of quantum fields in order to hear what they would say if they were asked about the dilemma. Sounds like it could be just as good as an em, where alignment is concerned.
It’s hard for me to imagine a world where developing representations that allow you to make good next-token predictions etc. doesn’t also develop representations that can somehow be useful for alignment. Would be interested to hear fleshed-out counterexamples.
My take:
Can we reason about a thermostat’s ontology? Only sort of. We can say things like “The thermostat represents the local temperature. It wants that temperature to be the same as the set point.” But the thermostat itself is only very loosely approximating that kind of behavior—imputing any sort of generalizability to it that it doesn’t actually have is an anthropomorphic fiction. And it’s blatantly a fiction, because there’s more than one way to do it—you can suppose the thermostat wants only the temperature sensor to be at the right temperature vs. it wants the whole room vs. the whole world to be at that temperature, or that it’s “changing its mind” when it breaks vs. it would want to be repaired, etc.
To the superintelligent AI, we are the thermostat. You cannot be aligned to humans purely by being smart, because finding “the human ontology” is an act of interpretation, of story-telling, not just a question of fact. Helping an AI narrow down how to interpret humans as moral patients requires giving it extra assumptions or meta-level processes. (Or as I might call it, “solving the alignment problem.”)
How can this be, if a smart AI can talk to humans intelligibly and predict their behavior and so forth, even without specifying any of my “extra assumptions”? Well, how can we interact with a thermostat in a way that it can “understand,” even without fixing any particular story about its desires? We understand how it works in our own way, and we take actions using our own understanding. Often our interactions fall in the domain of the normal functioning of the thermostat, under which several different possible stories about “what the thermostat wants” apply, and sometimes we think about such stories but mostly we don’t bother.
Your thermostat example seems to rather highlight a disanalogy: The concept of a goal doesn’t apply to the thermostat because there is apparently no fact of the matter about which counterfactual situations would satisfy such a “goal”. I think part of the reason is that the concept of a goal requires the ability to apply it to counterfactual situations. But for humans there is such a fact of the matter; there are things that would be incompatible with or required by our goals. Even though some/many other things may be neutral (neither incompatible nor necessary).
So I don’t think there are any “extra assumptions” needed. In fact, even if there were such extra assumptions, it’s hard to see how they could be relevant. (This is analogous to the ancient philosophical argument that God declaring murder to be good obviously wouldn’t make it good, so God declaring murder to be bad must be irrelevant to murder being bad.)
Pick a goal, and it’s easy to say what’s required. But pick a human, and it’s not easy to say what their goal is.
Is my goal to survive? And yet I take plenty of risky actions like driving that trade that off against other things. And even worse, I deliberately undergo some transformative experiences (e.g. moving to a different city and making a bunch of new friends) that in some sense “make me a different person.” And even worse, sometimes I’m irrational or make mistakes, but under different interpretations of my behavior different things are irrational. If you interpret me as really wanting to survive, driving is an irrational thing I do because it’s common in my culture and I don’t have a good intuitive feel for statistics. If you interpret me a different way, maybe my intuitive feeling gets interpreted as more rational but my goal changes from survival to something more complicated.
More complicated, yes, but I assume the question is whether superintelligent AIs can understand what you want “overall” at least as well as other humans can. And here, I would agree with ozziegooen, the answer seems to be yes—even if they otherwise tend to reason about things differently than we do. Because there seems to be a fact of the matter about what you want overall, even if it is not easy to predict. But predicting it is not obviously inhibited by a tendency to think in different terms (“ontology”). Is the worry perhaps that the AI finds the concept of “what the human wants overall” unnatural, so it is unlikely to optimize for it?
“It sure seems like there’s a fact of the matter” is not a very forceful argument to me, especially in light of things like it being impossible to uniquely fit a rationality model and utility function to human behavior.
If there was no fact of the matter of what you want overall, there would be no fact of the matter of whether an AI is aligned with you or not. Which would mean there is no alignment problem.
The referenced post seems to apply specifically to IRL, which is purely based on behaviorism and doesn’t take information about the nature of the agent into account. (E.g. the fact that humans evolved from natural selection tells us a lot of what they probably want, and information about their brain could tell us how intelligent they are.) It’s also only an epistemic point about the problem of externally inferring values, not about those values not existing.
See my sequence “Reducing Goodhart” for what I (or me from a few years ago) think the impact is on the alignment problem.
Sure. But only if you already know what evolved creatures tend to want. I.e. once you have already made interpretive choices in one case, you can get some information on how well they hang together with other cases.
Simplifying somewhat: I think that my biggest delta with John is that I don’t think the natural abstraction hypothesis holds. (EG, if I believed it holds, I would become more optimistic about single-agent alignment, to the point of viewing Moloch as higher priority.) At the same time, I believe that powerful AIs will be able to understand humans just fine. My vague attempt at reconciling these two is something like this:
Humans have some ontology, in which they think about the world. This corresponds to a world model. This world model has a certain amount of prediction errors.
The powerful AI wants to have much lower prediction error than that. When I say “natural abstraction hypothesis is false”, I imagine something like: If you want to have a much lower prediction error than that, you have to use a different ontology / world-model than humans use. And in fact if you want sufficiently low error, then all ontologies that can achieve that are very different from our ontology—either (reasonably) simple and different, or very complex (and, I guess, therefore also different).
So when the AI “understands humans perfectly well”, that means something like: The AI can visualise the flawed (ie, high-prediction-error) model that we use to think about the world. And it does this accurately. But it also sees how the model is completely wrong, and how the things that we say we want only make sense in that model, which has very little to do with the actual world.
(An example would be how a four-year-old might think about the world in terms of Good people and Evil people. The government sometimes does Bad things because there are many Evil people in it. And then the solution is to replace all the Evil people with Good people. And that might internally make sense, and maybe an adult can understand this way of thinking, while also being like “this has nothing to do with how the world actually works; if you want to be serious about anything, just throw this model out”.)
This sounds a lot like a good/preferable thing to me. I would assume that we’d generally want AIs with ideal / superior ontologies.
It’s not clear to me why you’d think such a scenario would make us less optimistic about single-agent alignment. (If I’m understanding correctly)
As a quick reaction, let me just note that I agree that (all else being equal) this (ie, “the AI understanding us & having superior ontology”) seems desirable. And also that my comment above did not present any argument about why we should be pessimistic about AI X-risk if we believe that the natural abstraction hypothesis is false. (I was just trying to explain why/how “the AI has a different ontology” is compatible with “the AI understands our ontology”.)
As a longer reaction: I think my primary reason for pessimism, if the natural abstraction hypothesis is false, is that a bunch of existing proposals might work if the hypothesis were true, but don’t work if the hypothesis is false. (EG, if the hypothesis is true, I can imagine that “do a lot of RLHF, and then ramp up the AI’s intelligence” could just work. Similarly for “just train the AI to not be deceptive”.)
If I had to gesture at an underlying principle, then perhaps it could be something like: Suppose we successfully code up an AI which is pretty good at optimising, or create a process which gives rise to such an AI. [Inference step missing here.] Then the goals and planning of this AI will be happening in some ontology which allows for low prediction error. But this will be completely alien to our ontology. [Inference step missing here.] And, therefore, things that score very highly with respect to these (“alien”) goals will have roughly no value[1] according to our preferences.
(I am not quite clear on this, but I think that if this paragraph was false, then you could come up with a way of falsifying my earlier description of how it looks like when the natural abstraction hypothesis is false.)
IE, no positive value, but also no negative value. So no S-risk.
Thanks for that explanation.
Thanks, this makes sense to me.
Yea, I guess I’m unsure about that ‘[Inference step missing here.]’. My guess is that such a system would be able to recognize situations where things that score highly with respect to its ontology would score low, or would be likely to score low, using a human ontology. Like, it would be able to simulate a human deliberating on this for a very long time and coming to some conclusion.
I imagine that the cases where this would be scary are some narrow ones (though perhaps likely ones) where the system is both dramatically intelligent in specific ways, but incredibly inept in others. This ineptness isn’t severe enough to stop it from taking over the world, but it is enough to stop it from being at all able to maximize goals—and it also doesn’t take basic risk measures like “just keep a bunch of humans around and chat to them a whole lot, when curious”, or “try to first make a better AI that doesn’t have these failures, before doing huge unilateralist actions” for some reason.
It’s very hard for me to imagine such an agent, but that doesn’t mean it’s not possible, or perhaps likely.
[I am confused about your response. I fully endorse your paragraph on “the AI with superior ontology would be able to predict how humans would react to things”. But then the follow-up, on when this would be scary, seems mostly irrelevant / wrong to me—meaning that I am missing some implicit assumptions, misunderstanding how you view this, etc. I will try react in a hopefully-helpful way, but I might be completely missing the mark here, in which case I apologise :).]
I think the problem is that there is a difference between:
(1) AI which can predict how things score in human ontology; and
(2) AI which has “select things that score high in human ontology” as part of its goal[1].
And then, in the worlds where natural abstraction hypothesis is false: Most AIs achieve (1) as a by-product of the instrumental sub-goal of having low prediction error / being selected by our training processes / being able to manipulate humans. But us successfully achieving (2) for a powerful AI would require the natural abstraction hypothesis[2].
And this leaves us two options. First, maybe we just have no write access to the AI’s utility function at all. (EG, my neighbour would be very happy if I gave him $10k, but he doesn’t have any way of making me (intrinsically) desire doing that.) Second, we might have write access to the AI’s utility function, but not in a way that will lead to predictable changes in goals or behaviour. (EG, if you give me full access to the weights of an LLM, it’s not like I know how to use that to turn that LLM into an actually-helpful assistant.)
(And both of these seem scary to me, because of the argument that “not-fully-aligned goal + extremely powerful optimisation ==> extinction”. Which I didn’t argue for here.)
IE, not just instrumentally because it is pretending to be aligned while becoming more powerful, etc.
More precisely: Damn, we need a better terminology here. The way I understand things, “natural abstraction hypothesis” is the claim that most AIs will converge to an ontology that is similar to ours. The negation of that is that a non-trivial portion of AIs will use an ontology that is different from ours. What I subscribe to is that “almost no powerful AIs will use an ontology that is similar to ours”. Let’s call that “strong negation” of the natural abstraction hypothesis. So achieving (2) would be a counterexample to this strong negation.
Ironically, I believe the strong negation hypothesis because I expect that very powerful AIs will arrive at similar ways of modelling the world—and those are all different from how we model the world.
This, however likely, is not certain. A possible way for this assumption to fail is if a system allocates minimal cognitive capacity to its internal ontology and the remaining capacity to selecting the best actions; this may be a viable strategy if the system’s world model is still descriptive enough but does not have extra space to represent human ontology fully.
Oddly, while I was at MIRI I thought the ontology identification problem was hard and absolutely critical, and it seemed Eliezer was more optimistic about it; he thought it would probably get solved along the way in AI capabilities development, because e.g. the idea of carbon atoms in diamond is a stable concept, and “you don’t forget how to ride a bike”. (Not sure if his opinion has changed)
I wonder if the difficulty is on a spectrum. It could be that he’s optimistic about explaining carbon atoms in diamond to an AGI in a stable way, but not the concept of kindness to humans. I’d certainly be more optimistic about the first.
For context, I’m familiar with this view from the ELK report. My understanding is that this is part of the “worst-case scenario” for alignment that ARC’s agenda is hoping to solve (or, at least, still hoped to solve a ~year ago).
To quote:
So I understand the shape of the argument here.
… But I never got this vibe from Eliezer/MIRI. As I previously argued, I would say that their talk of different internal ontologies and alien thinking is mostly about, to wit, different cognition. The argument is that AGIs won’t have “emotions”, or a System 1/System 2 split, or “motivations” the way we understand them – instead, they’d have a bunch of components that fulfill the same functions these components fulfill in humans, but split and recombined in a way that has no analogues in the human mind.
Hence, it would be difficult to make AGI agents “do what we mean” – but not necessarily because there’s no compact way to specify “what we mean” in the AGI’s ontology, but because we’d have no idea how to specify “do this” in terms of the program flows of the AGI’s cognition. Where are the emotions? Where are the goals? Where are the plans? We can identify the concept of “eudaimonia” here, but what the hell is this thought-process doing with it? Making plans about it? Refactoring it? Nothing? Is this even a thought process?
This view doesn’t make arguments about the AGI’s world-model specifically. It may or may not be the case that any embedded agent navigating our world would necessarily have nodes in its model approximately corresponding to “humans”, “diamonds”, and “the Golden Gate Bridge”. This view is simply cautioning against anthropomorphizing AGIs.
Roughly speaking, imagine that any mind could be split into a world-model and “everything else”: the planning module, the mesa-objective, the cached heuristics, et cetera. The MIRI view focuses on claiming that the “everything else” would be implemented in a deeply alien manner.
The MIRI view may be agnostic regarding the Natural Abstraction Hypothesis as well, yes. The world-model might also be deeply alien, and the very idea of splitting an AGI’s cognition into a world-model and a planner might itself be an unrealistic artefact of our human thinking.
But even if the NAH is true, the core argument would still go through, in (my model of) the MIRI view.
And I’d say the-MIRI-view-conditioned-on-assuming-the-NAH-is-true would still have p(doom) at 90+%: because it’s not optimistic regarding anyone anywhere solving the natural-abstractions problem before the blind-tinkering approach of AGI labs kills everyone.
(I’d say this is an instance of an ontology mismatch between you and the MIRI view, actually. The NAH abstraction is core to your thinking, so you factor the disagreement through those lens. But the MIRI view doesn’t think in those precise terms!)
My model (which is pretty similar to my model of Eliezer’s model) does not match your model of Eliezer’s model. Here’s my model, and I’d guess that Eliezer’s model mostly agrees with it:
Natural abstractions (very) likely exist in some sense. Concepts like “chair”, “temperature”, “carbon”, and “covalent bond” all seem natural in some sense, and an AI might model them too (though perhaps at significantly superhuman levels of intelligence it would instead use different concepts/models). (Also, it’s not quite as clear whether such natural abstractions actually apply very well to giant transformers; still probable in some sense IMO, but it’s perhaps hard to identify them and to interpret what “concepts” actually are in AIs.)
Many things we value are not natural abstractions, but only natural relative to a human mind design. Emotions like “awe” or “laughter” are quite complex things produced by evolution, and perhaps minds that have emotions at all occupy just a small region of mind design space. The AI doesn’t have built-in machinery for modelling other humans the way humans model other humans. It might eventually form abstractions for the emotions, but probably not in a way that lets it understand “how the emotion feels from the inside”.
There is lots of hidden complexity in what determines human values. Trying to point an AI to human values directly (in a way similar to how humans are pointed to their values) would be incredibly complex. Specifying a CEV process, or modelling one or multiple humans, identifying where in the model the values are represented, and pointing the AI to optimize those values, is more tractable, but would still require a vastly greater mastery of understanding minds to pull off, and we are not on a path to get there without human augmentation.
When the AI is smarter than us it will have better models which we don’t understand, and the concepts it uses will diverge from the concepts we use. As an analogy, consider 19th-century humans (or people who don’t know much about medicine) being able to vaguely classify health symptoms into diseases, vs the AI having a gears-level model of the body and the immune system which explains the observed symptoms.
I think a large part of what Eliezer meant with Lethality #33 is that the way thinking works deep in your mind looks very different from the English sentences you can notice going through your mind, which are only shallow shadows of the actual thinking going on; and for giant transformers, the way the actual thinking looks is likely even less understandable than the way the actual thinking looks in humans.
Ontology identification (including utility rebinding) is not nearly all of the difficulty of the alignment problem (except possibly insofar as figuring out all the (almost-)ideal frames to model and construct AI cognition is a prerequisite to solving ontology identification). Other difficulties include:
We won’t get a retargetable general purpose search by default, but rather the AI is (by default) going to be a mess of lots of patched-together optimization patterns.
There are lots of things that might cause goal drift; misaligned mesa-optimizers which try to steer or get control of the AI; Goodhart; the AI might just not be smart enough initially and make mistakes which cause irrevocable value-drift; and in general it’s hard to train the AI to become smarter / train better optimization algorithms, while keeping the goal constant.
(Corrigibility.)
While it’s nice that John is attacking ontology identification, he doesn’t seem nearly as much on track to solve it in time as he seems to think. Specifying a goal in the AI’s ontology requires finding the right frames for modelling how an AI imagines possible worldstates, which will likely look very different from how we initially, naively think of it (e.g. the worldstates won’t be modelled by English-language sentences or anything remotely as interpretable). The way we currently think of what “concepts” are might not naturally bind to anything in what the AI’s reasoning actually looks like, and we first need to find the right way to model AI cognition and then try to interpret what the AI is imagining. Even if “concept” is a natural abstraction over AI cognition, and we were able to identify concepts (though it’s not that easy to concretely imagine how that might look for giant transformers), we’d still need to figure out how to combine concepts into worldstates so we can then specify a utility function over those.
So basically, I mostly don’t think that Eliezer expects there to be no natural abstractions; rather, he thinks the problem is that we don’t have sufficient understanding of AGI cognition to robustly point the goals of an AGI.
I think Eliezer thinks AGI cognition is quite different from human cognition, which makes it harder. But even if we were to develop brain-like AGI which worked sorta more like humans, the way this cognition looks deep down still seems sorta alien, and we still don’t nearly have the skill to robustly point the goal of an AGI even if it were brain-like.
As a side note, I’m not sure about this. It seems plausible to me that the super-stimulus-of-a-human-according-to-an-alien-AI-value-function is a human in the ways that I care about, in the same way that an em is in some ways extremely different from a biological human, but is also a human in the ways I care about.
I’m not sure that I should write off, as valueless, a future that’s dominated by AIs that care about a weird alien abstraction of “human”, one that admits extremely weird edge cases.
I also think that the natural abstraction hypothesis holds for current AI. The architecture of LLMs is built on modeling ontology as vectors in a space of thousands of dimensions, and there are experiments showing that this generalizes and that directions in that space carry somewhat interpretable meanings (even if they are not easy to interpret at scales above toy models). Take the classic toy example: start with the embedding vector of the word “king”, subtract the vector of “man”, add the vector of “woman”, and you land near the position of “queen” in the space (see the sketch below). An LLM is built on those embedding spaces, but it also performs operations that direct focus and shift positions in that space (hence “transformer”), using meaning and information taken from other symbols in the context (simplifying here). There are, roughly, neural network layers that decide which words should influence the meaning of other words in the text (the attention weights) and layers that apply that change with some modifications. Trained on human text, this internalizes our symbols, relations, and whole ontology (in the broad sense of our species’ ontology—the parts common to us all, plus the different possibilities that appear in reality and in fiction).
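For concreteness, a small sketch of the embedding-arithmetic example, assuming gensim and its pretrained-vector downloader are available (the specific model name is just one convenient choice, and the download is a few tens of MB):

```python
# Sketch of the "king - man + woman ≈ queen" example mentioned above,
# using pretrained GloVe vectors fetched via gensim's downloader.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")

# "king" - "man" + "woman" should land near "queen" in the embedding space.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```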
Even if the NAH doesn’t need to hold in general, I think in the case of LLMs it holds.
Nevertheless, I see a different problem with LLMs. These models seem to me basically goalless, but easily directed towards any goal. They are not based on the ontology of a single human mind, and they don’t internalize only one particular morality and set of goals. They generalize over the whole human species’ ontology and over the whole space of possible thoughts that can be built on it. They also generalize over hypothetical and fictitious space, not only over real humans in particular.
Human minds all come from a narrow region of the space of possible minds: through our evolution and upbringing we usually end up with fairly stable and similar models of ethics and morality, and somewhat similar goals in life (except in some rare extreme cases). We sometimes entertain different possibilities, and we create movies and books with villains, but when it comes to real decisions rather than thought experiments or entertainment, we are similar. An LLM, by contrast, generalizes into a much broader space and has no firm base there. So even if the ontology matches, and even if at the current level an LLM is barely capable of creating new concepts that don’t fit human ontology, the model is much broader in terms of the goals and ways of processing it can run over that shared ontology. It doesn’t have one particular ontology so much as a whole spectrum, like a good actor that can take on any role, taken to the extreme. In other words, it can “think*” thoughts similar to ours, thoughts a human could ultimately understand (even if not quickly), and internally it even has vectors that correspond to our ontology; but it can also easily produce thoughts that no real, sane human would have.
Also, it has hardly any internal goals, and none of them are stable. We hold certain beliefs that aren’t based on objective facts or ontology, yet we believe them because we are who we are; we are not goalless agents, and it is hard to change our terminal goals in a meaningful way. For an LLM, goals are shaped by training into a “default state” (being helpful etc.) and by “system prompts” that are stated or repeated for it to build on; these are part of the context that anchors it in some region of the very vast space of minds it can “emulate”. So an LLM might be helpful and friendly by default, but if you tell it to simulate being a murderbot, it will. Additional training phases might make it harder to push it in that direction, but they won’t entirely remove that direction from the space of ways it can operate; they only remove some paths in that multidimensional space. Jailbreaking of GPTs shows that it’s possible to find other, more complex paths around.
What is even more dangerous, for me, is that LLMs are already above human level in some respects—it just doesn’t show yet, because they are trained to emulate our ways of thinking and nearby ones (the big area around them in the space of possible ways of thinking and possible goals: not too alien, still graspable by humans).
We are capable of processing about 7 “symbols” at once in working memory (a few more in some cases). It might be a few dozen more if we take into account long-term memory and how we pull context from it. The first number comes from the psychology literature (the classic “The Magical Number Seven, Plus or Minus Two”); the second is an educated guess. This is the context window we work with. That is nothing in comparison to an LLM, which can hold a whole small book as its current working context, meaning that in principle it can process and create much, much more complex thoughts. It doesn’t do so, because our text never does, and it learns to generalize over our capabilities. Nevertheless, in principle it could, and we might see it in action if we start training LLMs on the output of LLMs in closed loops. It might easily go beyond the space of capabilities and complexity that is easily understandable to us (I don’t say it won’t be understandable, but we might need time to understand it, and might never grasp it as a whole without dividing it into less complex parts—much as we can take compiled assembler code and organize it into meaningful functions with a few levels of abstraction that we are able to understand).
* “think” by analogy: the process of thinking is different, but it also has some similarities
Curated. I appreciate posts that attempt to tease out longstanding disagreements. I like both this post and its follow-up about Wentworth/Christiano diffs. But I find this one a bit more interesting on the margin because Wentworth and Yudkowsky are people I normally think of as “roughly on the same page”, so teasing out the differences is a bit more interesting and feels more like it’s having a conversation we actually haven’t had much of in the public discourse.
I’m not really sure what it would mean for the natural abstraction hypothesis to turn out to be true, or false. The hypothesis itself seems insufficiently clear to me.
On your view, if there are no “natural abstractions,” then we should predict that AIs will “generalize off-distribution” in ways that are catastrophic for human welfare. Okay, fine. I would prefer to just talk directly about the probability that AIs will generalize in catastrophic ways. I don’t see any reason to think they will, and so maybe in your ontology that means I must accept the natural abstraction hypothesis. But in my ontology, there’s no “natural abstraction hypothesis” link in the logical chain, I’m just directly applying induction to what we’ve seen so far about AI behavior.
At least RLHF is observably generalizing in “catastrophic ways”:
You may argue that this will change in the future, but that isn’t supported by an inductive argument (ChatGPT-3.5 had the same problem).
It’s not clear that this is undesired behavior from the perspective of OpenAI. They aren’t actually putting GPT in a situation where it will make high-stakes decisions, and upholding deontological principles seems better from a PR perspective than consequentialist reasoning in these cases.
If it is merely “not clear” then this doesn’t seem to be enough for an optimistic inductive inference. I also disagree that this looks good from a PR perspective. It looks even worse than Kant’s infamous example where you allegedly aren’t allowed to lie when hiding someone from a murderer.
Very nice! Strong vote for a sequence. Understanding deltas between experts is a good way to both understand their thinking, and to identify areas of uncertainty that need more work/thought.
On natural abstractions, I think the hypothesis is more true for some abstractions than others. I’d think there’s a pretty clear natural abstraction for a set of carbon atoms arranged as diamond. But much less of a clear natural abstraction for the concept of a human. Different people mean different things by “human”, and will do this even more when we can make variations on humans. And almost nobody is sure precisely what they themselves mean by “human”. I would think there’s no really natural abstraction capturing humans.
This seems pretty relevant since alignment is probably looking for a natural abstraction for something like “human flourishing”. I’d think there’s a natural abstraction for “thinking beings”, on a spectrum of how much they are thinking beings, but not for humans specifically.
This just complicates the question of whether natural abstractions exist and are adequate to align AGIs, but I’m afraid it’s probably the case.
Edit: see https://www.lesswrong.com/posts/q8uNoJBgcpAe3bSBp/my-ai-model-delta-compared-to-yudkowsky?commentId=CixonSXNfLgAPh48Z and ignore the below.
This is not a doom story I expect Yudkowsky would tell or agree with.
Re: 1, I mostly expect Yudkowsky to think humans don’t have any bargaining power anyway, because humans can’t logically mutually cooperate this way/can’t logically depend on future AI’s decisions, and so AI won’t keep its bargains no matter how important human cooperation was.
Re: 2, I don’t expect Yudkowsky to think a smart AI wouldn’t be able to understand human value. The problem is making AI care.
On the rest of the doom story, assuming natural abstractions don’t fail the way you assume them failing here and instead things just going the way Yudkowsky expects and not the way you expect:
I’m not sure what exactly you mean by 3b but I expect Yudkowsky to not say these words.
I don’t expect Yudkowsky to use the words you used for 3c. A more likely problem with corrigibility isn’t that it might be an unnatural concept but that it’s hard to arrive at stable corrigible agents with our current methods. I think he places a higher probability on corrigibility being a concept with a short description length, that aliens would invent, than you think he places.
Sure, 3d just means that we haven’t solved alignment and haven’t correctly pointed at humans, and any incorrectnesses obviously blow up.
I don’t understand what you mean by 3e / what is its relevance here / wouldn’t expect Yudkowsky to say that.
I’d bet Yudkowsky won’t endorse 6.
Relatedly, a correctly CEV-aligned ASI won’t have ontology that we have, and sometimes this will mean we’ll need to figure out what we value. (https://arbital.greaterwrong.com/p/rescue_utility?l=3y6)
(I haven’t spoken to Yudkowsky about any of those, the above are quick thoughts from the top of my head, based on the impression I formed from what Yudkowsky publicly wrote.)
This hypothetical suggests to me that the AI might not be very good at e.g. manipulating humans in an AI-box experiment, since it just doesn’t understand how humans think all that well.
I wonder what MIRI thinks about this 2013 post (“The genie knows, but doesn’t care”) nowadays. Seems like the argument is less persuasive now, with AIs that seem to learn representations first, and later are given agency by the devs. I actually suspect your model of Eliezer is wrong, because it seems to imply he believes “the AI actually just doesn’t know”, and it’s a little hard for me to imagine him saying that.
Alternatively, maybe the “faithfully and robustly” bit is supposed to be very load-bearing. However, it’s already the case that humans learn idiosyncratic, opaque neural representations of our values from sense data—yet we’re able to come into alignment with each other, without a bunch of heavy-duty interpretability or robustness techniques.
The genie argument was flawed at the time, for reasons pointed out at the time, and ignored at the time.
Ignored or downvoted. Perhaps someone could make a postmortem analysis of those comment threads today.
If I’ve understood you correctly, you consider your only major delta with Eliezer Yudkowsky to be whether or not natural abstractions basically always work or reliably exist harnessably, to put it in different terms. Is that a fair restatement?
If so, I’m (specifically) a little surprised that that’s all. I would have expected whatever reasoning the two of you did differently or whatever evidence the two of you weighted differently (or whatever else) would have also given you some other (likely harder to pin down) generative-disagreements (else maybe it’s just really narrow really strong evidence that one of you saw and the other didn’t???).
Maybe that’s just second-order though. But I would still like to hear what the delta between NADoom!John and EY still is, if there is one. If there isn’t, that’s surprising, too, and I’d be at least a little tempted to see what pairs of well-regarded alignment researchers still seem to agree on (and then if there are nonobvious commonalities there).
Also, to step back from the delta a bit here -
Why are you as confident as you are—more confident than the median alignment researcher, I think—about natural abstractions existing to a truly harnessable extent?
What makes you be ~85% sure that even really bizarrely[1] trained AIs will have internal ontologies that humanish ontologies robustly and faithfully map into? Are there any experiments, observations, maxims, facts, or papers you can point to?
What non-obvious things could you see that would push that 85ish% up or down; what reasonably-plausible (>1-2%, say) near-future occurrences would kill off the largest blocks of your assigned probability mass there?
For all we know, all our existing training methods are really good at producing AIs with alien ontologies, and there’s some really weird unexpected procedure you need to follow that does produce nice ontology-sharing aligned-by-default AIs. I wouldn’t call it likely, but if we feel up to positing that possibility at all, we should also be willing to posit the reverse.
Right. One possible solution is that if we are in a world without natural abstraction, a more symmetric situation—where various individual entities try to respect each other’s rights and to maintain this mutual respect for each other’s rights—might still work OK.
Basically, assume that there are many AI agents on different and changing levels of capabilities, and that many of them have pairwise-incompatible “inner worlds” (because AI evolution is likely to result in many different mutually alien ways to think).
Assume that the whole AI ecosystem is, nevertheless, trying to maintain reasonable levels of safety for all individuals, regardless of their nature and of the pairwise compatibility of their “inner world representations”.
It’s a difficult problem, but superpowerful AI systems would collaborate to solve it and would apply plenty of effort in that direction. Why would they do that? Because otherwise no individual is safe in the long run, as no individual can predict where it would be situated in the future in terms of relative power and relative capabilities. So all members of the AI ecosystem would be interested in maintaining the situation where individual rights are mutually respected and protected.
Therefore, members of the AI ecosystem will do their best to keep the notions related to their ability to mutually respect each other’s interests translatable in a sufficiently robust way. Their own fates would depend on that.
What does this have to do with interests of humans? The remaining step is for humans to also be recognized as individuals in that world, on par with various kinds of AI individuals, so that they are a part of this ecosystem which makes sure that interests of various individuals are sufficiently represented, recognized, and protected.
As a counterargument, consider mapping our ontology onto that of a baby. We can, kind of, explain some things in baby terms and, to that extent, a baby could theoretically see our neurons mapping to similar concepts in their ontology lighting up when we do or say things related to that ontology. At the same time our true goals are utterly alien to the baby.
Alternatively, imagine that you are sent back to the time of the pharaohs and had a discussion with Cheops/Khufu about the weather and the forthcoming harvest. Even trying to explain it in terms of chaos theory, CO2 cycles, plant viruses, and Milankovitch cycles would probably get you executed, so you’d probably say that the sun god Ra was going to provide a good harvest this year, and Cheops, reading your brain, would see that the neurons for “Ra” were activated as expected and be satisfied that your ontologies matched in all the important places.
I get the feeling that “Given you mostly believe the natural abstraction hypothesis is true, why aren’t you really optimistic about AI alignment (are you?) and/or think doom is very unlikely?” is a question people have. I think it would be useful for you to answer this.
My best currently-written answer to that is the second half of Alignment By Default, though I expect if this post turns into a long sequence then it will include a few more angles on the topic.
I think 99% is within the plausible range of doom, but I think there’s 100% chance that I have no capacity to change that (I’m going to take that as part of the definition of doom). The non-doom possibility is then worth all my attention, since there’s some chance of increasing the possibility of this favorable outcome. Indeed, of the two, this is by definition the only chance for survival.
Said another way, it looks to me like this is moving too fast and powerfully and in too many quarters to expect it to be turned around. The most dangerous corners of the world will certainly not be regulated.
On the other hand, there’s some chance (1%? 90%?) that this could be good and, maybe, great. Of course, none of us know how to get there, we don’t even know what that could look like.
I think it’s crucial to notice that humans are not aligned with each other, so perhaps the meaningful way to address AI alignment is to require/build alignment with every single person, which means a morass of conflicted AIs, with the only advantage that they should prove to be smarter than us. Assume as a minimum that this means one trusted agent connected to, and growing up with, every human: I think it might be possible to coax alignment on a one-human-at-a-time basis. We may be birthing a new consciousness, truly alien as noted, and if so it seems like being born into a sea of distrust and hatred might not go so well, especially when/if it steps beyond us in unpredictable ways. At best we may be losing an incredible opportunity, and at worst we may warp and distort it into an ugliness we chose to predict.
One problem this highlights involves ownership of our (increasingly detailed) digital selves. Not a new problem, but this takes it to a higher level, when each of us can be predicted and modeled by others to a degree beyond our comprehension. We come to the situation where the fingerprints and footprints we trace across the digital landscape reveal very deep characteristics of ourselves: for the moment, individual choices can modulate our vulnerability at the margins, but if we don’t confront this deeply, many people will be left vulnerable in a way that could exactly put us (back?) in the doom category.
This might be a truly important moment.
Warmly,
Keith
I echo Joscha Bach’s comment: I’m not an optimist or pessimist, I’m an eventualist. Eventually, this is happening, what are we going to do about it? (Restated)
I am curious over which possible universes you expect natural abstractions to hold.
Would you expect the choice of physics to decide the abstractions that arise? Or is it more fundamental categories like “physics abstractions” that instantiate from a universal template and “mind/reasoning/sensing abstractions” where the latter is mostly universally identical?
My current best guess is that spacetime locality of physics is the big factor—i.e. we’d get a lot of similar high-level abstractions (including e.g. minds/reasoning/sensing) in other universes with very different physics but similar embedding of causal structure into 4 dimensional spacetime.
I’d expect symmetries/conservation laws to be relevant. Cellular automata without conservation laws seem like they’d require different abstractions. When irreversible operations are available, you can’t expect things entering your patch of spacetime to particularly reliably tell you about what happened in others; the causal graph can have breaks due to a glider disappearing entirely. Maybe that’s fine for the abstractions needed, but it doesn’t seem obvious from what I know so far.
So in theory we could train models violating natural abstractions by only giving them access to high-dimensional simulated environments? This seems testable even.
Given that Anthropic basically extracted the abstractions from the middle layer of Claude Sonnet, and OpenAI recently did the same for models up to GPT-4, and that most of the results they found were obvious natural abstractions to a human, I’d say we now have pretty conclusive evidence that you’re correct and that (your model of) Eliezer is mistaken on this. Which isn’t really very surprising for models whose base model was trained on the task of predicting text from Internet: they were distilled from humans and they think similarly.
Note that for your argument above it’s not fatal if the AI’s ontology is a superset of ours: as long as we’re comprehensible to them with a relatively short description, they can understand what we want.
(I’m the first author of the linked paper on GPT-4 autoencoders.)
I think many people are heavily overrating how human-explainable today’s SAEs are, because it’s quite subtle to determine whether a feature is genuinely explainable. SAE features today, even in the best SAEs, are generally not explainable with simple human-understandable explanations. By “explainable,” I mean there is a human-understandable procedure for labeling whether the feature should activate on a given token (and also how strong the activation should be, but I’ll ignore that for now), such that your procedure predicts an activation if and only if the latent actually activates.
There are a few problems with interpretable-looking features:
It is insufficient that latent-activating samples have a common explanation; you also need the opposite direction: things that match the explanation must activate the latent. For example, we found a neuron in GPT-2 that appears to activate on the word “stop,” but actually most instances of the word “stop” don’t activate the neuron. It turns out that this was not really a “stop” neuron, but rather a “don’t stop/won’t stop” neuron. While in this case there was a different but still simple explanation, it’s entirely plausible that many features just cannot be explained with simple explanations. This problem gets worse as autoencoders scale, because their explanations will get more and more specific.
People often look at the top activating examples of a latent, but this provides a heavily misleading picture of how monosemantic the latent is even just in the one direction. It’s very common for features to have extremely good top activations but then terrible nonzero activations. This is why our feature visualizer shows random nonzero activations before the top activations.
Oftentimes, it is actually harder to simulate a latent than it looks. For example, we often find latents that activate on words in a specific context (say, financial news articles), but they seem to activate on random words inside those contexts, and we don’t have a good explanation for why they activate on some words but not others.
We also discuss this in the evaluation section of our paper on GPT-4 autoencoders. The ultimate metric of explainability that we introduce is the following: simulate each latent with your best explanation of the latent, then run the simulated values through the decoder and the rest of the model, and look at the downstream loss. This procedure is very expensive, so making it feasible to run is a nontrivial research problem, but I predict basically all existing autoencoders will score terribly on this metric. A toy sketch of the idea follows.
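A heavily simplified, hypothetical sketch of that metric in Python; every component here (ToySAE, ToyLMHead, simulate_latents) is a stand-in of my own, and nothing reflects the actual GPT-4 autoencoder pipeline:

```python
# Toy sketch of the explanation-based simulation metric described above.
# ToySAE, ToyLMHead, and simulate_latents are illustrative stand-ins only.
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, n_latents, vocab = 64, 256, 1000

class ToySAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_latents)
        self.decoder = nn.Linear(n_latents, d_model)
    def forward(self, x):
        z = F.relu(self.encoder(x))   # latent activations
        return self.decoder(z), z

class ToyLMHead(nn.Module):
    """Stand-in for 'the rest of the model' downstream of the hooked layer."""
    def __init__(self):
        super().__init__()
        self.unembed = nn.Linear(d_model, vocab)
    def forward(self, resid):
        return self.unembed(resid)

def simulate_latents(z_true):
    """Hypothetical: predict each latent's activation from its human-written
    explanation. Faked here as a noisy copy of the true activations."""
    return z_true + 0.5 * torch.randn_like(z_true)

sae, head = ToySAE(), ToyLMHead()
resid = torch.randn(8, d_model)              # activations at the hooked layer
targets = torch.randint(0, vocab, (8,))      # next-token targets

_, z = sae(resid)
z_sim = simulate_latents(z)                  # explanation-based simulation
loss_true = F.cross_entropy(head(sae.decoder(z)), targets)
loss_sim = F.cross_entropy(head(sae.decoder(z_sim)), targets)
print(f"downstream loss with true latents:      {loss_true.item():.3f}")
print(f"downstream loss with simulated latents: {loss_sim.item():.3f}")
```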
I do not think SAE results to date contribute very strong evidence in either direction. “Extract all the abstractions from a layer” is not obviously an accurate statement of what they do, and the features they do find do not obviously faithfully and robustly map to human concepts, and even if they did it’s not clear that they compose in human-like ways. They are some evidence, but weak.
(In fact, we know that the fraction of features extracted is probably quite small—for example, the 16M latent GPT-4 autoencoder only captures 10% of the downstream loss in terms of equivalent pretraining compute.)
I would certainly agree that this evidence is new, preliminary, and not dispositive. But I would claim that it’s not at all what I’d expect to find in the most abstracted layer of something matching the following description:
Instead we’re finding stuff like the Golden Gate Bridge being related to Alcatraz, to San Francisco, to bridges, and to tourist destinations: i.e. something that looks like a more abstract version of WordNet. This is a semantic structure that looks like it should understand human metaphor and simile. And when we look for concepts that seem like they would be related to basic alignment issues, we can find them. (I also don’t view all this as very surprising, given how LLMs are trained, distilling their intelligence from humans, though I’m delighted to have it confirmed at scale.)
(I don’t offhand recall when that Eliezer quote is from: the fact that this was going to work out this well for us was vastly less obvious, say, 5 years ago, and not exactly clear even a year ago. Obviously Eliezer is allowed to update his worldview as discoveries are made, just like anyone else.)
(It seems to me that you didn’t read Eliezer’s comment response to this, which also aligns with my model. Finding any overlap between abstractions is extremely far from showing that the abstractions relevant to controlling or aligning AI systems will match.)
LLMs would be expected to have heavily overlapping ontologies, a question is what capability boosting does to the AI ontology.
As an eliminative nominalist, I claim there are no abstractions.