I think that the AI’s internal ontology is liable to have some noticeable alignments to human ontology w/r/t the purely predictive aspects of the natural world; it wouldn’t surprise me to find distinct thoughts in there about electrons. As the internal ontology goes to be more about affordances and actions, I expect to find increasing disalignment. As the internal ontology takes on any reflective aspects, parts of the representation that mix with facts about the AI’s internals, I expect to find much larger differences—not just that the AI has a different concept boundary around “easy to understand”, say, but that it maybe doesn’t have any such internal notion as “easy to understand” at all, because easiness isn’t in the environment and the AI doesn’t have any such thing as “effort”. Maybe it’s got categories around yieldingness to seven different categories of methods, and/or some general notion of “can predict at all / can’t predict at all”, but no general notion that maps onto human “easy to understand”—though “easy to understand” is plausibly general-enough that I wouldn’t be unsurprised to find a mapping after all.
Corrigibility and actual human values are both heavily reflective concepts. If you master a requisite level of the prerequisite skill of noticing when a concept definition has a step where its boundary depends on your own internals rather than pure facts about the environment—which of course most people can’t do because they project the category boundary onto the environment, but I have some credit that John Wentworth might be able to do it some—and then you start mapping out concept definitions about corrigibility or values or god help you CEV, that might help highlight where some of my concern about unnatural abstractions comes in.
Entirely separately, I have concerns about the ability of ML-based technology to robustly point the AI in any builder-intended direction whatsoever, even if there exists some not-too-large adequate mapping from that intended direction onto the AI’s internal ontology at training time. My guess is that more of the disagreement lies here.
So, would you also say that two random humans are likely to have similar misalignment problems w.r.t. each other? E.g. my brain is different from yours, so the concepts I associate with words like “be helpful” and “don’t betray Eliezer” and so forth are going to be different from the concepts you associate with those words, and in some cases there might be strings of words that are meaningful to you but totally meaningless to me, and therefore if you are the principal and I am your agent, and we totally avoid problem #2 (in which you give me instructions and I just don’t follow them, even the as-interpreted-by-me version of them) you are still screwed? (Provided the power differential between us is big enough?)
I assumed the idea here was that AGI has a different mind architecture and thus also has different internal concepts for reflection. E.g. where a human might think about a task in terms of required willpower, an AGI might instead have internal concepts for required power consumption or compute threads or something.
Since human brains all share more or less the same architecture, you’d only expect significant misalignment between them if specific brains differed a lot from one another: e.g. someone with brain damage vs. a genius, or (as per an ACX post) a normal human vs. some one-of-a-kind person who doesn’t experience suffering due to some genetic quirk.
Or suppose we could upload people: then a flesh-and-blood human with a physical brain would have a different internal architecture from a digital human with a digital brain simulated on physical computer hardware. In which case their reflective concepts might diverge insofar as the simulation was imperfect and leaked details about the computer hardware and its constraints.
So it sounds like you are saying it’s a matter of degree, not kind: two humans will have minor differences between each other, and some humans (such as those with genetic quirks) will have major differences between each other. But AIs vs. humans will have lots of major differences between each other.
So, how much difference is too much, then? What’s the case that the AI-to-human differences (which are undoubtedly larger than the human-to-human differences) are large enough to cause serious problems (even in worlds where we avoid problem #2)?
I thought this is what the “Shoggoth” metaphor for LLMs and AI assistants is pointing at: When reasoning about nonhuman minds, we employ intuitions that we’d evolved to think about fellow humans. Consequently, many arguments against AI x-risk from superintelligent agents employ intuitions that route through human-flavored concepts like kindness, altruism, reciprocity, etc.
The strength or weakness of those kinds of arguments depends on the extent to which the superintelligent agent uses or thinks in those human concepts. But those concepts arose in humans through the process of evolution, which is very different from how ML-based AIs are designed. Therefore there’s no prima facie reason to expect that a superintelligent AGI, designed with a very different mind architecture, would employ those human concepts. And so those aforementioned intuitions that argue against x-risk are unconvincing.
For example, if I ask an AI assistant to respond as if it’s Abraham Lincoln, then human concepts like kindness are not good predictors for how the AI assistant will respond, because it’s not actually Abraham Lincoln, it’s more like a Shoggoth pretending to be Abraham Lincoln.
In contrast, if we encountered aliens, they would presumably have arisen from evolution, in which case their mind architectures would be closer to ours than an artificially designed AGI’s, and this would make our intuitions comparatively more applicable. Although that wouldn’t suffice for value alignment with humanity. Related fiction: EY’s Three Worlds Collide.
if I ask an AI assistant to respond as if it’s Abraham Lincoln, then human concepts like kindness are not good predictors for how the AI assistant will respond, because it’s not actually Abraham Lincoln, it’s more like a Shoggoth pretending to be Abraham Lincoln.
Somewhat disagree here—while we can’t use kindness to predict the internal “thought process” of the AI, [if we assume it’s not actively disobedient] the instructions mean that it will use an internal lossy model of what humans mean by kindness, and incorporate that into its act. Similar to how a talented human actor can realistically play a serial killer without having a “true” understanding of the urge to serially-kill people irl.
That’s a fair rebuttal. The actor analogy seems good: an actor will behave more or less like Abraham Lincoln in some situations, and very differently in others: e.g. on movie set vs. off movie set, vs. being with family, vs. being detained by police.
Similarly, the shoggoth will output similar tokens to Abraham Lincoln in some situations, and very different ones in others: e.g. in-distribution requests for famous Abraham Lincoln speeches, vs. out-of-distribution requests like asking for Abraham Lincoln’s opinions on 21st-century art, vs. requests which invoke LLM token glitches like SolidGoldMagikarp, vs. disallowed requests that are denied by company policy and thus receive some boilerplate corporate response.
“I assumed the idea here was that AGI has a different mind architecture and thus also has different internal concepts for reflection.”

It is not just the internal architecture. An AGI will have a completely different set of actuators and sensors compared to humans.

Entirely separately, I have concerns about the ability of ML-based technology to robustly point the AI in any builder-intended direction whatsoever, even if there exists some not-too-large adequate mapping from that intended direction onto the AI’s internal ontology at training time. My guess is that more of the disagreement lies here.
I doubt much disagreement between you and me lies there, because I do not expect ML-style training to robustly point an AI in any builder-intended direction. My hopes generally don’t route through targeting via ML-style training.
I do think my deltas from many other people lie there—e.g. that’s why I’m nowhere near as optimistic as Quintin—so that’s also where I’d expect much of your disagreement with those other people to lie.
Okay, so where do most of your hopes route through, then?

There isn’t really one specific thing, since we don’t yet know what the next ML/AI paradigm will look like, other than that some kind of neural net will probably be involved somehow. (My median expectation is that we’re ~1 transformers-level paradigm shift away from the things which take off.) But as a relatively legible example of a targeting technique my hopes might route through: Retargeting The Search.
Corrigibility and actual human values are both heavily reflective concepts. If you master a requisite level of the prerequisite skill of noticing when a concept definition has a step where its boundary depends on your own internals rather than pure facts about the environment—which of course most people can’t do because they project the category boundary onto the environment
Actual human values depend on human internals, but predictions about systems that strongly couple to human behavior depend on human internals as well. I thus expect efficient representations of systems that strongly couple to human behavior to include human values as somewhat explicit variables. I expect this because humans seem agent-like enough that modeling them as trying to optimize for some set of goals is a computationally efficient heuristic in the toolbox for predicting humans.
At lower confidence, I also think human expected-value-trajectory-under-additional-somewhat-coherent-reflection would show up explicitly in the thoughts of AIs that try to predict systems strongly coupled to humans. I think this because humans seem to change their values enough over time in a sufficiently coherent fashion that this is a useful concept to have. E.g., when watching my cousin grow up, I find it useful and possible to have a notion in advance of what they will come to value when they are older and think more about what they want.
I do not think there is much reason by default for the representations of these human values and human value trajectories to be particularly related to the AI’s values in a way we like. But that they are in there at all sure seems like it’d make some research easier, compared to the counterfactual. For example, if you figure out how to do good interpretability, you can look into an AI and get a decent mathematical representation of human values and value trajectories out of it. This seems like a generally useful thing to have.
If you separately happen to have developed a way to point AIs at particular goals, perhaps also downstream of you having figured out how to do good interpretability[1], then having explicit access to a decent representation of human values and human expected-value-trajectories-under-additional-somewhat-coherent-reflection might be a good starting point for research on making superhuman AIs that won’t kill everyone.
By ‘good interpretability’, I don’t necessarily mean interpretability at the level where we understand a forward pass of GPT-4 so well that we can code our own superior LLM by hand in Python like a GOFAI. It might need to be better interpretability than that. This is because an AI’s goals, by default, don’t need to be explicitly represented objects within the parameter structure of a single forward pass.
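As a minimal sketch of the “model the human as trying to optimize for some set of goals” heuristic described above, assuming a simple Boltzmann-rational (softmax) agent model; the function names, action set, and sweetness scores below are illustrative assumptions rather than anything proposed in this thread:

```python
import numpy as np

def predict_action_probs(actions, reward_fn, beta=2.0):
    """Boltzmann-rational model: P(action) proportional to exp(beta * reward(action)).

    actions:   candidate actions the human might take
    reward_fn: hypothesized goal, mapping an action to how much the human values it
    beta:      rationality parameter; higher means closer to pure optimization
    """
    scores = np.array([reward_fn(a) for a in actions], dtype=float)
    logits = beta * scores
    logits -= logits.max()          # shift for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Illustrative usage: predicting a snack choice under a hypothesized "prefers sweet" goal.
actions = ["apple", "chocolate", "celery"]
sweetness = {"apple": 0.6, "chocolate": 0.9, "celery": 0.1}
print(predict_action_probs(actions, lambda a: sweetness[a]))
```

The point of the sketch is only that a single hypothesized goal variable compresses the prediction problem: one scalar function over actions stands in for a lot of raw behavioral data.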
I expect this because humans seem agent-like enough that modeling them as trying to optimize for some set of goals is a computationally efficient heuristic in the toolbox for predicting humans.
Sure, but the sort of thing that people actually optimize for (revealed preferences) tends to be very different from what they proclaim to be their values. This is a point not often raised in polite conversation, but to me it’s a key reason for the thing people call “value alignment” being incoherent in the first place.
I kind of expect that things-people-call-their-values-that-are-not-their-revealed-preferences would be a concept that a smart AI that predicts systems coupled to humans would think in as well. It doesn’t matter whether these stated values are ‘incoherent’ in the sense of not being in tune with actual human behavior; they’re useful for modelling humans because humans use them to model themselves, and these self-models couple to their behavior. Even if they don’t couple in the sense of being the revealed preferences in an agentic model of the humans’ actions.
Every time a human tries and mostly fails to explain what things they’d like to value if only they were more internally coherent and thought harder about things, a predictor trying to forecast their words and future downstream actions has a much easier time of it if it has a crisp operationalization of the endpoint the human is failing to operationalize.
An analogy: If you’re trying to predict what sorts of errors a diverse range of students might make while trying to solve a math problem, it helps to know what the correct answer is. Or if there isn’t a single correct answer, what the space of valid answers looks like.
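To make the analogy concrete, here is a toy sketch assuming an arithmetic problem; the specific error types and probabilities are made-up assumptions. A predictor that knows the correct answer can express most of the error distribution as structured deviations from it:

```python
import random

def sample_student_answer(correct_answer, rng):
    """Toy error model: most students get the right answer; the rest make
    structured mistakes defined relative to the correct answer. Without
    knowing correct_answer, the answer distribution is much harder to pin down."""
    r = rng.random()
    if r < 0.70:
        return correct_answer                        # solved correctly
    elif r < 0.85:
        return correct_answer + rng.choice([-1, 1])  # off-by-one slip
    elif r < 0.95:
        return -correct_answer                       # sign error
    else:
        return rng.randint(-100, 100)                # unmodeled confusion

rng = random.Random(0)
print([sample_student_answer(17, rng) for _ in range(10)])
```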
Oh, sure, I agree that an ASI would understand all of that well enough, but even if it wanted to, it wouldn’t be able to give us either all of what we think we want, or what we would endorse in some hypothetical enlightened way, because neither of those things comprises a coherent framework that robustly generalizes far out-of-distribution for human circumstances, even for one person, never mind the whole of humanity.
The best we could hope for is that some-true-core-of-us-or-whatever would generalize in such a way, and that the AI recognizes this and propagates it while sacrificing the inessential contradictory parts. But given that our current state of moral philosophy is hopelessly out of its depth relative to this, to the extent that people rarely even acknowledge these issues, trusting that an AI would get this right seems like a desperate gamble to me, even granting that we somehow could make it want to.
Of course, it doesn’t look like we would get to choose not to be subjected to a gamble of this sort even if more people were aware of it, so maybe it’s better for them to remain in blissful ignorance for now.
Could anyone possibly offer 2 positive and 2 negative examples of a reflective-in-this-sense concept?

Positive: “easy to understand”, “appealing”, “native (according to me) representation”

Negative: “apple”, “gluon”, “marriage”

The concept of marriage depends on my internals in that a different human might disagree about whether a couple is married, based on the relative weight they place on religious, legal, traditional, and common-law conceptions of marriage. For example, after a Catholic annulment and a legal divorce, a Catholic priest might say that two people were never married, whereas I would say that they were. Similarly, I might say that two men are married to each other, and someone else might say that this is impossible. How quickly those arguments have faded away! I don’t think someone would have used the same example ten years ago.
It seems like “human values” aren’t particularly reflective, then? Like, I could describe the behavioral properties of a species of animal, including what they value or don’t value.

But that leaves something out?
A lot of the particulars of humans’ values are heavily reflective. Two examples:
A large chunk of humans’ terminal values involves their emotional/experience states—happy, sad, in pain, delighted, etc.
Humans typically want ~terminally to have some control over their own futures.
Contrast that to e.g. a blue-minimizing robot, which just tries to minimize the amount of blue stuff in the universe. That utility function involves reflection only insofar as the robot is (or isn’t) blue.
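A minimal sketch of that contrast, assuming a toy pixel-array world; predicted_effort is an illustrative stand-in for an internal quantity like “effort”, not a claim about how real systems represent anything:

```python
import numpy as np

def blue_minimizer_utility(world_pixels: np.ndarray) -> float:
    """Non-reflective objective: depends only on the environment.
    world_pixels is an H x W x 3 RGB array; less blue means higher utility."""
    r, g, b = world_pixels[..., 0], world_pixels[..., 1], world_pixels[..., 2]
    blue_mask = (b > 128) & (b > r) & (b > g)   # crude "is this pixel blue?" test
    return -float(blue_mask.sum())

def reflective_utility(world_pixels: np.ndarray, own_internals: dict) -> float:
    """Reflective objective: also references the agent's own internals, here a
    predicted-effort variable standing in for notions like "easy to understand"."""
    return blue_minimizer_utility(world_pixels) - own_internals.get("predicted_effort", 0.0)

frame = np.zeros((4, 4, 3), dtype=np.uint8)
frame[0, 0] = (0, 0, 255)                                     # one blue pixel
print(blue_minimizer_utility(frame))                          # -1.0
print(reflective_utility(frame, {"predicted_effort": 2.5}))   # -3.5
```

The non-reflective objective reads only off the world state, while the reflective one also reads off the agent’s own internals, which is where the concept boundary starts to depend on facts about the agent rather than the environment.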
I think you can unroll any of the positive examples by reference to facts about the speaker. To be honest, I don’t understand what is supposed to be so reflective about “actual human values”, but perhaps it’s that the ontology is defined with reference to fairly detailed empirical facts about humans.
If I encountered an intelligent extraterrestrial species, in principle I think I could learn to predict fairly well things like what it finds easy to understand, what its values are, and what it considers to be ethical behavior, without using any of the cognitive machinery I use to self-reflect. Humans tend to reason about other humans by asking “what would I think if I was in their situation”, but in principle an AI doesn’t have to work that way. But perhaps you think there are strong reasons why this would happen in practice?
Suppose we had strong reasons to believe that an AI system wasn’t self-aware and wasn’t capable of self-reflection, so it could look over a plan it generated and reason about its understandability, corrigibility, impact on human values, etc. without any reflective aspects. Does that make alignment any easier, according to you?
Suppose the AI lacks a concept of “easy to understand”, as you hypothesize. Does it seem reasonable to think that it might not be all that great at convincing a gatekeeper to unbox it, since it might focus on super-complex arguments which humans can’t understand?
Entirely separately, I have concerns about the ability of ML-based technology to robustly point the AI in any builder-intended direction whatsoever, even if there exists some not-too-large adequate mapping from that intended direction onto the AI’s internal ontology at training time. My guess is that more of the disagreement lies here.
Is this mostly about mesa-optimizers, or something else?
A potential big Model Delta in this conversation is between Yudkowsky-2022 and Yudkowsky-2024. From List of Lethalities:
The AI does not think like you do, the AI doesn’t have thoughts built up from the same concepts you use, it is utterly alien on a staggering scale. Nobody knows what the hell GPT-3 is thinking, not only because the matrices are opaque, but because the stuff within that opaque container is, very likely, incredibly alien—nothing that would translate well into comprehensible human thinking, even if we could see past the giant wall of floating-point numbers to what lay behind.
Vs the parent comment:
I think that the AI’s internal ontology is liable to have some noticeable alignments to human ontology w/r/t the purely predictive aspects of the natural world; it wouldn’t surprise me to find distinct thoughts in there about electrons. As the internal ontology goes to be more about affordances and actions, I expect to find increasing disalignment. As the internal ontology takes on any reflective aspects, parts of the representation that mix with facts about the AI’s internals, I expect to find much larger differences—not just that the AI has a different concept boundary around “easy to understand”, say, but that it maybe doesn’t have any such internal notion as “easy to understand” at all, because easiness isn’t in the environment and the AI doesn’t have any such thing as “effort”. Maybe it’s got categories around yieldingness to seven different categories of methods, and/or some general notion of “can predict at all / can’t predict at all”, but no general notion that maps onto human “easy to understand”—though “easy to understand” is plausibly general-enough that I wouldn’t be unsurprised to find a mapping after all.
Yudkowsky is “not particularly happy” with List of Lethalities, and this comment was made a day after the opening post, so neither quote should be considered a perfect expression of Yudkowsky’s belief. In particular the second quote is more epistemically modest, which might be because it is part of a conversation rather than a self-described “individual rant”. Still, the differences are stark. Is the AI utterly, incredibly alien “on a staggering scale”, or does the AI have “noticeable alignments to human ontology”? Are the differences pervasive with “nothing that would translate well”, or does it depend on whether the concepts are “purely predictive”, about “affordances and actions”, or have “reflective aspects”?
The second quote is also less lethal. Human-to-human comparisons seem instructive. A deaf human will have thoughts about electrons, but their internal ontology around affordances and actions will be less aligned. Someone like Eliezer Yudkowsky has the skill of noticing when a concept definition has a step where its boundary depends on your own internals rather than pure facts about the environment, whereas I can’t do that because I project the category boundary onto the environment. Someone with dissociative identities may not have a general notion that maps onto my “myself”. Someone who is enlightened may not have a general notion that maps onto my “I want”. And so forth.
Regardless, differing ontologies are still a clear risk factor. The second quote still modestly allows the possibility of a mind so utterly alien that it doesn’t have thoughts about electrons. And there are 42 other lethalities in the list. Security mindset says that risk factors can combine in unexpected ways and kill you.
I’m not sure if this is an update from Yudkowsky-2022 to Yudkowsky-2024. I might expect an update to be flagged as such (e.g. “I now think that...” instead of “I think that...”). But Yudkowsky said elsewhere that he has made some positive updates. I’m curious if this is one of them.
This is probably the wrong place to respond to the notion of incommensurable ontologies. Oh well, sorry.
While I agree that if an agent has a thoroughly incommensurable ontology, alignment is impossible (or perhaps even meaningless or incoherent), it also means that the agent has no access whatsoever to human science. If it can’t understand what we want, it also can’t understand what we’ve accomplished. To be more concrete, it will not understand electrons from any of our books, because it won’t understand our books. It won’t understand our equations, because it won’t understand equations, nor will it have referents (neither theoretical nor observational) for the variables and entities contained therein.
Consequently, it will have to develop science and technology from scratch. It took a long time for us to do that, and it will take that agent a long time to do it. Sure, it’s “superintelligent,” but understanding the physical world requires empirical work. That is time-consuming, it requires tools and technology, etc. Furthermore, an agent with an incommensurable ontology can’t manipulate humans effectively—it doesn’t understand us at all, aside from what it observes, which is a long, slow way to learn about us. Indeed it doesn’t even know that we are a threat, nor does it know what a threat is.
Long story short, it will be a long time (decades? centuries?) before such an agent would be able to prevent us from simply unplugging it. Science does not and cannot proceed at the speed of computation, so all of the “exponential improvement” in its “intelligence” is limited by the pace of knowledge growth.
Now, what if it has some purchase on human ontology? Well, then, it seems likely that it can grow that into a sufficient shared subset, and in that way we can understand each other sufficiently well: it can understand our science, but it can also understand our values.
The point is that if you have one, you’re likely to have the other. Of course, this does not mean that it will align with those values. But the incommensurable-ontology argument just reduces to an argument for slow takeoff.

I’ve published this point as part of a paper in Informatica: https://www.informatica.si/index.php/informatica/article/view/1875
As to the last point, I agree that it seems likely that most iterations of AI cannot be “pointed in a builder-intended direction” robustly. It’s like thinking you’re the last word on your children’s lifetime worth of thinking. Most likely (and hopefully!) they’ll be doing their own thinking at some point, and if the only thing the parent has said about that is “thou shalt not think beyond me”, the most likely result (looking only at the possibility that we get to AGI and are here to talk about it) may be to remove ANY chance to influence them as adults. Life may not come with guarantees, who knew?

Warmly,
Keith
As the internal ontology takes on any reflective aspects, parts of the representation that mix with facts about the AI’s internals, I expect to find much larger differences
It could be worth exploring reflection in transparency-based AIs, the internals of which are observable. We could train a learning AI which only learns concepts by grounding them in the AI’s internals (consider the example of a language-based AI learning a representation linking the act of saying words to its own output procedure). Even if AI-learned concepts do not coincide with human concepts, because the AI’s internals differ greatly from human experience (e.g. a notion of “easy to understand” taking on only a metaphorical meaning for an AI), AI-learned concepts would remain interpretable to the programmer of the AI given the transparency of the AI (and the programmer could engineer control mechanisms to deal with disalignment). In other words, there will be unnatural abstractions, but they will be discoverable, on the condition of training a different kind of AI (as opposed to current methods, which are not inherently interpretable). This is monumental work, but desperately needed work.