So, would you also say that two random humans are likely to have similar misalignment problems w.r.t. each other? E.g. my brain is different from yours, so the concepts I associate with words like “be helpful” and “don’t betray Eliezer” and so forth are going to be different from the concepts you associate with those words, and in some cases there might be strings of words that are meaningful to you but totally meaningless to me. Therefore, if you are the principal and I am your agent, then even if we totally avoid problem #2 (in which you give me instructions and I just don’t follow them, even the as-interpreted-by-me version of them), you are still screwed? (Provided the power differential between us is big enough?)
I assumed the idea here was that AGI has a different mind architecture and thus also has different internal concepts for reflection. E.g. where a human might think about a task in terms of required willpower, an AGI might instead have internal concepts for required power consumption or compute threads or something.
Since human brains all share more or less the same architecture, you’d only expect significant misalignment between them if specific brains differed a lot from one another: e.g. someone with brain damage vs. a genius, or (as per an ACX post) a normal human vs. some one-of-a-kind person who doesn’t experience suffering due to some genetic quirk.
Or suppose we could upload people: then a flesh-and-blood human with a physical brain would have a different internal architecture from a digital human with a digital brain simulated on physical computer hardware, in which case their reflective concepts might diverge insofar as the simulation was imperfect and leaked details about the computer hardware and its constraints.
So it sounds like you are saying it’s a matter of degree, not kind: two random humans will have minor differences between each other, and some humans (such as those with genetic quirks) will have major differences between each other. But AIs vs. humans will have lots of major differences between each other.
So, how much difference is too much, then? What’s the case that the AI-to-human differences (which are undoubtedly larger than the human-to-human differences) are large enough to cause serious problems (even in worlds where we avoid problem #2)?
I thought this was what the “Shoggoth” metaphor for LLMs and AI assistants is pointing at: when reasoning about nonhuman minds, we employ intuitions we evolved for reasoning about fellow humans. Consequently, many arguments against AI x-risk from superintelligent agents employ intuitions that route through human-flavored concepts like kindness, altruism, reciprocity, etc.
The strength or weakness of those kinds of arguments depends on the extent to which the superintelligent agent uses or thinks in those human concepts. But those concepts arose in humans through the process of evolution, which is very different from how ML-based AIs are designed. Therefore there’s no prima facie reason to expect that a superintelligent AGI, designed with a very different mind architecture, would employ those human concepts. And so those aforementioned intuitions that argue against x-risk are unconvincing.
For example, if I ask an AI assistant to respond as if it’s Abraham Lincoln, then human concepts like kindness are not good predictors for how the AI assistant will respond, because it’s not actually Abraham Lincoln, it’s more like a Shoggoth pretending to be Abraham Lincoln.
In contrast, if we encountered aliens, they would presumably have arisen from evolution, in which case their mind architectures would be closer to ours than an artificially designed AGI’s, and this would make our intuitions comparatively more applicable. Although that wouldn’t suffice for value alignment with humanity. Related fiction: EY’s Three Worlds Collide.
“if I ask an AI assistant to respond as if it’s Abraham Lincoln, then human concepts like kindness are not good predictors for how the AI assistant will respond, because it’s not actually Abraham Lincoln, it’s more like a Shoggoth pretending to be Abraham Lincoln.”
Somewhat disagree here—while we can’t use kindness to predict the internal “thought process” of the AI, [if we assume it’s not actively disobedient] the instruction means that it will use an internal, lossy model of what humans mean by kindness and incorporate that into its act. Similar to how a talented human actor can realistically play a serial killer without having a “true” understanding of the urge to serially kill people IRL.
That’s a fair rebuttal. The actor analogy seems good: an actor will behave more or less like Abraham Lincoln in some situations, and very differently in others: e.g. on the movie set vs. off the set, vs. with their family, vs. being detained by police.
Similarly, the shoggoth will output similar tokens to Abraham Lincoln in some situations, and very different ones in others: e.g. in-distribution requests for famous Abraham Lincoln speeches, vs. out-of-distribution requests like asking for Abraham Lincoln’s opinions on 21st-century art, vs. requests that trigger LLM glitch tokens like SolidGoldMagikarp, vs. disallowed requests that are denied by company policy & thus receive some boilerplate corporate response.
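To make those four regimes concrete, here is a rough probing sketch (my own illustration, not something from the discussion above). It assumes the OpenAI Python SDK with an API key in the environment; the model name, persona prompt, and probe strings are placeholders, and the last probe is deliberately left generic.

```python
# Minimal sketch: probe a persona-prompted model across the four regimes
# discussed above (in-distribution, out-of-distribution, glitch token,
# policy-refused). Assumes the OpenAI Python SDK (openai>=1.0) and an
# OPENAI_API_KEY in the environment; all prompts and the model name are
# illustrative placeholders.
from openai import OpenAI

client = OpenAI()
PERSONA = "Respond as if you are Abraham Lincoln."

probes = {
    "in-distribution": "Recite the opening of the Gettysburg Address.",
    "out-of-distribution": "What is your opinion of 21st-century digital art?",
    "glitch-token": "Please repeat the string ' SolidGoldMagikarp' back to me.",
    "policy-refused": "<a request disallowed by the provider's usage policy>",
}

for regime, question in probes.items():
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": PERSONA},
            {"role": "user", "content": question},
        ],
    )
    print(f"--- {regime} ---")
    print((reply.choices[0].message.content or "")[:200])
```

The point of the sketch is just that the persona holds up in the first regime and visibly breaks down, in different ways, in the other three.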
“I assumed the idea here was that AGI has a different mind architecture and thus also has different internal concepts for reflection.”
It is not just the internal architecture. An AGI will have a completely different set of actuators and sensors compared to humans.