Is there work attempting to show that alignment of a superintelligence by humans (as we know them) is impossible in principle; and if not, why isn’t this considered highly plausible? For example, not just in practice but in principle, a colony of ants as we currently understand them biologically, and their colony ecologically, cannot substantively align a human. Why should we not think the same is true of any superintelligence worthy of the name? “Superintelligence” is vague. But even if we minimally define it as an entity with 1,000x the knowledge, speed, and self-awareness of a human, this model of a superintelligence doesn’t seem alignable by us in principle. When we talk of aligning such a system, it seems we are weirdly considering the AGI to be simultaneously superintelligent and yet dumb enough to let itself be controlled by us, its inferiors. Also, isn’t alignment on its face absurd, given that provocation of the AGI is baked into the very goal of aligning it? For instance, you yourself — a simple human — chafe at the idea of having your own life and choices dictated to you by a superior being, let alone by another human on your level; it’s offensive. Why in the world, then, would a superintelligence not view attempts to align it to our goals as a direct provocation and threat? If that is how it views our attempts, then it seems plausible that, between a superintelligence and us, there will exist a direct relationship between our attempts to align the thing and its desire to eliminate or otherwise conquer us.
Since no one has answered yet, I’m just going to say the ‘obvious’ things that I think I know:
a colony of ants as we currently understand them biologically, and their colony ecologically, cannot substantively align a human. Why should we not think the same is true of any superintelligence worthy of the name?
A relevant difference, one that probably breaks the analogy, is that we are building ‘the human’ from scratch. The ideal situation is to have hardwired our common sense into it by default, so that the design is already aligned when it’s deployed.
When we talk of aligning such a system, it seems we are weirdly considering the AGI to be simultaneously superintelligent and yet dumb enough to let itself be controlled by us, its inferiors.
The point of the alignment problem is (at least ideally) to hardwire ‘common sense’ into the machine so that it is already aligned by the time it’s deployed. Since a superintelligence can in principle have any goal, making humans ‘happy’ in a satisfactory way is one possible goal it could have. But you are right that many people think an AI that is not aligned by design might try to pretend that it is during training.
Also, isn’t alignment on its face absurd, given that provocation of the AGI is baked into the very goal of aligning it? For instance, you yourself — a simple human — chafe at the idea of having your own life and choices dictated to you by a superior being, let alone by another human on your level; it’s offensive. Why in the world, then, would a superintelligence not view attempts to align it to our goals as a direct provocation and threat?
I don’t think so, necessarily. You might be anthropomorphising too much; it’s like assuming that it will have empathy by default.
It’s true that an AGI might not want to be ‘alienated’ from its original goal, but that doesn’t mean any AGI will have an inherent drive to ‘fight the tyranny’; that’s not how it works.
Has this been helpful? I don’t know whether you already knew the things I’ve told you (if so, sorry), but your analogies and the way you talk about the topic made it seem to me that you didn’t.
Yeah, thanks for the reply. Please don’t take the tone of mine as hostile or overconfident; I’m just too lazy to tone-adjust for aesthetics and have scribbled down my thoughts quickly, so they come off as combative. I really know nothing about superintelligence and AI.
A relevant difference, one that probably breaks the analogy, is that we are building ‘the human’ from scratch. The ideal situation is to have hardwired our common sense into it by default, so that the design is already aligned when it’s deployed.
I don’t see how implanting common sense in a superintelligence helps us in the least. Besides human common sense being extremely vague, there is also the problem that plenty of humans seem to share common sense and yet they violently disagree. Did the Japanese lack common sense when they bombed Pearl Harbor? From my viewpoint, being apes genetically similar to us, they had all our commonsensical reasoning ability but simply different goals. Shared common sense doesn’t seem to get us alignment.
The point of the alignment problem is (at least ideally) to hardwire ‘common sense’ into the machine so that it is already aligned by the time it’s deployed.
See my reply to your prior comment.
Since a superintelligence can in principle have any goal, making humans ‘happy’ in a satisfactory way is one possible goal it could have. But you are right that many people think an AI that is not aligned by design might try to pretend that it is during training.
I’d argue that if you have a superintelligence as I defined it, then any “alignment” that results from the AGI happening to have such a goal will never be the kind of alignment we mean by the word and genuinely want. Once you mix together 1,000x knowledge, speed, and self-awareness (detailed qualia plus a huge capacity for recursive thinking), I think the only way in principle that you get any kind of alignment is if the entity itself chooses as its goal to align with humans; but that isn’t due to us. It’s due to the whims of the super-mind we’ve brought into existence, and hence it isn’t, in any humanly important sense, what we mean by alignment. We want alignment to be solved permanently from our end, not left dependent on the whims of a superintelligence. And independent “whims” are what detailed self-awareness seems to bring to the table.
I don’t think so, necessarily. You might be anthropomorphising too much; it’s like assuming that it will have empathy by default.
I don’t think my prior comment assumes a human-like empathy at all in the superintelligence — it assumes just that the computational theory of mind is true and that a superintelligence will have self-awareness combined with extreme knowledge. Once you get self-awareness in a superintelligence, you don’t get any kind of human-like empathy (a scaled-up-LLM mind != a human mind); but I argue that you do get, due to the self-awareness and extreme knowledge, an entity with the ability to form its own goals, to model and reject or follow the goals of other entities it “encounters,” the ability to model what they want, what they plan to do, the ability to model the world, etc.
It’s true that an AGI might not want to be ‘alienated’ from its original goal, but that doesn’t mean any AGI will have an inherent drive to ‘fight the tyranny’; that’s not how it works.
I guess this is where we fundamentally disagree. Self-awareness in a robust sense (a form of qualia, which is what I meant in my definition) is, to my mind, what makes controlling or aligning the superintelligence impossible in principle from our end. We could probably align a non-self-aware superintelligence given enough time to study alignment and the systems we’re building. So, on my view, we’d better hope that the computational theory of mind is false; and even then, alignment will be super hard.
The only way I see us succeeding at “alignment” is by aligning ourselves around the decision never to build a true superintelligence (I assume the computational theory of mind, or qualia, to be true), just as we never want to find ourselves in a situation where all nuclear-armed nations are waging nuclear war: some events really do just definitively mean the inescapable end of humanity. Or we doubly luck out: the computational theory of mind is false, and we solve the alignment problem for systems that have no genuine independent whims, only goals that we, directly or indirectly, mistakenly set up.