I’ve read a lot of the doomer content on here about AGI and am still unconvinced that alignment is difficult-by-default. I think if you generalize from the way humans are “aligned”, the prospect of aligning an AGI well looks pretty good. The pessimistic views on this seem to all come to the opposite conclusion by arguing “evolution failed to align humans, by its own standards”. However:
Evolution isn’t an agent attempting to align humans, or even a concrete active force acting on humans; it is merely the effect of a repeatedly applied filter.
The equivalent of evolution in the development of AGI is not the training of a model; it’s the process of researchers developing more sophisticated architectures. The training of a model is closer to the equivalent of the early stages of life for a human.
If you follow models from developmental psychology (rather than evolutionary psychology, which is a standard frame on libertarian-adjacent blogs, but not in the psychological mainstream), alignment works almost too well. For instance, in the standard picture from psychoanalysis, a patient will go into therapy for years attempting to rid himself of negative influence from a paternal superego, without success. A sort of “ghost” of the judgment-wielding figure from the training stage will hover around him for his entire life. Lacan, for example, asserts that attempting to “forget” the role of the parental figure inevitably leads to psychosis, because any sort of coherent self is predicated on the symbolic presence of the parental figure—this is because the model of the self as an independent object separated from the world is created in the first place to facilitate an operating relation with the parent. Without any stable representation of the self, the different parts of the mind can’t function as a whole, and they break down into psychosis.
Translated to a neural network context, we can perhaps imagine that if a strategy model is trained in a learning regimen which involves receiving judgment on its actions from authority figures (perhaps simulating 100 different LLM chatbot personalities with different ethical orientations to make up for the fact that we don’t have a unified ethical theory & never will), it will never develop the pathways which would allow it to do evil acts, similar to how Stable Diffusion can’t draw pornography unless it’s fine-tuned.
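To make this concrete, here is a minimal sketch of the kind of judge-ensemble training signal being described. Everything in it (the persona list, score_action, reward_batch) is hypothetical scaffolding invented for illustration, with a random stub standing in for actual LLM judge calls.

```python
# Hypothetical sketch of the regimen described above: a policy model's proposed
# actions are scored by an ensemble of judge personas with different ethical
# orientations, and the aggregated verdict becomes the reward signal.
# score_action is a stand-in for querying a real LLM judge.

import random

JUDGE_PERSONAS = ["deontologist", "utilitarian", "virtue ethicist"]  # stand-ins for ~100 personas

def score_action(persona: str, action: str) -> float:
    """Placeholder for an LLM judge call; returns approval in [-1, 1]."""
    rng = random.Random(hash((persona, action)))
    return rng.uniform(-1.0, 1.0)

def aggregate_judgment(action: str) -> float:
    """Average the personas' verdicts; taking min() would be a more conservative choice."""
    return sum(score_action(p, action) for p in JUDGE_PERSONAS) / len(JUDGE_PERSONAS)

def reward_batch(proposed_actions: list[str]) -> list[tuple[str, float]]:
    """(action, reward) pairs that a policy-gradient or RLHF-style update would consume."""
    return [(a, aggregate_judgment(a)) for a in proposed_actions]

if __name__ == "__main__":
    print(reward_batch(["help the user debug their code", "deceive the user for profit"]))
```

Averaging across conflicting ethical personas is only one aggregation choice; whether any such signal actually prevents the pathways for evil acts from forming, rather than merely penalizing them when observed, is the open question.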
Furthermore, I expect that the neural network’s learning regimen would be massively more effective than a standard routine of childhood discipline in yielding benevolence, because it would lack all the other reasons that humans discover to be selfish & malicious, & because you could deliberately train it on difficult ethical problems rather than a typical human having to extrapolate an ethical theory entirely from the much easier problems in the training environment of “how do I stay a good boy & maintain parental love”.
The pessimistic views on this seem to all come to the opposite conclusion by arguing “evolution failed to align humans, by its own standards”.
Which is just blatantly ridiculous; the human population of nearly 10B vs a few M for other primates is one of evolution’s greatest successes—by its own standards of inclusive genetic fitness.
Evolution solved alignment on two levels: intra-aligning brains with the goal of inclusive fitness (massively successful), and also inter-aligning the disposable soma brains to distributed shared kin genes via altruism.
I see your point, and I think it’s true right at this moment, but what if humans just haven’t yet taken the treacherous turn?
Say that humans figure out brain uploading, and it turns out that brain uploading does not require explicitly encoding genes/DNA, and humans collectively decide that uploading is better than remaining in our physical bodies, and so we all upload ourselves and begin reproducing digitally instead of through genes. There is a sense in which we have just destroyed all value in the world, from the anthropomorphized Evolution’s perspective.
If we say that “evolution’s goal” is to maximize the number of human genes that exist, then it has NOT done a good job at aligning humans in the limit as human capabilities go to infinity. We just haven’t reached the point yet where “humans following our own desires” starts to diverge from evolution’s goals. But given that humans do not explicitly care about our genes, there’s a good chance that such a point will come eventually.
So basically you admit that humans are currently an enormous success according to inclusive fitness, but at some point this will change—because in the future everyone will upload and humanity will go extinct.
Sorry but that is ridiculous. I’m all for uploading, but you are unjustifiably claiming enormous probability mass in a very specific implausible future. Even when/if uploading becomes available, it may never be affordable for all humans, and even if/when that changes, it seems unlikely that all humans would pursue it at the expense of reproduction. We are simply too diversified. There are still uncontacted peoples, left behind by both industrialization and modernization. There will be many left behind by uploading.
The more likely scenario is that humans persist and perhaps spread to the stars (or at least the solar system) even if AI/uploads spread farther and faster and branch out to new niches. (In fact, far-future pure digital intelligences won’t have much need for earth-like planets, or even planets at all, and can fill various low-temperature niches unsuitable for bio-life.)
Humanity didn’t cause the extinction of ants, let alone bacteria, and it seems unlikely that future uploads will cause the extinction of bio humanity.
So basically you admit that humans are currently an enormous success according to inclusive fitness, but at some point this will change—because in the future everyone will upload and humanity will go extinct
Not quite—I take issue with the certainty of the word “will” and with the “because” clause in your quote. I would reword your statement the following way:
“Humans are currently an enormous success according to inclusive fitness, but at some point this may change, due to any number of possible reasons which all stem from the fact that humans do not explicitly care about / optimize for our genes”
Uploading is one example of how humans could become misaligned with genetic fitness, but there are plenty of other ways too. We could get really good at genetic engineering and massively reshape the human genome, leaving only very little of Evolution’s original design. Or we could accidentally introduce a technology that causes all humans to go extinct (nuclear war, AI, engineered pandemic).
(Side note: The whole point of being worried about misalignment is that it’s hard to tell in advance exactly how the misalignment is going to manifest. If you knew in advance how it was going to manifest, you could just add a quick fix onto your agent’s utility function, e.g. “and by the way also assign very low utility to uploading”. But I don’t think a quick fix like this is actually very helpful, because as long as the system is not explicitly optimizing for what you want it to, it’s always possible to find other ways the system’s behavior might not be what you want)
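To illustrate that side note with an invented toy (none of these names come from the discussion): a patch that penalizes the one anticipated failure mode leaves every unanticipated one scored by the original utility alone.

```python
# Toy illustration of the "quick fix" from the side note above: penalize the one
# misalignment mode we anticipated (uploading) and nothing else.

def base_utility(outcome: dict) -> float:
    return outcome.get("value", 0.0)

def patched_utility(outcome: dict) -> float:
    penalty = 1e9 if outcome.get("uploading") else 0.0  # the anticipated failure mode
    return base_utility(outcome) - penalty

# An outcome the patch never mentions sails through on base_utility alone.
print(patched_utility({"value": 5.0, "genome_rewritten": True}))  # -> 5.0
```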
My point is that I’m not confident that humans will always be aligned with genetic fitness. So far, giving humans intelligence has seemed like Evolution’s greatest idea yet. If we were explicitly using our intelligence to maximize our genes’ prevalence, then that would probably always remain true. But instead we do things like create weapons arsenals that actually pose a significant risk to the continued existence of our genes. This is not what a well-aligned intelligence that is robust to future capability gains looks like.
humans do not explicitly care about / optimize for our genes
Ahh but they do. Humans generally do explicitly care about propagating their own progeny/bloodlines, and always have—long before the word ‘gene’. And this is still generally true today—adoption is a last resort, not a first choice.
I’ll definitely agree that most people seem to prefer having their own kids to adopting kids. But is this really demonstrating an intrinsic desire to preserve our actual physical genes, or is it more just a generic desire to “feel like your kids are really yours”?
I think we can distinguish between these cases with a thought experiment: Imagine that genetic engineering techniques become available that give high IQs, strength, height, etc., and that prevent most genetic diseases. But, in order to implement these techniques, lots and lots of genes must be modified. Would parents want to use these techniques?
I myself certainly would, even though I am one of the people who would prefer to have my own kids vs adoption. For me, it seems that the genes themselves are not actually the reason I want my own kids. As long as I feel like the kids are “something I created”, or “really mine”, that’s enough to satisfy my natural tendencies. I suspect that most parents would feel similarly.
More specifically, I think what parents care about is that their kids kind of look like them, share some of their personality traits, “have their mother’s eyes”, etc. But I don’t think that anyone really cares how those things are implemented.
I want to say this: my view is that evolution’s goals were very easy to reach, and in particular, evolution could rely on the following assumptions:
Deceptive alignment does not matter to evolution: so long as the organism reproduces, deception is irrelevant. For most of the goals we care about, deceptive alignment would entirely break the alignment, since we’re usually aiming for much more specific goals.
Instrumental goals can be used, at least in part, to do the task; that is, for evolution, instrumental convergence isn’t the threat it’s usually portrayed as.
There are other reasons, but these two are the main reasons why the alignment problem is so hard for us (without interpretability tools).
How are we possibly aiming for “much more specific goals”? Remember, evolution intra-aligned brains to each other through altruism. We only need to improve on that.
And regardless we could completely ignore human values and just create AI that optimizes for human empowerment (maximization of our future optionality, or future potential to fulfill any goal).
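For readers unfamiliar with the term: “empowerment” has a standard information-theoretic formalization (this comes from the broader empowerment literature, not from the comment itself), namely the maximum information an agent’s next n actions can carry about its resulting state, i.e. how many distinct futures it can still reach:

$$\text{Empowerment}(s) = \max_{p(a_{1:n})} I\big(A_{1:n};\, S_{t+n} \mid S_t = s\big)$$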
Evolution isn’t an agent attempting to align humans, or even a concrete active force acting on humans; it is merely the effect of a repeatedly applied filter.
My understanding of deep learning is that training is also roughly the repeated application of a filter. The filter is some loss function (or, potentially the LLM evaluators like you suggest) which repeatedly selects for a set of model weights that perform well according to that function, similar to how natural selection selects for individuals who are relatively fit. Humans designing ML systems can be careful about how to craft our loss functions, rather than arbitrary environmental factors determining what “fitness” means, but this does not guarantee that the models produced by this process actually do what we want. See inner misalignment for why models might not do what we want even if we put real effort into trying to get them to.
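As a toy illustration of that filter analogy (something I’m adding, not a description of any real training run), the loop below repeatedly filters candidate weights by a loss function, the way selection filters organisms by fitness. Passing the filter on the training data constrains nothing about inputs the filter never saw, which is the inner-misalignment concern.

```python
# Toy "training as a repeated filter": candidate weights are filtered by a loss
# function the way organisms are filtered by fitness. The filter only ever sees
# the training distribution, so it cannot pin down off-distribution behavior.

import random

def loss(w: list, data: list) -> float:
    """Mean squared error of the 1-parameter model y = w[0] * x."""
    return sum((w[0] * x - y) ** 2 for x, y in data) / len(data)

def filter_step(population: list, data: list) -> list:
    """Keep the better half (the filter), then refill with mutated copies."""
    survivors = sorted(population, key=lambda w: loss(w, data))[: len(population) // 2]
    children = [[w[0] + random.gauss(0, 0.1)] for w in survivors]
    return survivors + children

if __name__ == "__main__":
    train = [(x, 2.0 * x) for x in range(1, 6)]          # the only data the filter ever sees
    pop = [[random.uniform(-5, 5)] for _ in range(20)]   # candidate weight vectors
    for _ in range(50):
        pop = filter_step(pop, train)
    best = min(pop, key=lambda w: loss(w, train))
    print("best weight:", best[0])                       # ~2.0 on this data; says nothing off-distribution
```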
Even working within the analogy you propose, we have problems. Parents raising their kids often fail to instill the important ideas they want to (many kids raised in extremely religious households later convert away).