My main disagreement is that I actually do think at least some of the critiques here are right.
In particular, the claim of Quintin Pope’s that I think is right is that evolution is extremely different from how we train our AIs, and thus none of the inferences that work under an evolution model carry over to the AIs under consideration. This importantly undercuts a lot of analogies of the form “apes/Neanderthals made smarter humans (which they didn’t, by the way), those humans presumably failed to be aligned with them, ergo we can’t align AI smarter than us.”
The basic issue, though, is that evolution doesn’t have a purpose or goal, so the common claim that evolution failed to align humans to X is nonsensical: it assumes a teleological goal that simply does not exist in evolution, which is quite different from humans making AIs with particular goals in mind. Talk of an alignment problem between, say, chimps/Neanderthals and humans is likewise nonsensical. This is also why the following generalized example of misgeneralization fails to work: evolution is not a trainer or designer in the way that, say, an OpenAI employee making an AI would be, so there is no generalization error, since there was never a goal or behavior to purposefully generalize in the first place:
“In the ancestral environment, evolution trained humans to do X, but in the modern environment, they do Y instead.”
There are other problems with the analogy that Quintin Pope covered, like the fact that it doesn’t actually capture misgeneralization correctly (the ancient/modern human distinction is not the same as a single AI executing a treacherous turn), or that the example of ice cream overwhelming our reward centers isn’t misgeneralization at all, but the fact that evolution has no purpose or goal is the main problem I see with a lot of evolution analogies.
Another issue is that evolution is extremely inefficient at the timescales required, which is why the dominant training methods for AI borrow little, if anything, from evolution; even from a pure capabilities perspective, it isn’t really worth rerunning evolution to get AI progress.
Some other criticisms from Quintin Pope I agree with are that current AI can already self-improve, albeit more weakly and with more limits than humans (though I agree far less strongly here than he does), and that the security mindset is very misleading and predicts things in ML that don’t actually happen at all. This is why I don’t think adversarial assumptions are good unless you can solve the problem in the worst case easily, or at least as easily as in the non-adversarial case.
“The basic issue though is that evolution doesn’t have a purpose or goal”
FWIW, I don’t think this is the main issue with the evolution analogy. The main issue is that evolution faced a series of basically insurmountable, yet evolution-specific, challenges in successfully generalizing human ‘value alignment’ to the modern environment: optimization over the genome can only influence within-lifetime value formation through insanely unstable Rube Goldberg-esque mechanisms that rely on steps like “successfully zero-shot directing an organism’s online learning processes through novel environments via reward shaping”; accumulated lifetime value learning is mostly reset with each successive generation, without massive fixed corpuses of human text / RLHF supervisors to act as an anchor against value drift; and evolution has a massive optimization power overhang in the inner loop of its optimization process.
These issues fully explain away the ‘misalignment’ humans have with IGF and other intergenerational value instability. If we imagine a deep learning optimization process with an equivalent structure to evolution, then we could easily predict similar stability issues would arise due to that unstable structure, without having to posit an additional “general tendency for inner misalignment” in arbitrary optimization processes, which is the conclusion that Yudkowsky and others typically invoke evolution to support.
In other words, the issues with evolution as an analogy have little to do with the goals we might ascribe to DL/evolutionary optimization processes, and everything to do with simple mechanistic differences in structure between those processes.
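To make that structural claim concrete, here’s a minimal toy sketch of an “evolution-shaped” bilevel optimization process. Everything in it (the numbers, the single-parameter “genome”, the linear reward shaping) is an illustrative assumption of mine rather than a claim about actual biology; the point is purely the shape: the outer loop can only touch the inner loop through reward shaping, the inner loop’s learned values are thrown away each generation, and the inner loop gets vastly more optimization steps than the outer loop.

```python
import random

def run_lifetime(reward_shaping, env_samples, inner_steps=100_000):
    """Inner loop: one organism's online value learning over a lifetime.

    The outer loop never edits learned_values directly; it only picks
    reward_shaping, which has to steer learning zero-shot through whatever
    environment the organism actually encounters.
    """
    learned_values = 0.0  # reset from scratch every generation
    for _ in range(inner_steps):
        experience = random.choice(env_samples)
        learned_values += 0.01 * reward_shaping(experience)
    return learned_values

def evolve(env_samples, outer_steps=10):
    """Outer loop: selection over 'genomes', reduced here to a single
    reward-shaping coefficient. Note the optimization-power mismatch:
    ~10 outer updates vs. ~100,000 inner updates per lifetime."""
    genome = 1.0
    for _ in range(outer_steps):
        candidate = genome + random.gauss(0, 0.1)
        new_fitness = run_lifetime(lambda x: candidate * x, env_samples)
        old_fitness = run_lifetime(lambda x: genome * x, env_samples)
        if new_fitness > old_fitness:
            genome = candidate  # only the genome persists across generations
    return genome

print(evolve(env_samples=[1.0, -0.5, 2.0]))
```

Any optimization process with this shape would plausibly show the same intergenerational instability, which is all the evolution example really demonstrates.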
I’m curious to hear more about this. Reviewing the analogy:
Evolution, ‘trying’ to get general intelligences that are great at reproducing <--> The AI Industry / AI Corporations, ‘trying’ to get AGIs that are HHH

Genes, instructing cells on how to behave and connect to each other and in particular how synapses should update their ‘weights’ in response to the environment <--> Code, instructing GPUs on how to behave and in particular how ‘weights’ in the neural net should update in response to the environment

Brains, growing and learning over the course of lifetime <--> Weights, changing and learning over the course of training
Now turning to your three points about evolution:
Optimizing the genome indirectly influences value formation within lifetime, via this unstable Rube Goldberg mechanism that has to zero-shot direct an organism’s online learning processes through novel environments via reward shaping --> translating that into the analogy, it would be “optimizing the code indirectly influences value formation over the course of training, via this unstable Rube Goldberg mechanism that has to zero-shot direct the model’s learning process through novel environments via reward shaping”… yep, seems to check out. idk. What do you think?
Accumulated lifetime value learning is mostly reset with each successive generation without massive fixed corpuses of human text / RLHF supervisors --> Accumulated learning in the weights is mostly reset when new models are trained since they are randomly initialized; fortunately there is a lot of overlap in training environment (internet text doesn’t change that much from model to model) and also you can use previous models as RLAIF supervisors… (though isn’t that also analogous to how humans generally have a lot of shared text and culture that spans generations, and also each generation of humans literally supervises and teaches the next?)
Massive optimization power overhang in the inner loop of its optimization process --> isn’t this increasingly true of AI too? Maybe I don’t know what you mean here. Can you elaborate?
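For what it’s worth, here’s a toy sketch of how I read the AI side of the translation, especially the second point. Everything in it is an illustrative assumption (a made-up one-parameter “model”, a squared-error loss, a fabricated corpus), not a description of any real training pipeline; it just shows where the proposed anchors sit: each new model does start from a fresh random init, but it is trained against a largely fixed corpus and can be pulled toward a previous model acting as an RLAIF-style supervisor, which are anchors the evolutionary setup lacked.

```python
import random

# A largely fixed corpus of (input, target) pairs, playing the role of internet
# text that "doesn't change that much from model to model".
FIXED_CORPUS = [(x * 0.1, 2.0 * (x * 0.1)) for x in range(100)]

def train_new_model(corpus, supervisor=None, steps=5_000, lr=0.01):
    """Train a one-parameter 'model' from scratch on the corpus.

    supervisor, if given, is a previously trained weight whose outputs the new
    model is regularized toward, a stand-in for RLAIF-style supervision.
    """
    weight = random.gauss(0.0, 1.0)  # fresh random init each "generation"
    for _ in range(steps):
        x, y = random.choice(corpus)
        grad = 2 * (weight * x - y) * x  # anchor 1: the (mostly) fixed corpus
        if supervisor is not None:
            # anchor 2: pull toward the previous model's outputs
            grad += 0.1 * 2 * (weight * x - supervisor * x) * x
        weight -= lr * grad
    return weight

gen1 = train_new_model(FIXED_CORPUS)                   # first generation: corpus anchor only
gen2 = train_new_model(FIXED_CORPUS, supervisor=gen1)  # later generations: corpus + previous model
print(gen1, gen2)
```

Of course, as noted above, shared text, culture, and each generation teaching the next arguably give humans partial analogues of both anchors, which is part of what I’m asking about.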