Apparently Eliezer decided not to take the time to read e.g. Quintin Pope’s actual critiques, but he does have time to write a long chain of strawmen and smears-by-analogy.
A lot of Quintin Pope’s critiques are just obviously wrong, and lots of commenters were offering to help correct them. In such a case, it seems legitimate to me for a busy person to request that Quintin sort out the problems together with the commenters before spending time on it. Even from the perspective of correcting and informing Eliezer, people can be corrected and informed more effectively if their attention is guided to the right place, with junk/distractions removed.
(Note: I mainly say this because I think the main point of the message you and Quintin are raising does not stand up to scrutiny, and so I mainly think the value the message can provide is in certain technical corrections that you don’t emphasize as much, even if strictly speaking they are part of your message. If I thought the main point of your message stood up to scrutiny, I’d also think it would be Eliezer’s job to realize it despite the inconvenience.)
I stand by pretty much everything I wrote in Objections, with the partial exception of the stuff about strawberry alignment, which I should probably rewrite at some point.
Also, Yudkowsky explained exactly how he’d prefer someone to engage with his position: “To grapple with the intellectual content of my ideas, consider picking one item from “A List of Lethalities” and engaging with that.” I pointed out that I’d previously done exactly that, in a post that quotes exactly one point from LoL and explains why it’s wrong. I’ve gotten no response from him on that post, so it seems clear that Yudkowsky isn’t running an optimal ‘good discourse promoting’ engagement policy.
I don’t hold that against him, though. I personally hate arguing with people on this site.
Unless I’m greatly misremembering, you did pick out what you said was your strongest item from Lethalities, separately from this, and I responded to it. You’d just straightforwardly misunderstood my argument in that case, so it wasn’t a long response, but I responded. Asking for a second try is one thing, but I don’t think it’s cool to act like you never picked out any one item or I never responded to it.

EDIT: I’m misremembering, it was Quintin’s strongest point about the Bankless podcast. https://www.lesswrong.com/posts/wAczufCpMdaamF9fy/my-objections-to-we-re-all-gonna-die-with-eliezer-yudkowsky?commentId=cr54ivfjndn6dxraD
I’m kind of ambivalent about this. On the one hand, when there is a misunderstanding, but he claims his argument still goes through after correcting the misunderstanding, it seems like you should also address that corrected form. On the other hand, Quintin Pope’s correction does seem very silly. At least by my analysis:
Similarly, the reason that “GPT-4 does not get smarter each time an instance of it is run in inference mode” is because it’s not programmed to do that[7]. OpenAI could[8] continuously train its models on the inputs you give it, such that the model adapts to your particular interaction style and content, even during the course of a single conversation, similar to the approach suggested in this paper. Doing so would be significantly more expensive and complicated on the backend, and it would also open GPT-4 up to data poisoning attacks.
This approach considers only the things OpenAI could do with their current ChatGPT setup, and yes it’s correct that there’s not much online learning opportunity in this. But that’s precisely why you’d expect GPT+DPO to not be the future of AI; Quintin Pope has clearly identified a capabilities bottleneck that prevents it from staying fully competitive. (Note that humans can learn even if there is a fraction of people who are sharing intentionally malicious information, because unlike GPT and DPO, humans don’t believe everything we’re told.)
A more autonomous AI could collect actionable information at much greater scale, as it wouldn’t be dependent on trusting its users for evaluating what information to update on, and it would have much more information about what’s going on than the chat-based I/O.
This sure does look to me like a huge bottleneck that’s blocking current AI methods, analogous to the evolutionary bottleneck: The full power of the AI cannot be used to accumulate OOM more information to further improve the power of the AI.
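To make the kind of within-conversation updating Quintin describes concrete, here is a minimal sketch of single-step online fine-tuning on new conversation text. This is my own illustration, not OpenAI’s actual setup; the small placeholder model, learning rate, and function name are assumptions. It also makes the data-poisoning worry tangible, since every user message becomes training data.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder small model; the actual argument concerns frontier models like GPT-4.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def online_update(latest_turn: str) -> float:
    """Take one gradient step on the newest conversation text (within-session learning)."""
    batch = tokenizer(latest_turn, return_tensors="pt", truncation=True)
    loss = model(**batch, labels=batch["input_ids"]).loss  # standard causal-LM loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    # Every call treats a user-supplied string as training data,
    # which is exactly the data-poisoning surface mentioned in the quote above.
    return loss.item()

# Usage: call online_update(turn_text) after each exchange, so the weights
# drift toward that particular conversation's style and content.
```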
My main disagreement is that I actually do think that at least some of the critiques are right here.
In particular, the claim Quintin Pope makes that I think is right is that evolution is extremely different from how we train our AIs, and thus the inferences that work under an evolution model don’t carry over to the AIs under consideration. That importantly includes a lot of analogies of the form: apes/Neanderthals made smarter humans (which they didn’t do, BTW), those humans presumably failed to be aligned with their predecessors, ergo we can’t align AI smarter than us.
The basic issue though is that evolution doesn’t have a purpose or goal, and thus the common claim that evolution failed to align humans to X is nonsensical: it assumes a teleological goal that just does not exist in evolution, which is quite different from humans making AIs with particular goals in mind. Thus talk of an alignment problem between, say, chimps/Neanderthals and humans is entirely nonsensical. This is also why this generalized example of misgeneralization fails to work: evolution is not a trainer or designer in the way that, say, an OpenAI employee making an AI would be, and thus there is no generalization error, since there wasn’t a goal or behavior to purposefully generalize in the first place:
“In the ancestral environment, evolution trained humans to do X, but in the modern environment, they do Y instead.”
There are other problems with the analogy that Quintin Pope covered, like the fact that it doesn’t actually capture misgeneralization correctly (the ancient/modern human distinction is not the same as one AI doing a treacherous turn), or how the example of ice cream overwhelming our reward center isn’t misgeneralization, but the fact that evolution has no purpose or goal is the main problem I see with a lot of evolution analogies.
Another issue is that evolution is extremely inefficient at the timescales required, which is why dominant training methods for AI borrow little from evolution at best, and even from an AI capabilities perspective it’s not really worth it to rerun evolution to get AI progress.
Some other criticisms from Quintin Pope I agree with are that current AI can already self-improve, albeit more weakly and with more limits than humans (though I agree far less strongly here than Quintin Pope does), and that the security mindset is very misleading and predicts things in ML that don’t actually happen, which is why I don’t think adversarial assumptions are good unless you can solve the problem in the worst case easily, or just as easily as the non-adversarial cases.
The basic issue though is that evolution doesn’t have a purpose or goal
FWIW, I don’t think this is the main issue with the evolution analogy. The main issue is that evolution faced a series of basically insurmountable, yet evolution-specific, challenges in successfully generalizing human ‘value alignment’ to the modern environment, such as the fact that optimization over the genome can only influence within-lifetime value formation through insanely unstable Rube Goldberg-esque mechanisms that rely on steps like “successfully zero-shot directing an organism’s online learning processes through novel environments via reward shaping”, or the fact that accumulated lifetime value learning is mostly reset with each successive generation without massive fixed corpuses of human text / RLHF supervisors to act as an anchor against value drift, or evolution having a massive optimization power overhang in the inner loop of its optimization process.
These issues fully explain away the ‘misalignment’ humans have with IGF and other intergenerational value instability. If we imagine a deep learning optimization process with an equivalent structure to evolution, then we could easily predict similar stability issues would arise due to that unstable structure, without having to posit an additional “general tendency for inner misalignment” in arbitrary optimization processes, which is the conclusion that Yudkowsky and others typically invoke evolution to support.
In other words, the issues with evolution as an analogy have little to do with the goals we might ascribe to DL/evolutionary optimization processes, and everything to do with simple mechanistic differences in structure between those processes.
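As a toy illustration of the structural claim (a sketch of my own, with made-up names and numbers, not anything from the original argument): the comparison is between a slow outer loop over the ‘genome’, which gets one coarse selection signal per generation, and a fast inner loop of within-lifetime learning that performs vastly more updates but is discarded at the end of each lifetime.

```python
import random

def lifetime_learning(reward_shaping: float, n_steps: int = 10_000) -> float:
    # Inner loop ("within-lifetime learning"): many cheap updates, only loosely
    # steered by whatever reward shaping the genome managed to encode.
    learned_value = 0.0
    for _ in range(n_steps):
        learned_value += 0.001 * reward_shaping + random.gauss(0.0, 0.01)
    return learned_value  # discarded at "death"; never written back to the genome

def evolution(n_generations: int = 200) -> float:
    # Outer loop ("genome updates"): one coarse, noisy bit of feedback per
    # generation, versus ~10,000 inner-loop updates per generation above --
    # the "optimization power overhang in the inner loop".
    reward_shaping = 0.0
    for _ in range(n_generations):
        outcome = lifetime_learning(reward_shaping)
        reward_shaping += 0.01 if outcome > 0 else -0.01  # selection sees only coarse outcomes
    return reward_shaping
```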
I’m curious to hear more about this. Reviewing the analogy:
Evolution, ‘trying’ to get general intelligences that are great at reproducing <--> The AI Industry / AI Corporations, ‘trying’ to get AGIs that are HHH
Genes, instructing cells on how to behave and connect to each other and in particular how synapses should update their ‘weights’ in response to the environment <--> Code, instructing GPUs on how to behave and in particular how ‘weights’ in the neural net should update in response to the environment
Brains, growing and learning over the course of lifetime <--> Weights, changing and learning over the course of training
Now turning to your three points about evolution:
Optimizing the genome indirectly influences value formation within lifetime, via this unstable Rube Goldberg mechanism that has to zero-shot direct an organism’s online learning processes through novel environments via reward shaping --> translating that into the analogy, it would be “optimizing the code indirectly influences value formation over the course of training, via this unstable Rube Goldberg mechanism that has to zero-shot direct the model’s learning process through novel environments via reward shaping”… yep, seems to check out. idk. What do you think?
Accumulated lifetime value learning is mostly reset with each successive generation without massive fixed corpuses of human text / RLHF supervisors --> Accumulated learning in the weights is mostly reset when new models are trained since they are randomly initialized; fortunately there is a lot of overlap in training environment (internet text doesn’t change that much from model to model) and also you can use previous models as RLAIF supervisors… (though isn’t that also analogous to how humans generally have a lot of shared text and culture that spans generations, and also each generation of humans literally supervises and teaches the next?)
Massive optimization power overhang in the inner loop of its optimization process --> isn’t this increasingly true of AI too? Maybe I don’t know what you mean here. Can you elaborate?
Can people who vote disagree also mark the parts they disagree with using reacts or something?
Do you think that if someone filtered and steelmanned Quintin’s criticism, it would be valuable? (No promises)
Yes.
Filtering away mistakes, unimportant points, unnecessary complications, etc., from preexisting ideas is (as long as the core idea one extracts is good) a very general way to contribute value, because it makes the ideas involved easier to understand.
Adding stronger arguments, more informative and accessible examples, etc. contributes value because it shows which parts are more robust and gives more material to dig into when trying to understand the idea, and also because it clarifies why some people may find the idea attractive.
Explanations for the changes, especially for the dropped things, can build value because it clarifies the consensus about what parts were wrong, and if Quintin disagrees with the removals, it provides signals to him about what he didn’t clarify well enough.
When these are done on a sufficiently important point, with sufficient skill, and maybe also with sufficient luck, this can in principle provide a ton of value, both because information in general is high-leverage due to being easily shareable, and because this particular form of information can help resolve conflicts and rebuild trust.