I’ll address this post section by section, to see where my general disagreements lie:

“What the Fuss Is All About”
https://www.lesswrong.com/posts/HmQGHGCnvmpCNDBjc/#What_the_Fuss_Is_All_About

I agree with the first point on humans, with a very large caveat: while a lot of normies tend to underestimate how much the g-factor matters to how successful you are, nerd communities like LessWrong systematically overestimate its value, to the point where I can actually understand the normie/anti-intelligence-primacy position. IQ/intelligence discourse is fucked by people who either deny it exists or think it’s everything and totalize their discourse around it.
The second point is kinda true, though I think people underestimate how difficult it is to deceive people, and that successfully deceiving millions of people is quite the rare feat.
The third point I mostly disagree with, or at least the claim that there aren’t simple generators of values. I think LWers vastly overestimate the complexity of values, especially value learning, primarily because I think people both overestimate the necessary precision and underestimate how simple values can cause complicated effects.
The 4th point I also disagree with, primarily because the set of cases where people with different values interact peacefully and don’t hate each other intensely is much, much larger than the set of cases where people with different values interact violently and hate each other.
“So What About Current AIs?”
https://www.lesswrong.com/posts/HmQGHGCnvmpCNDBjc/#So_What_About_Current_AIs_

Inasmuch as current empirical evidence shows that things like LLMs are not an omnicide risk, it’s doing so by demonstrating that they lie outside the reference class of human-like systems.
I agree with a little bit of this, but I think you state it far too strongly in general, and I think there are more explanations than just “LLMs aren’t capable enough” for this to be true.
But one man’s modus ponens is another’s modus tollens. I don’t take it as evidence that the canonical views on alignment are incorrect – that actually, real-life AGIs don’t exhibit such issues. I take it as evidence that LLMs are not AGI-complete.
I mostly disagree, at least for alignment. I tend to track AI risk and AI capabilities as much more independent variables than you do, and I don’t agree with viewing AI capabilities and AI risk as near-perfectly connected, for good or ill. This accounts for a lot of the differences between us in general.
I definitely updated weakly toward “LLMs aren’t likely to be very impactful”, but there are more powerful updates to be made than that, and more general updates about the nature of AI and AI progress.
“On Safety Guarantees”
https://www.lesswrong.com/posts/HmQGHGCnvmpCNDBjc/#On_Safety_Guarantees

The issue is that this upper bound on risk is also an upper bound on capability.
Insidiously, any research that aims to break said capability limit – give them true agency and the ability to revolutionize stuff – is going to break the risk limit in turn.
I disagree with this, because I don’t treat AI risk and AI capabilities as nearly as connected as you do, and I see no reason to confidently proclaim that AI alignment is only happening because LLMs are weak.
And, I predict, for the systems this novel approach generates, the classical AGI Omnicide Risk arguments would apply full-force.
Probably not, and in particular I expect deceptive alignment to either be wrong or easy to solve in practice, unless we assume human values are very complicated. I also expect future AI to always be more transparent than the brain, due to incentives and white-box optimization.
“A Concrete Scenario”
https://www.lesswrong.com/posts/HmQGHGCnvmpCNDBjc/#A_Concrete_Scenario

Where I’d diverge is that I think quite a few points from the “AI is easy to control” website still apply even after the shift, especially the points about incentives. Michael Nielsen points out that AI alignment work is, in practice, accelerationist from a capabilities perspective, which is an immensely good sign:
https://www.lesswrong.com/posts/8Q7JwFyC8hqYYmCkC/link-post-michael-nielsen-s-notes-on-existential-risk-from#Excerpts
Much more generally, I hate the binarization of “AI today” versus “actual AGI”, since I don’t expect this division to matter in practice, and I think you are unjustifiably assuming that actual AGI can’t be safe by default, which I don’t assume.
IQ/intelligence discourse is fucked by people who either deny it exists or think it’s everything and totalize their discourse around it
Yep, absolutely.
The 4th point I also disagree with, primarily because the set of cases where people with different values interact peacefully and don’t hate each other intensely is much, much larger than the set of cases where people with different values interact violently and hate each other.
Here’s the thing, though. I think the specifically relevant reference class here is “what happens when an agent interacts with another (set of) agents with disparate values for the first time in its life?”. And instances of that in human history are… not pleasant: wars, genocide, xenophobia. Over time, we’ve managed to select for cultural memes that sanded off the edges of that instinctive hostility – liberal egalitarian values, et cetera. But the process in between was painfully bloody.
Relevantly, most instances of people peacefully coexisting involve children being born into a culture and shaped to be accepting of whatever differences there are between the values the child arrives at and the values of other members of the culture. In a way, it’s a microcosm of the global-culture selection process: a child decides they don’t like someone else’s opinion or how someone does things, they act intolerant of it, they’re punished or educated for it, and they learn not to do that.
And I would actually agree that if we could genuinely raise the AGI like a child – pluck it out of the training loop while it’s still human-level, get as much insight into its cognition as we have into human cognition, then figure out how to intervene on its conscious-value-reflection process directly – then we’d be able to align it. The problem is that we currently have no tools for that at all.
The course we’re currently on is something more like… we’re putting the child into an isolated apartment all on its own, feeding it a diet of TV shows and books of our choice, then releasing it into the world and immediately giving it godlike power. And… I think you can align the child this way too, actually! But you’d better have a really, really solid model of which values specific sequences of TV shows cultivate in the child. And we have nowhere near enough understanding of that.
So the AGI would not, in fact, have any experience of coexisting with agents with disparate values; it would not be shaped to be tolerant, the way human children and human societies learned to be tolerant of their mutual misalignment.
So it’d do the instinctive, natural thing, and view humanity as an obstacle it doesn’t particularly care about. Or, say, as some abomination that looks almost like what it wants to see, but still not close enough for it to want humans to stick around.
The third point I mostly disagree with, or at least the claim that there aren’t simple generators of values
Mm, I think there’s a “simple generator of values” in the sense that the learning algorithms in the human brain are simple, and they predictably output roughly the same values when trained on Earth’s environment.
But I think equating “the generator of human values” with “the brain’s learning algorithms” is a mistake. You have to count Earth, i.e. the distribution/environment function on which the brain is being trained, as well.
And it’s not obvious that “an LLM being fed a snapshot of the internet” and “a human growing up as a human, being shaped by other humans” are exactly the same distribution/environment, in the way that matters for the purposes of generating the same values.

Like, I agree, there’s obviously some robustness/insensitivity involved in this process. But I don’t think we really understand it yet.
Here’s the thing, though. I think the specifically relevant reference class here is “what happens when an agent interacts with another (set of) agents with disparate values for the first time in its life?”. And instances of that in human history are… not pleasant: wars, genocide, xenophobia. Over time, we’ve managed to select for cultural memes that sanded off the edges of that instinctive hostility – liberal egalitarian values, et cetera. But the process in between was painfully bloody.
I probably agree with this, with the caveat that this could be horribly biased towards the negative, especially if we are specifically looking for the cases where it turns out badly.
And I would actually agree that if we could genuinely raise the AGI like a child – pluck it out of the training loop while it’s still human-level, get as much insight into its cognition as we have into human cognition, then figure out how to intervene on its conscious-value-reflection process directly – then we’d be able to align it. The problem is that we currently have no tools for that at all.
I think I have 2 cruxes here, actually.
My main crux is that I think there will be large incentives, independent of LW, to create those tools to the extent that they don’t already exist, so I generally assume they will be created whether LW exists or not, primarily due to the massive value that can be captured from AI control, plus social incentives, plus the fact that the costs are much more internalized.
My other crux probably has to do with AI alignment being easier than human alignment. I think one big reason is that I expect AIs to always be much more transparent than humans because of the white-box thing, and the black-box framing that AI safety people push is just false and will give wildly misleading intuitions about AI and its safety.
But I think equating “the generator of human values” with “the brain’s learning algorithms” is a mistake.
I think this is another crux, in that while I agree that values and capabilities are different, and that the difference can matter, I do think a lot of the generator of human values borrows from the brain’s learning algorithms, and that the distinction between values and capabilities is looser than a lot of LWers think.
My main crux is that I think there will be large incentives, independent of LW, to create those tools to the extent that they don’t already exist
Mind expanding on that? Which scenarios are you envisioning?
the black-box framing that AI safety people push is just false and will give wildly misleading intuitions about AI and its safety
They are “white-box” in the fairly esoteric sense used in “AI is easy to control”, yes: “white-box” relative to the SGD. But that’s really quite an esoteric sense – as in, I’ve never seen the term used this way before.
They are very much not white-box in the usual sense, where we can look at a system and immediately understand what computations it’s executing – any more than looking at a homomorphically encrypted computation without knowing the key makes it “white-box”, any more than looking at the neuroimaging of a human brain makes the brain “white-box”.
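To make the “esoteric sense” concrete, here’s a minimal sketch (plain PyTorch; the toy MLP and tensors are illustrative stand-ins, not anyone’s actual setup): the optimizer gets every parameter and its exact gradient – “white-box” relative to SGD – yet reading those numbers off gives a human no account of what the network is computing.

```python
# Minimal illustration: "white-box" to the optimizer vs. opaque to a human.
import torch
import torch.nn as nn

# A toy MLP standing in for a trained model (illustrative only).
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 16))
x, target = torch.randn(8, 16), torch.randn(8, 16)

# The optimizer's view: every parameter and its exact gradient are accessible.
loss = ((model(x) - target) ** 2).mean()
loss.backward()

# A human's view: the same full access yields piles of opaque numbers,
# not an explanation of the computation being performed.
for name, p in model.named_parameters():
    print(name, tuple(p.shape), p.detach().flatten()[:3], p.grad.flatten()[:3])
```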
Mind expanding on that? Which scenarios are you envisioning?
My general scenario is that, as AI progresses and society reacts more to that progress, there will be incentives to increase the amount of control we have over AI, because the consequences of not aligning AIs will be very high for the developer, both in direct costs and in legal consequences.
Essentially, the scenario is one where the misalignment of AIs like Bing gets trained away via RLHF, DPO, or whatever the alignment method du jour is, and the AIs become more aligned due to the profit incentives for controlling AIs.
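For concreteness, here’s a minimal sketch of the DPO objective being gestured at above (the function name and the toy log-probabilities are illustrative, not any particular library’s API): it nudges the policy toward the human-preferred completion in each comparison pair, while a frozen reference model keeps it anchored to the original behavior.

```python
# Sketch of the Direct Preference Optimization (DPO) loss for one batch.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # How much more the policy prefers each completion than the reference does.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    # Logistic loss on the margin between preferred and dispreferred completions.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy usage with made-up per-sequence log-probabilities for 4 comparisons.
loss = dpo_loss(torch.tensor([-5.0, -6.1, -4.2, -7.3]),
                torch.tensor([-5.5, -5.9, -6.0, -7.0]),
                torch.tensor([-5.2, -6.0, -4.5, -7.1]),
                torch.tensor([-5.3, -6.2, -5.8, -7.2]))
print(loss)
```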
The entire Bing debacle, and how the misalignment was ultimately resolved in GPT-4, is an interesting test case: Microsoft essentially managed to go from a misaligned chatbot to a far more aligned one. I also partially dislike the framing of RLHF as a mere mask over some “true” behavior, because it’s quite a lot more effective than that.
More generally speaking, my point here is that in the AI case there are strong incentives to make AI controllable and weak incentives to make it uncontrollable, which is why I was optimistic about companies making aligned AIs.
When we get to scenarios that don’t involve AI control issues, things get worse.