You’ve got to think about what might be going on behind the scenes, in both cases.
But a tricky bit with AI is that it involves innovating fundamentally new ways of doing things. The methods we already have are not sufficient to create ASI, and if you extrapolate the current SOTA methods to larger scale, the result is genuinely not that dangerous. Rather, with AI we imagine either that people will make up new things behind the scenes that are radically different from what we have so far, or that what we have so far will turn out to be much more powerful due to being radically different from how we understand it today.
The methods we already have are not sufficient to create ASI, and if you extrapolate the current SOTA methods to larger scale, the result is genuinely not that dangerous.
I think I like the disjunct “If it’s smart enough to be transformative, it’s smart enough to be dangerous”, where the contrapositive further implies competitive pressures towards creating something dangerous (as opposed to not doing that).
There’s still a rub here—namely, operationalizing “transformative” in such a way as to give the necessary implications (both “transformative → dangerous” and “not transformative → competitive pressures towards capability gain”). This is where I expect intuitions to differ the most, since in the absence of empirical observations there seem to be multiple consistent views.
I think I like the disjunct “If it’s smart enough to be transformative, it’s smart enough to be dangerous”, where the contrapositive further implies competitive pressures towards creating something dangerous (as opposed to not doing that).
That (on its own, without further postulates) is a fully general argument against improving intelligence. We have to accept some level of danger inherent in existence; the question is what makes AI particularly dangerous. If this special factor isn’t present in GPT+DPO, then GPT+DPO is not an AI notkilleveryoneism issue.
That (on its own, without further postulates) is a fully general argument against improving intelligence.
Well, it’s primarily a statement about capabilities. The intended construal is that if a given system’s capabilities profile permits it to accomplish some sufficiently transformative task, then that system’s capabilities are not limited to only benign such tasks. I think this claim applies to most intelligences that can arise in a physical universe like our own (though, given no-free-lunch theorems, necessarily not in all logically possible universes): that there exists no natural subclass of transformative tasks that includes only benign such tasks.
(Where, again, the rub lies in operationalizing “transformative” such that the claim follows.)
We have to accept some level of danger inherent in existence; the question is what makes AI particularly dangerous. If this special factor isn’t present in GPT+DPO, then GPT+DPO is not an AI notkilleveryoneism issue.
I’m not sure how likely GPT+DPO (or GPT+RLHF, or in general GPT-plus-some-kind-of-RL) is to be dangerous in the limits of scaling. My understanding of the argument against is that the base (large language) model derives most (if not all) of its capabilities from imitation, and that the amount of RL needed to elicit desirable behavior from that base set of capabilities isn’t enough to introduce substantial additional strategic/goal-directed cognition on top of the imitative paradigm. In other words, the amount and kinds of training we’ll be doing in practice are more likely to bias the model towards behaviors that were already part of the base model’s (primarily imitative) predictive distribution than they are to elicit strategic thinking de novo.
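(For concreteness, here is a minimal sketch of what that kind of training looks like under DPO, as I understand the objective from Rafailov et al. (2023); the function and variable names are just illustrative. The thing it is meant to show is that the loss only credits the policy for shifting probability mass relative to a frozen reference copy of the base model, which is the formal sense in which this style of tuning stays anchored to the base model’s imitative distribution.)

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Sketch of the DPO objective (Rafailov et al., 2023).

    Inputs are tensors of summed log-probabilities of the preferred
    ("chosen") and dispreferred ("rejected") completions, under the policy
    being tuned and under a frozen reference copy of the base model.
    """
    # Implicit rewards are log-probability ratios against the reference
    # model, so the policy is only rewarded for movement *relative to*
    # the base (imitative) distribution.
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected implicit rewards.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```

(Here beta plays the role of the KL-penalty strength in the underlying RLHF objective, so larger values keep the tuned model closer to the reference distribution.)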
That strikes me as substantially an empirical proposition, which I’m not convinced the evidence from current models says a whole lot about. But the disjunct I mentioned doesn’t come in as an argument for or against that proposition; you can instead see it as a larger claim that parametrizes the class of systems for which the smaller claim might or might not be true, with respect to certain capability thresholds associated with specific kinds of tasks. And what the larger claim says is that, to the extent that GPT+DPO (and associated paradigms) fail to produce reasoners which could (in terms of capability, saying nothing about alignment or “motive”) be dangerous, they will also fail to be “transformative”—which in turn is an issue in precisely those worlds where systems with “transformative” capabilities are economically incentivized over systems without them (which is itself another empirical question!).
What I’m saying is that if GPT+DPO creates imitation-based intelligences that can be dangerous due to being intentionally instructed to do something bad (“hey, please kill that guy” and then it kills him), then that’s not particularly concerning from an AI alignment perspective, because it has a similar danger profile to giving the same instruction to a human. You would still want policy to govern it, similar to how we have policy to govern human-on-human violence, but it’s not the kind of x-risk that notkilleveryoneism is about.
So basically you can have “GPT+DPO is superintelligent, capable and dangerous” without having “GPT+DPO is an x-risk”. That said, I expect GPT+DPO to stagnate and be replaced by something else, and that something else could be an x-risk (and conditional on the negation of natural impact regularization, I strongly expect it would be).
To the extent that I buy the story about imitation-based intelligences inheriting safety properties via imitative training, I correspondingly expect such intelligences not to scale to having powerful, novel, transformative capabilities—not without an amplification step somewhere in the mix that does not rely on imitation of weaker (human) agents.
Since I believe this, I find it hard to concretely visualize the hypothetical of a superintelligent GPT+DPO agent that nevertheless only does what it is instructed to do. I mostly don’t expect to be able to get to superintelligence without either (1) the “RL” portion of the GPT+RL paradigm playing a much stronger role than it does for current systems, or (2) using some other training paradigm entirely. And the argument for obedience/corrigibility becomes, respectively, weaker or nonexistent in each of those cases.
Possibly we’re in agreement here? You say you expect GPT+DPO to stagnate and be replaced by something else; I agree with that. I merely happen to think the reason it will stagnate is that its safety properties don’t come free; they’re bought and paid for with a price in capabilities.
Are we using the word “transformative” in the same way? I imagine that if society got reorganized into e.g. AI minds that hire tons of people to continually learn novel tasks that they can then imitate, that would be considered transformative, because it would entirely change people’s role in society, like the agricultural revolution did. Whereas right now very few people have jobs that are explicitly about pushing the frontier of knowledge, in the future that might be ~the only job that exists (conditional on GPT+DPO being the future, which again is not a mainline scenario).
One core problem with AI is that it’s not just “people” who make up new things behind the scenes, but AI itself that will make up new things.