Rohin’s opinion: [...] Overall I agree pretty strongly with Ben. I do think that some of the counterarguments are coming from a different frame than the classic arguments. For example, a lot of the counterarguments involve an attempt to generalize from current ML practice to make claims about future AI systems. However, I usually imagine that the classic arguments are basically ignoring current ML, and instead claiming that if an AI system is superintelligent, then it must be goal-directed and have convergent instrumental subgoals.
I agree that the book Superintelligence does not mention any non-goal-directed approaches to AI alignment (as far as I can recall). But as long as we’re in the business-as-usual state, we should expect some well-resourced companies to train competitive goal-directed agents that act in the real world, right? (E.g. Facebook plausibly uses some deep RL approach to create the feed that each user sees). Do you agree that for those systems, the classic arguments about instrumental convergence and the treacherous turn are correct? (If so, I don’t understand why you agree pretty strongly with Ben; he seems to be skeptical that those arguments can be mapped to contemporary ML methods.)
The intent with the part you quoted was to point out a way in which I disagree with Ben (though I’m not sure it’s a disagreement rather than a difference in emphasis).
we should expect some well-resourced companies to train competitive goal-directed agents that act in the real world [...] Do you agree that for those systems, the classic arguments about instrumental convergence and the treacherous turn are correct? (If so, I don’t understand why you agree pretty strongly with Ben; he seems to be skeptical that those arguments can be mapped to contemporary ML methods.)
I don’t expect that companies will train misaligned superintelligent goal-directed agents that act in the real world. (I still think it’s likely enough that we should work on the problem.)
Like, why didn’t the earlier less intelligent versions of the system fail in some non-catastrophic way, or if they did, why didn’t that cause Facebook to think it shouldn’t build superintelligent goal-directed agents / figure out how to build aligned agents?
Ben’s counterarguments are about whether misaligned superintelligent systems arise in the first place, not whether such systems would exhibit instrumental convergence and attempt treacherous turns:
- He thinks it isn’t likely that there will be a sudden jump to extremely powerful and dangerous AI systems, and he thinks we have a much better chance of correcting problems as they come up if capabilities grow gradually.
- He thinks that making AI systems capable and making AI systems have the right goals are likely to go together.
- He thinks that even though there are many ways to create a system that behaves destructively, that doesn’t mean the engineering process creating the system is likely to be attracted to those destructive designs; we seem unlikely to accidentally create systems destructive enough to end humanity.
Like, why didn’t the earlier less intelligent versions of the system fail in some non-catastrophic way
Even if we assume there will be no algorithm-related discontinuity, I think the following are potential reasons:
Detecting deceptive behaviors in complicated environments may be hard. To continue with the Facebook example, suppose that at some point in the future Facebook’s feed-creation agent behaves deceptively in some non-catastrophic way. Suppose it uses some unacceptable technique to increase user engagement (e.g. making users depressed), but it refrains from doing so in situations where it predicts that Facebook engineers would notice. The agent is not that great at deception, though, and it often ends up using the unacceptable technique when there is actually a high risk of being noticed. Thus, Facebook engineers do notice the unacceptable technique at some point and fix the reward function accordingly (penalizing depressing content or whatever). But how will they detect the deceptive behavior itself? Will they be on the lookout for deceptive behavior and use clever techniques to detect it? (If so, what made Facebook transition into a company that takes AI safety seriously?)
Huge scale-ups without much intermediate testing. Suppose at some point in the future, Facebook decides to scale up the model and training process of their feed-creation agent by 100x (by assumption, data is not the bottleneck). It seems to me that this new agent may pose an existential risk even conditioned on the previous agent being completely benign. If you think that Facebook is unlikely to do a 100x scale-up in one go, suppose that their leadership comes to believe that the scale-up would cause their revenue to increase in expectation by 10%. That’s ~$7B per year, so they are probably willing to spend a lot of money on the scale-up. Also, they may want to complete the scale-up ASAP, because they “lose” $134M for every week of delay.
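A quick back-of-the-envelope check of those figures (a sketch only; the ~$70B annual revenue is just the number implied by “10% of revenue ≈ $7B/year”, not an official figure):

```python
# Back-of-the-envelope check of the revenue figures above.
# Assumption: annual revenue of roughly $70B, i.e. the number implied by
# "a 10% increase in revenue is ~$7B per year".
annual_revenue = 70e9                         # assumed annual revenue, in dollars
expected_gain = 0.10 * annual_revenue         # hypothesized 10% increase from the 100x scale-up
weekly_cost_of_delay = expected_gain / 52.18  # average number of weeks in a year

print(f"Expected gain: ${expected_gain / 1e9:.1f}B per year")
print(f"Foregone revenue per week of delay: ~${weekly_cost_of_delay / 1e6:.0f}M")
```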
But how will they detect the deceptive behavior itself? Will they be on the lookout for deceptive behavior and use clever techniques to detect it?
Either they’ll be on the lookout, or they’ll have some (correct) reason to expect that deception won’t happen.
what made Facebook transition into a company that takes AI safety seriously?
AI capabilities have progressed to the point where researchers think it is plausible that AI systems could actually “intentionally” deceive you. We also have better arguments for risk, and they are being made by more prestigious people.
(I thought I was more optimistic than average on this point, but here it seems like most commenters are more optimistic than I am.)
If you think that Facebook is unlikely to do a 100x scale-up in one go, suppose that their leadership comes to believe that the scale-up would cause their revenue to increase in expectation by 10%.
You’d have to be really confident in this not to first do a 10x scale-up (which would be 10x less costly) to see what happens there; I’d be surprised if you could find examples of big companies doing this. OpenAI is perhaps the company that most bets on its beliefs, and it still trained a 13B-parameter model before the 175B-parameter GPT-3, even after they had a paper specifically predicting how large language models would scale with more parameters (a rough sketch of that kind of scaling law is below).
Also, I’d be surprised if a 100x scale-up were the difference between “subhuman / can’t be deceptive” and “can cause an existential catastrophe”.
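For reference, here is a rough sketch of the kind of scaling-law prediction mentioned above. I’m assuming the paper in question is Kaplan et al. (2020), “Scaling Laws for Neural Language Models”; the constants below are approximate values from that paper, and the only point is that such a fit lets you extrapolate loss to 175B parameters from much smaller runs, yet OpenAI still trained a 13B model on the way.

```python
# Rough illustration of the parameter-count scaling law from Kaplan et al. (2020),
# "Scaling Laws for Neural Language Models": L(N) = (N_c / N) ** alpha_N.
# The constants are approximate values reported in that paper; treat this as a sketch,
# not a reproduction of anyone's actual planning process.
ALPHA_N = 0.076   # approximate exponent for the parameter-count term
N_C = 8.8e13      # approximate constant, in non-embedding parameters

def predicted_loss(n_params: float) -> float:
    """Predicted cross-entropy loss (nats/token) for a model with n_params non-embedding parameters."""
    return (N_C / n_params) ** ALPHA_N

# Extrapolating from smaller models straight to GPT-3 scale:
for n in [1.3e9, 13e9, 175e9]:
    print(f"{n:.1e} params -> predicted loss ~ {predicted_loss(n):.2f} nats/token")
```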
Thank you for clarifying!