Like, why didn’t the earlier less intelligent versions of the system fail in some non-catastrophic way?
Even if we assume there will be no algorithm-related discontinuity, I think the following are potential reasons:
Detecting deceptive behaviors in complicated environments may be hard. To continue with the Facebook example, suppose that at some point in the future Facebook’s feed-creation agent behaves deceptively in some non-catastrophic way. Suppose it uses some unacceptable technique to increase user engagement (e.g. making users depressed), but it refrains from doing so in situations where it predicts that Facebook engineers would notice. The agent is not that great at being deceptive, though, and it often ends up using the unacceptable technique even when there is actually a high risk of the technique being noticed. Thus, Facebook engineers do notice the unacceptable technique at some point and fix the reward function accordingly (penalizing depressing content or whatever; a minimal sketch of such a patch appears after these two points). But how will they detect the deceptive behavior itself? Will they be on the lookout for deceptive behavior and use clever techniques to detect it? (If so, what made Facebook transition into a company that takes AI safety seriously?)
Huge scale-ups without much intermediate testing. Suppose at some point in the future, Facebook decides to scale up the model and training process of their feed-creation agent by 100x (by assumption, data is not the bottleneck). It seems to me that this new agent may pose an existential risk even conditioned on the previous agent being completely benign. If you think that Facebook is unlikely to do a 100x scale-up in one go, suppose that their leadership comes to believe that the scale-up would cause their revenue to increase in expectation by 10%. That’s ~$7B per year (10% of roughly $70B in annual revenue), so they are probably willing to spend a lot of money on the scale-up. They may also want to complete the scale-up ASAP, because every week of delay “loses” them ~$134M.
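As a quick back-of-the-envelope check of those figures (the ~$70B annual revenue is my assumption; it is roughly what the ~$7B number implies for a 10% increase):

```python
# Back-of-the-envelope check of the figures above. The ~$70B annual revenue
# is an assumption: it is what the ~$7B figure implies for a 10% increase.

annual_revenue = 70e9      # assumed annual revenue, in dollars
expected_increase = 0.10   # leadership's expected revenue gain from the scale-up
weeks_per_year = 365 / 7   # ~52.1

extra_per_year = annual_revenue * expected_increase
extra_per_week = extra_per_year / weeks_per_year

print(f"extra revenue per year: ${extra_per_year / 1e9:.1f}B")    # ~$7.0B
print(f"'lost' per week of delay: ${extra_per_week / 1e6:.0f}M")  # ~$134M
```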
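And here is the minimal sketch of the reward-function patch referenced in the first point. All names and signals are invented for illustration; the point is just that the patch penalizes the symptom the engineers happened to notice, while nothing in it detects or penalizes the deception itself.

```python
# Hypothetical sketch; "engagement_score" and "depression_proxy" are
# invented, illustration-only signals.

def patched_reward(engagement_score: float,
                   depression_proxy: float,
                   penalty_weight: float = 5.0) -> float:
    """Original engagement reward, minus a penalty on a depression proxy.

    This fixes the symptom the engineers happened to notice (depressing
    content), but it does not detect or penalize the deceptive behavior
    itself, i.e. the agent selectively hiding the technique whenever it
    predicts engineers are watching.
    """
    return engagement_score - penalty_weight * depression_proxy
```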
But how will they detect the deceptive behavior itself? Will they be on the lookout for deceptive behavior and use clever techniques to detect it?
Either they’ll be on the lookout, or they’ll have some (correct) reason to expect that deception won’t happen.
what made Facebook transition into a company that takes AI safety seriously?
AI capabilities have progressed to the point where researchers think it is plausible that AI systems could actually “intentionally” deceive you. We also have better arguments for risk, and they are being made by more prestigious people.
(I thought I was more optimistic than average on this point, but here it seems like most commenters are more optimistic than I am.)
If you think that Facebook is unlikely to do a 100x scale-up in one go, suppose that their leadership comes to believe that the scale-up would cause their revenue to increase in expectation by 10%.
You’d have to be really confident in this not to first run a 10x scale-up, at a tenth of the cost, to see what happens; I’d be surprised if you could find examples of big companies skipping that step. OpenAI is perhaps the company that most bets on its beliefs, and it still scaled to 13B parameters before the 175B parameters of GPT-3, even after publishing a paper specifically predicting how large language models scale with more parameters.
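For concreteness, using only the parameter counts mentioned above, even that final jump was roughly an order of magnitude smaller than the hypothetical 100x:

```python
# Scale-up factor implied by the parameter counts mentioned above.
largest_gpt3 = 175e9     # GPT-3's largest model (175B parameters)
previous_model = 13e9    # the 13B-parameter model trained before it

jump = largest_gpt3 / previous_model
print(f"13B -> 175B is a ~{jump:.0f}x jump, versus the hypothetical 100x")  # ~13x
```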
Also, I’d be surprised if a 100x scale-up turned out to be the difference between “subhuman / can’t be deceptive” and “can cause an existential catastrophe”.
Thank you for clarifying!