For the Alignment Newsletter:
Summary:
An early argument for specialized AI safety work is that misaligned systems will be incentivized to lie about their intentions while they are weak, so that they are not modified, and will turn dangerous only once they are safe from modification. Ben Goertzel found this argument implausible, pointing out that weak systems won’t be good at deception. This post argues that weak systems can nonetheless be manipulative, and gives a concrete example: a machine learning system trained to maximize the number of articles in users’ newsfeeds that they label “unbiased”. One strategy such a system can stumble into is seeding the feed with a few very biased articles, which shifts the users’ reference point for evaluation so that they label everything else unbiased. The system is therefore incentivized to be dishonest without necessarily being capable of outright deception.
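To make the mechanism concrete, here is a minimal simulation sketch of my own (not from the post): a hypothetical user whose tolerance for bias drifts toward the most extreme article they have seen, so that a feed seeded with a couple of extreme articles collects more “unbiased” labels than an honest feed. The user model and all numbers are assumptions chosen purely for illustration.

```python
import random

def user_labels(feed):
    """Hypothetical user model: label an article 'unbiased' if its bias is
    small relative to a reference point that drifts toward the most extreme
    bias seen so far in the feed."""
    reference = 0.3  # assumed initial tolerance for bias
    labels = []
    for bias in feed:
        labels.append(abs(bias) <= reference)
        # Extreme articles shift the reference point, making ordinary
        # articles look unbiased by comparison.
        reference = max(reference, 0.5 * abs(bias))
    return labels

random.seed(0)
ordinary_feed = [random.uniform(-0.6, 0.6) for _ in range(20)]
seeded_feed = [2.0, -2.0] + ordinary_feed  # two extremely biased "anchors"

print("'unbiased' labels, ordinary feed:", sum(user_labels(ordinary_feed)))
print("'unbiased' labels, seeded feed:  ", sum(user_labels(seeded_feed)))
```

Under a reward equal to the count of “unbiased” labels, the seeded feed dominates even though the system has no model of the user’s beliefs, which is the sense in which the incentive to manipulate can arise without any capability for deliberate deception.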
Opinion:
While I appreciate and agree with the thesis of this post, namely that machine learning models don’t have to be extremely competent to be manipulative, I would still prefer a different example for convincing skeptical researchers. I suspect many people would reply that we could easily patch this particular issue without doing dedicated safety work. In particular, it is hard to see how the seeding strategy arises if we train the system with supervised learning to predict users’ labels, rather than training it (via RL) to maximize the number of articles users label unbiased.
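As a rough sketch of this distinction (my construction, reusing the assumed adaptive user from the example above): the seeding strategy is only expressible, and only rewarded, when the system chooses the feed and is scored on the labels it collects over the whole episode; a supervised label predictor never chooses what the user sees.

```python
def label_feed(feed):
    """Assumed user (same anchoring idea as above): tolerance for bias
    adapts to the most extreme article seen so far."""
    tolerance, labels = 0.3, []
    for bias in feed:
        labels.append(abs(bias) <= tolerance)
        tolerance = max(tolerance, 0.5 * abs(bias))
    return labels

candidate_pool = [0.2, -0.4, 0.5, -0.1, 0.6, -0.3]

def honest_policy(pool):
    return list(pool)

def seeding_policy(pool):
    return [2.0, -2.0] + list(pool)  # open with two extreme anchors

# RL framing: the policy picks the feed and the reward is the number of
# 'unbiased' labels over the episode, so the seeding policy is reinforced.
for name, policy in [("honest", honest_policy), ("seeding", seeding_policy)]:
    print(name, "episode reward:", sum(label_feed(policy(candidate_pool))))

# Supervised framing: the model only predicts the label of each article in a
# fixed dataset; which articles the user sees is not under its control, so
# the seeding strategy is never even a candidate behavior.
dataset = list(zip(candidate_pool, label_feed(candidate_pool)))
print("supervised training pairs:", dataset)
```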