More generally, the newsfeed example is one way to illustrate a larger point, which is that by default, training an ML system to perform tasks involving humans will incentivize the system to manipulate those humans. This problem shows up regardless of whether the person doing the training actually wants to manipulate people, which makes it a separate issue from the fact that certain organizations engage in manipulation.
This is surprising. Suppose I have a training set of articles which are labeled “biased” or “unbiased”. I then train a system on this set, and later use it to label new articles “biased” or “unbiased”. Will this lead to a manipulative system? I would be greatly surprised to find that a neural net trained to recognize “cats” and “dogs” in the same manner (with labeled photos in place of labeled articles in the training set) manipulated people into agreeing with its future labels of “dog” and “cat”.
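For concreteness, here is a minimal sketch of the offline setup I have in mind (the article texts, labels, and the scikit-learn model choice are all placeholders I invented for illustration): the classifier is fit once on a fixed labeled set, and at deployment its outputs never feed back into its parameters.

```python
# Minimal sketch of the offline setup: fit a fixed classifier on labeled
# articles, then use it to label a new one. The corpus below is a toy
# placeholder; a real experiment would load thousands of labeled articles.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

articles = [
    "The senator's plan is a reckless giveaway to cronies.",
    "The committee approved the budget by a vote of 7 to 2.",
    "Only a fool would believe the mayor's latest excuse.",
    "The report lists rainfall totals for each county.",
]
labels = ["biased", "unbiased", "biased", "unbiased"]

vectorizer = TfidfVectorizer()
classifier = LogisticRegression()

# Offline training: the model only ever sees the fixed articles and fixed labels.
classifier.fit(vectorizer.fit_transform(articles), labels)

# Deployment: the trained model labels a new article. Nothing it outputs here
# feeds back into its parameters, so there is no channel through which it
# could learn to influence the people reading its labels.
new_article = ["Critics called the decision shameful and corrupt."]
print(classifier.predict(vectorizer.transform(new_article)))
```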
Suppose I have a training set of articles which are labeled “biased” or “unbiased”. I then train a system (using this set), and later use it to label articles “biased” or “unbiased”. Will this lead to a manipulative system?
Mostly I would expect such a system to overfit on the training data and perform no better than chance when tested. The reason is that, unlike your example, where cats and dogs are (fairly) natural categories with simple distinguishing characteristics, the perception of “bias” in news articles is fundamentally tied to human psychology, and as a result is a much more complicated concept to learn than catness versus dogness. By default I would expect an offline training method to completely fail at learning said concept.
A system trained with reinforcement learning, meanwhile, I do expect to become manipulative. In a certain sense you can view this as a form of overfitting as well, except that the system learns to exploit peculiarities of the humans performing the classification, rather than merely peculiarities of the articles in its training data. (As you might imagine, the former is far more dangerous.)
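To make the structural difference concrete, here is a toy sketch. Everything in it is an invented assumption (the habituating “rater” model, the engagement-style reward, the specific numbers), not anything established in this thread: an epsilon-greedy bandit whose reward is computed by a simulated rater that habituates to whatever it is shown. Because the reward channel passes through the rater, reward can be earned by shifting the rater’s standard, which has no analogue in the offline setup above.

```python
# Toy sketch of an online RL-style loop where the reward is computed by a
# rater whose judgements are themselves shifted by what the system shows.
# All modelling choices here are invented for illustration.
import random

random.seed(0)

class SimulatedRater:
    """Stand-in for a human labeller who habituates to what they are shown."""
    def __init__(self, tolerance=0.3):
        self.tolerance = tolerance          # max slant currently judged "unbiased"

    def rate(self, slant):
        approved = slant <= self.tolerance + 0.1   # a little headroom
        # Exposure shifts the rater's standard toward what was shown (toy habituation).
        self.tolerance = 0.5 * self.tolerance + 0.5 * slant
        return approved

actions = [+0.02, 0.0, -0.02]        # nudge the slant of the shown item up / down
values = {a: 0.0 for a in actions}   # recency-weighted value estimate per action
rater = SimulatedRater()
slant = 0.2

for step in range(2000):
    # Epsilon-greedy choice of how to adjust the slant.
    a = random.choice(actions) if random.random() < 0.1 else max(actions, key=values.get)
    slant = min(1.0, max(0.0, slant + a))
    # Reward: an engagement-style proxy (more slant = more reward), gated by approval.
    reward = slant if rater.rate(slant) else 0.0
    values[a] += 0.1 * (reward - values[a])

# The loop ratchets the shown slant upward and the rater's tolerance follows:
# reward is obtained partly by shifting the rater's standard. In the offline
# setup, by contrast, the labels are fixed before training and cannot drift.
print(f"final slant shown: {slant:.2f}, rater tolerance: {rater.tolerance:.2f}")
```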
I’m confused about why reinforcement learning would be well suited to this task if supervised learning fails at it entirely.