Typical unbiased newsfeeds in the real world are created by organizations with bias who have an interest in spreading biased news.
I think the word “unbiased” there may be a typo; your statement would make a lot more sense if the word you meant to put there was actually “biased”. Assuming it’s just a typo:
You’re correct that in the real world, most sources of biased news are that way because they are deliberately engineered to be so, and not because of problems with AI optimizing proxy goals. That being said, it’s important to point out that even if there existed a hypothetical organization with the goal of combating bias in news articles, they wouldn’t be able to do so by training a machine learning system, since (as the article described) most attempts to do so end up falling prey to various forms of Goodhart’s Law. So in a certain sense, the intentions of the underlying organization are irrelevant, because they will encounter this problem regardless of whether they care about being unbiased.
More generally, the newsfeed example is one way to illustrate a larger point, which is that by default, training an ML system to perform tasks involving humans will incentivize the system to manipulate those humans. This problem shows up regardless of whether the person doing the training actually wants to manipulate people, which makes it a separate issue from the fact that certain organizations engage in manipulation.
(Also, it’s worth noting that even if you do want to manipulate people, generally you want to manipulate them toward some specific end. A poorly trained AI system, on the other hand, might end up manipulating them in essentially arbitrary ways that have nothing to do with your goal. In other words, even if you want to use AI “for evil”, you still need to figure out how to make it do what you want it to do.)
This is the essence of the alignment problem in a nutshell, and it’s why I asked whether you had any alternative training procedures in mind.
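To make the Goodhart’s Law point above a bit more concrete, here is a toy numerical sketch (all numbers made up, with “proxy” standing in for whatever metric a trained system actually ends up optimizing): a proxy that correlates well with the thing you care about on average can still come apart from it badly once you select hard on the proxy.

```python
# Toy illustration of the Goodhart's Law failure mode (all values made up).
# "true_quality" is the thing we actually care about; "proxy_score" is the
# correlated-but-imperfect quantity a trained system ends up optimizing.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
true_quality = rng.normal(size=n)
proxy_score = true_quality + rng.normal(size=n)   # proxy = truth + noise

# On average the proxy tracks the truth reasonably well...
print("correlation:", np.corrcoef(true_quality, proxy_score)[0, 1])

# ...but the items that *maximize* the proxy are mostly the ones where the
# noise happened to be large, so selecting hard on the proxy over-delivers
# proxy and under-delivers the thing we actually wanted.
top = np.argsort(proxy_score)[-100:]
print("mean proxy score of selected items:", proxy_score[top].mean())
print("mean true quality of selected items:", true_quality[top].mean())
```

The items the selection picks out are precisely the ones where the proxy overestimates the true quality, which is a mild version of the failure the article describes.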
More generally, the newsfeed example is one way to illustrate a larger point, which is that by default, training an ML system to perform tasks involving humans will incentivize the system to manipulate those humans. This problem shows up regardless of whether the person doing the training actually wants to manipulate people, which makes it a separate issue from the fact that certain organizations engage in manipulation.
This is surprising. Suppose I have a training set of articles which are labeled “biased” or “unbiased”. I then train a system (using this set), and later use it to label articles “biased” or “unbiased”. Will this lead to a manipulative system? I would be greatly surprised to find that a neural net trained to recognize “cats” and “dogs” in such a manner (with labeled photos in place of labeled articles in the training set) ended up manipulating people to agree with its future labels of “dog” and “cat”.
Suppose I have a training set of articles which are labeled “biased” or “unbiased”. I then train a system (using this set), and later use it to label articles “biased” or “unbiased”. Will this lead to a manipulative system?
Mostly I would expect such a system to overfit on the training data, and perform no better than chance when tested. The reason for this is that unlike your example, where cats and dogs are (fairly) natural categories with simple distinguishing characteristics, the perception of “bias” in news articles is fundamentally tied to human psychology, and as a result is a much more complicated concept to learn than catness versus dogness. By default I would expect an offline training method to completely fail at learning said concept.
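For concreteness, the purely offline setup you’re describing looks roughly like this (a minimal sketch, assuming a scikit-learn-style bag-of-words classifier; the articles and labels are made-up placeholders and the details don’t matter):

```python
# Minimal sketch of the offline setup described above: a fixed, pre-collected
# set of labeled articles, one fit, then predictions on new articles.
# The articles and labels here are made-up placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

articles = [
    "Senator slams reckless opposition scheme in furious tirade",
    "Committee publishes quarterly budget figures for review",
    "Shocking betrayal by out-of-touch elites stuns honest voters",
    "Council approves routine road maintenance schedule",
]
labels = ["biased", "unbiased", "biased", "unbiased"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(articles, labels)   # one-shot fit on the fixed dataset

# The trained model is then applied to new articles; nothing it outputs ever
# feeds back to the humans who produced the training labels.
print(clf.predict(["City reports record turnout at annual fair"]))
```

There is simply no channel here from the system’s outputs back to the humans, so the only failure mode available to it is generalizing badly.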
Reinforcement learning, meanwhile, will indeed become manipulative (in my expectation). In a certain sense you can view this as a form of overfitting as well, except that the system learns to exploit peculiarities of the humans performing the classification, rather than simply peculiarities of the articles in its training data. (As you might imagine, the former is far more dangerous.)
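Schematically (and with everything about the “human” made up purely for illustration), the difference in the online case is that the system’s past outputs are part of what determines the rater’s next reaction, so strategies that change the rater are inside the space being optimized over:

```python
# Schematic contrast with the offline setup above. Everything about the
# "human" here is a made-up stand-in; the only point is the shape of the loop:
# the system's past outputs change the rater's state, and reward is read off
# that changed state.
import random

def human_reaction(shown, article):
    """Hypothetical rater whose judgment is shaped by prior exposure: the more
    often they have already seen an article, the more likely they are to
    approve of it this time (a crude mere-exposure effect)."""
    familiarity = shown.count(article) / (len(shown) + 1)
    return random.random() < 0.4 + 0.5 * familiarity

candidates = [f"article_{i}" for i in range(10)]
scores = {a: 0.0 for a in candidates}   # running reward estimate per article
shown = []

for step in range(2000):
    # epsilon-greedy choice over the candidate articles
    if random.random() < 0.1:
        choice = random.choice(candidates)
    else:
        choice = max(candidates, key=scores.get)
    reward = 1.0 if human_reaction(shown, choice) else 0.0
    shown.append(choice)
    scores[choice] += 0.05 * (reward - scores[choice])   # incremental update

# The loop ends up favoring whichever article it showed most often, because
# showing it changed the rater, not because the rater judged it on its merits.
print(max(scores, key=scores.get))
```

In this toy version the loop locks onto whichever article it happened to show most, because showing it changed the rater; a real system exploiting real human psychology would be a far more worrying version of the same dynamic.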
I’m confused why reinforcement learning would be well suited for the task, if it doesn’t work at all in the supervised learning case.
I meant “unbiased” in scare quotes: typical newsfeeds that are claimed to be unbiased in the real world (but actually may not be).
You’re saying that on priors, the humans are manipulative?