The truly insidious effects are when the content of the stories changes the reward but not by going through the standard quality-evaluation function.
For instance, maybe the AI figures out that the order of the stories affects the rewards. Or perhaps it finds how stories that create a climate of joy/fear on campus lead to overall higher/lower evaluations for that period. Then the AI may be motivated to “take a hit” to push through some fear mongering so as to raise its evaluations for the following period. Perhaps it finds that causing strife in the student union, or perhaps causing racial conflict, or causing trouble with the university faculty affects its rewards one way or another. Perhaps if it’s unhappy with a certain editor, it can slip through bad enough errors to get the editor fired, hopefully replaced with a more rewarding editor.
The problem with these particular extensions is that they don’t sound plausible for this type of AI. In my opinion it would be easier when talking with designers to switch from this example to a slightly more sci-fi example.
The leap is between the obvious “it’s ‘manipulating’ its editors by recognizing simple patterns in their behavior” to “it’s manipulating its editors by correctly interpreting the causes underlying their behavior.”
Much easier to extend in the other direction first: “Now imagine that it’s not an article-writer, but a science officer aboard the commercial spacecraft Nostromo...”
The truly insidious effects are when the content of the stories changes the reward but not by going through the standard quality-evaluation function.
For instance, maybe the AI figures out that the order of the stories affects the rewards. Or perhaps it finds how stories that create a climate of joy/fear on campus lead to overall higher/lower evaluations for that period. Then the AI may be motivated to “take a hit” to push through some fear mongering so as to raise its evaluations for the following period. Perhaps it finds that causing strife in the student union, or perhaps causing racial conflict, or causing trouble with the university faculty affects its rewards one way or another. Perhaps if it’s unhappy with a certain editor, it can slip through bad enough errors to get the editor fired, hopefully replaced with a more rewarding editor.
etc etc.
The problem with these particular extensions is that they don’t sound plausible for this type of AI. In my opinion it would be easier when talking with designers to switch from this example to a slightly more sci-fi example.
The leap is between the obvious “it’s ‘manipulating’ its editors by recognizing simple patterns in their behavior” to “it’s manipulating its editors by correctly interpreting the causes underlying their behavior.”
Much easier to extend in the other direction first: “Now imagine that it’s not an article-writer, but a science officer aboard the commercial spacecraft Nostromo...”
Upvoted for remembering that Ash was the science officer and not just the movie’s token android.