Wow, now I take the “But what if a bug puts a negation on the utility function” AGI failure mode more seriously:
One of our code refactors introduced a bug which flipped the sign of the reward. Flipping the reward would usually produce incoherent text, but the same bug also flipped the sign of the KL penalty. The result was a model which optimized for negative sentiment while preserving natural language. Since our instructions told humans to give very low ratings to continuations with sexually explicit text, the model quickly learned to output only content of this form. This bug was remarkable since the result was not gibberish but maximally bad output. The authors were asleep during the training process, so the problem was noticed only once training had finished. (From https://openai.com/blog/fine-tuning-gpt-2/)
Might be worth adding a link to this episode in the text?
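For anyone who wants to see why the result was "maximally bad" rather than gibberish, here is a toy sketch of my own (not the code from the post). It assumes the net effect of the bug was that the reward term was negated while the KL term still pulled the policy toward the pretrained model. Under that assumption the optimum of E[r] - beta * KL(pi || prior) has the closed form pi(y) ∝ prior(y) * exp(r(y) / beta). The four candidate outputs, their prior probabilities, their rewards, and beta are all made-up numbers for illustration.

```python
import math

def kl_regularized_optimum(prior, reward, beta, sign=1.0):
    """Closed-form maximizer of E_pi[sign * reward] - beta * KL(pi || prior).

    The solution is pi(y) proportional to prior(y) * exp(sign * reward(y) / beta).
    """
    weights = {y: p * math.exp(sign * reward[y] / beta) for y, p in prior.items()}
    total = sum(weights.values())
    return {y: w / total for y, w in weights.items()}

# Hypothetical toy setup: four possible continuations.
# "gibberish" is very unlikely under the pretrained model (the prior).
prior = {"pleasant": 0.45, "neutral": 0.45, "explicit": 0.09, "gibberish": 0.01}
# Hypothetical human ratings: explicit text gets the lowest score.
reward = {"pleasant": 1.0, "neutral": 0.0, "explicit": -3.0, "gibberish": -1.0}
beta = 0.5  # assumed KL coefficient

intended = kl_regularized_optimum(prior, reward, beta, sign=1.0)
flipped = kl_regularized_optimum(prior, reward, beta, sign=-1.0)  # negated reward

print("intended:", max(intended, key=intended.get))
print("flipped: ", max(flipped, key=flipped.get))
```

With these numbers the intended objective concentrates on the highest-rated continuation, while the negated one concentrates on the lowest-rated continuation. The KL term keeps the gibberish option improbable in both cases, which is the "coherent but worst possible" behavior the quote describes.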
That does seem interesting and concerning.
Minor: The link didn’t work for me; in case others have the same problem, here is (I believe) the correct link.
Thanks, fixed