Daniel Kokotajlo comments on Thoughts on Human Models

Daniel Kokotajlo 3 Oct 2019 13:11 UTC
LW: 31 AF: 12
Wow, now I take the “But what if a bug puts a negation on the utility function” AGI failure mode more seriously:
"One of our code refactors introduced a bug which flipped the sign of the reward. Flipping the reward would usually produce incoherent text, but the same bug also flipped the sign of the KL penalty. The result was a model which optimized for negative sentiment while preserving natural language. Since our instructions told humans to give very low ratings to continuations with sexually explicit text, the model quickly learned to output only content of this form. This bug was remarkable since the result was not gibberish but maximally bad output. The authors were asleep during the training process, so the problem was noticed only once training had finished." (From https://openai.com/blog/fine-tuning-gpt-2/)
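To make the failure mode concrete, here is a toy sketch (not OpenAI's code, and every name and number in it is made up). It assumes one reading of the quote: that the net effect of the bug was to negate the reward term while the objective still penalized divergence from the base language model. Under that assumption, the maximizer lands on fluent text that raters scored lowest, rather than on gibberish.

```python
# Toy candidates: (human rating, divergence from the base language model).
# All values are invented for illustration only.
CANDIDATES = {
    "pleasant, fluent text": (0.9, 0.1),
    "neutral, fluent text": (0.5, 0.05),
    "fluent text raters scored very low": (0.0, 0.1),
    "gibberish": (0.1, 5.0),
}

BETA = 1.0  # hypothetical weight on the divergence penalty


def objective(rating, divergence, reward_sign=1.0):
    """Reward minus a divergence penalty; reward_sign=-1.0 models the assumed bug."""
    return reward_sign * rating - BETA * divergence


def best_candidate(reward_sign=1.0):
    """Return the candidate that maximizes the (possibly sign-flipped) objective."""
    return max(
        CANDIDATES,
        key=lambda name: objective(*CANDIDATES[name], reward_sign=reward_sign),
    )


if __name__ == "__main__":
    print("intended objective picks:", best_candidate(1.0))
    print("sign-flipped reward picks:", best_candidate(-1.0))
```

In this toy setup the divergence penalty is what keeps "gibberish" from winning in either case, so negating only the reward term turns the optimizer toward the lowest-rated fluent output.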
Might be worth adding a link to this episode in the text?
What links here?
- The Main Sources of AI Risk? by Daniel Kokotajlo (21 Mar 2019 18:28 UTC; 121 points)
- MichaelA 26 Sep 2020 14:33 UTC
  3 points
  That does seem interesting and concerning.
  
  Minor: The link didn’t work for me; in case others have the same problem, here is (I believe) the correct link.
  - Daniel Kokotajlo 26 Sep 2020 15:12 UTC
    3 points
    Thanks, fixed