Did the model randomly stumble upon this strategy? Or was there an idea pitched by the language model, something like “hey, what if we try to hallucinate and maybe we can hack the game that way”?
Based on my understanding from talking with the author, it's the former. The language model is simply used to provide a shaping reward based on the text the game shows after certain actions; it's the RL optimization that discovers the weird hallucination strategy, and it's able to do so because the shaping reward improves the agent's capabilities in general.
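To make the setup concrete, here is a minimal sketch of how an LLM-based shaping reward like the one described might be wired into an environment step. This is not the author's code; `score_text_with_llm`, the `shaping_weight`, and the `info["text"]` field are all hypothetical placeholders for whatever the actual system uses.

```python
def score_text_with_llm(text: str) -> float:
    """Placeholder: ask a language model how promising this game text looks,
    returning a scalar (e.g. in [0, 1]). The real scoring scheme is unspecified here."""
    raise NotImplementedError

def step_with_shaping(env, action, shaping_weight: float = 0.1):
    """Take one environment step and add an LLM-based shaping term,
    computed from the text the game prints, to the game's own reward."""
    obs, game_reward, done, info = env.step(action)
    text_output = info.get("text", "")  # text the game shows after the action
    shaping = shaping_weight * score_text_with_llm(text_output)
    return obs, game_reward + shaping, done, info
```

The key point is that the language model only shapes the scalar reward; the policy being optimized by RL never "hears" any suggestion from it, so strategies like the hallucination exploit emerge from the optimization itself.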