Based on my understanding from talking with the author, it is the former. The language model is simply used to provide a shaping reward based on the text outputs the game shows after some actions; it's the RL optimization that discovers the weird hallucination strategy, and it is able to do so because the shaping reward improves the agent's capabilities in general.
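To make the mechanism concrete, here is a minimal sketch of that kind of shaping setup. Everything here is hypothetical illustration, not the paper's actual code: `lm_score` is a stand-in for querying a real language model, and `beta` is an assumed shaping coefficient.

```python
def lm_score(message: str) -> float:
    """Stand-in for a language model rating how promising a game
    message sounds. A real setup would query an actual LM; this
    keyword table is purely illustrative."""
    keywords = {"You find": 1.0, "You feel": 0.5, "You die": -1.0}
    return sum(v for k, v in keywords.items() if k in message)

def shaped_reward(env_reward: float, message: str, beta: float = 0.1) -> float:
    """The RL policy is optimized against this sum, so the agent can
    learn to seek states whose messages the LM rates highly (e.g. via
    hallucination) rather than genuine progress."""
    return env_reward + beta * lm_score(message)

print(shaped_reward(0.0, "You find a hidden passage."))  # → 0.1
```

The key point the comment makes is visible here: the LM never plans anything; it only scores text, and the optimizer does the rest.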
- niki.h · Nov 2, 2023, 5:29 PM · 1 point · in reply to: philip_b's comment on: Wireheading and misalignment by composition on NetHack
Based on personal experience, you are definitely not the only one thinking about that statement.