niki.h comments on Wireheading and misalignment by composition on NetHack

niki.h 2 Nov 2023 17:29 UTC
Based on my understanding from talking with the author, it is the former. The language model is used only to provide a shaping reward, based on the text the game displays after some actions. It is the RL optimization that learns the weird hallucination strategy, and it is able to do so because the shaping reward improves its capabilities in general.
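
To make that division of labour concrete, here is a minimal sketch of a text-based shaping reward. It is an illustration under my own assumptions, not the author's implementation: `shaped_reward`, `toy_score`, and `weight` are hypothetical names, and `toy_score` is a trivial keyword stand-in for whatever score is actually derived from the language model.

```python
from typing import Callable


def toy_score(message: str) -> float:
    """Hypothetical stand-in for a language-model-derived score of a game message."""
    # Assumption: a real system would score the message with a learned model;
    # here a simple keyword check plays that role.
    return 1.0 if "You kill" in message else 0.0


def shaped_reward(
    env_reward: float,
    message: str,
    score_message: Callable[[str], float] = toy_score,
    weight: float = 0.1,
) -> float:
    """Add a text-based shaping term to the reward the game itself gives.

    The scorer only looks at the message text. It never chooses actions;
    the RL agent is what optimizes the combined signal.
    """
    return env_reward + weight * score_message(message)
```

The point of the sketch is that the scorer never acts. Any unexpected strategy has to come from the policy that is trained against the combined reward.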