Did the model randomly stumble upon this strategy? Or was there an idea pitched by the language model, something like “hey, what if we try to hallucinate and maybe we can hack the game that way”?
Based on my understanding from talking with the author, it's the former. The language model is simply used to provide a shaping reward based on the text the game shows after certain actions; it's the RL optimization that discovers the weird hallucination strategy, and it's able to do so because the shaping reward improves the agent's capabilities in general.
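To make the setup concrete, here is a minimal sketch of how an LLM-based shaping reward like the one described might be wired into an environment step. This is not the author's code; `score_text_with_llm`, the `shaping_weight`, and the `info["text"]` field are all hypothetical placeholders for whatever the actual system uses.

```python
def score_text_with_llm(text: str) -> float:
    """Placeholder: ask a language model how promising this game text looks,
    returning a scalar (e.g. in [0, 1]). The real scoring scheme is unspecified here."""
    raise NotImplementedError

def step_with_shaping(env, action, shaping_weight: float = 0.1):
    """Take one environment step and add an LLM-based shaping term,
    computed from the text the game prints, to the game's own reward."""
    obs, game_reward, done, info = env.step(action)
    text_output = info.get("text", "")  # text the game shows after the action
    shaping = shaping_weight * score_text_with_llm(text_output)
    return obs, game_reward + shaping, done, info
```

The key point is that the language model only shapes the scalar reward; the policy being optimized by RL never "hears" any suggestion from it, so strategies like the hallucination exploit emerge from the optimization itself.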