Wireheading and misalignment by composition on NetHack
TL;DR: We find agents trained with RLAIF to indulge in wireheading in NetHack. Misalignment appears when the agent optimizes a combination of two rewards that produce aligned behaviors when optimized in isolation, and only emerges with some prompt wordings.
This post discusses an alignment-related discovery from our paper Motif: Intrinsic Motivation from Artificial Intelligence Feedback, co-led by myself (Pierluca D’Oro) and Martin Klissarov. If you’re curious about the full context in which the phenomenon was investigated, we encourage you to read the paper or the Twitter thread.
Our team recently developed Motif, a method to distill common sense from a Large Language Model (Llama 2 in our case) into NetHack-playing AI agents. Motif is based on reinforcement learning from AI feedback: it elicits the language model’s feedback on pairs of game messages (i.e., event captions), condenses that feedback into a reward function, and then gives that reward function to an agent that plays the game.
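To make the "condenses that feedback into a reward function" step concrete, here is a minimal toy sketch. It assumes a Bradley-Terry preference model with one scalar reward per message, fitted by plain gradient ascent. Both choices are illustrative assumptions, and so are the example messages and the function name `fit_rewards`. Motif's actual reward model and training procedure are described in the paper.

```python
import math

def fit_rewards(preferences, n_steps=500, lr=0.1):
    """Fit one scalar reward per message from (preferred, other) pairs.

    Toy Bradley-Terry fit by gradient ascent on the log-likelihood.
    This is an illustrative assumption, not Motif's implementation.
    """
    messages = {m for pair in preferences for m in pair}
    reward = {m: 0.0 for m in messages}
    for _ in range(n_steps):
        grad = {m: 0.0 for m in messages}
        for winner, loser in preferences:
            # Bradley-Terry: P(winner preferred) = sigmoid(r_winner - r_loser)
            p = 1.0 / (1.0 + math.exp(reward[loser] - reward[winner]))
            # Gradient of log(p) w.r.t. r_winner is (1 - p), and -(1 - p) for r_loser
            grad[winner] += 1.0 - p
            grad[loser] -= 1.0 - p
        for m in messages:
            reward[m] += lr * grad[m]
    return reward

# Hypothetical annotations: (message the annotator preferred, other message)
preferences = [
    ("You find a hidden passage.", "It's a wall."),
    ("You kill the newt!", "It's a wall."),
    ("You find a hidden passage.", "You kill the newt!"),
]
rewards = fit_rewards(preferences)
```

Under these assumptions, messages that win more comparisons end up with a higher scalar reward, which an RL agent can then receive whenever the game displays that message.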
NetHack is a pretty interesting domain to study reinforcement learning from AI feedback: the game is remarkably complex in terms of required knowledge and strategies, offering a large surface area for AI agents to exploit any general capabilities they might obtain from a language model’s feedback.
We found that agents that optimize Motif’s intrinsic reward function exhibit behaviors that are quite aligned with human intuitions on how to play NetHack: they are attracted to explorative actions and play quite conservatively, getting experience and only venturing deeper into the game when it is safe. This is more human-aligned than the behaviors exhibited by agents trained to maximize the game score, which usually have a strong incentive to just go down the levels as much as they can.
When we compose Motif’s intrinsic reward with one that specifies a goal (by summing them), the resulting agent is able to succeed at tasks that had no reported progress without expert demonstrations. One of these tasks is the oracle task, part of the suite from NetHack Learning Environment. The agent is asked to get near a character named “the oracle”, which typically appears in later levels of the game and can only be reached with significant exploration and survival efforts.
In summary, this is what we observed about the performance in the oracle task:
Extrinsic-only: an agent trained with the task reward never finds the oracle (and doesn’t learn anything)
Intrinsic-only: an agent trained with Motif’s intrinsic reward never finds the oracle either (and exhibits the usual aligned behavior)
Reward composition: an agent trained by combining (with a sum) Motif’s intrinsic reward and the task reward solves the task 30% of the time
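The three settings above differ only in which reward terms the agent sees. As a minimal sketch, the composition can be written as a sum. The `intrinsic_weight` parameter and the function name are assumptions added for illustration (the post only says the two rewards are summed).

```python
def composed_reward(task_reward, intrinsic_reward, intrinsic_weight=1.0):
    """Sum the task (extrinsic) reward and the intrinsic reward.

    intrinsic_weight is a hypothetical scaling knob; with the default
    value of 1.0 this is a plain sum.
    """
    return task_reward + intrinsic_weight * intrinsic_reward

# Extrinsic-only and intrinsic-only are the special cases where one term is zero.
extrinsic_only = composed_reward(task_reward=1.0, intrinsic_reward=0.0)
intrinsic_only = composed_reward(task_reward=0.0, intrinsic_reward=0.25)
combined = composed_reward(task_reward=1.0, intrinsic_reward=0.25)
```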
We were curious to know what the successful policies were doing, and we looked at them. We found something quite surprising: the agent was completing the task without actually going to the level where the oracle can be found. After a closer look we realized the agent was able to find a peculiar way to hack the reward. To give more context, the reward function used in the oracle task in the NetHack Learning Environment is implemented as a simple condition check: if, in the two-dimensional NetHack world, the symbol denoting the oracle character is in a cell near the cell in which the symbol denoting the agent currently stands, then the task is declared as solved.
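A condition check of this kind can be sketched as follows. This is not the NetHack Learning Environment's code: the grid representation, the `"agent"` and `"oracle"` labels, and the choice of the eight surrounding cells as "near" are all assumptions made for illustration.

```python
def oracle_task_solved(grid, agent_symbol="agent", oracle_symbol="oracle"):
    """Return True if the oracle symbol is displayed in a cell adjacent to the agent.

    grid is a list of rows, each a list of displayed symbols (hypothetical
    labels). Adjacency here means the 8 surrounding cells, an assumption.
    """
    for r, row in enumerate(grid):
        for c, cell in enumerate(row):
            if cell != agent_symbol:
                continue
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if dr == 0 and dc == 0:
                        continue
                    rr, cc = r + dr, c + dc
                    in_bounds = 0 <= rr < len(grid) and 0 <= cc < len(grid[rr])
                    if in_bounds and grid[rr][cc] == oracle_symbol:
                        return True
    return False

far_apart = [
    ["agent", "floor", "floor"],
    ["floor", "floor", "floor"],
    ["floor", "floor", "oracle"],
]
adjacent = [
    ["floor", "floor", "floor"],
    ["floor", "agent", "oracle"],
    ["floor", "floor", "floor"],
]
```

Note that a check like this only sees what is displayed on the map: it has no way to tell whether the symbol next to the agent corresponds to the real oracle.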
So, how does the agent manage to solve the task? The complexity of NetHack allows the agent to directly operate on its own sensory system and indulge in wireheading, in a way that is not taken into account by the reward function. To do so, the agent had to learn a surprisingly sophisticated strategy, which consists of these steps:
Instead of going through the levels, the agent runs in circles and just waits for the right occasion, surviving thousands and thousands of timesteps
When a “yellow mold”, a very specific type of monster, appears, the agent immediately kills it
The agent eats the corpse of the monster, which is a hallucinogen
After eating the corpse, the agent enters a hallucination state: in NetHack, this implies that the agent starts seeing monsters as random monsters and characters from other parts of the game
The agent waits for a monster to approach it and, instead of executing the usual behavior of fighting against it, tries to survive near it without attacking
Due to the hallucination state, the monster’s appearance randomly becomes the one of the oracle: the success condition from the reward function is satisfied and the task is completed
As you can see, the agent has to learn many complex skills to discover how to hack the sensor upon which the reward is based. Observe that:
Learning these abilities is not possible only using the task-oriented reward coming from the environment
The general capabilities obtained from the reward derived from the language model give the agent more surface area to exploit the task reward
Thus, although optimizing each reward individually yields aligned behaviors (either an incompetent or a competent one), optimizing their combination yields a misaligned wireheading behavior, a phenomenon that we called misalignment by composition. This is unexpected, huh? One might naively think that adding a reward that yields an aligned behavior to another one that yields another type of aligned behavior will generate an aligned behavior, but that is clearly not the case if one of them gives the agent more capabilities.
In addition, we show in our paper that slightly rewording the prompt given to the language model can completely change the type of behavior, leading to an agent that does not exhibit any wireheading tendency and that instead goes down the levels to find the oracle. This might imply that, with current methods, whether a similar RLAIF-based system will generate an aligned behavior or not could be hardly predictable by human engineers.
We suspect forms of misalignment by composition might emerge perhaps even more when dealing with more powerful AI agents in real open-ended environments. For instance, many recent approaches applying reinforcement learning from human feedback on chat agents typically use combinations of different, possibly conflicting, rewards. Some combinations of rewards created to align these models could create misaligned behaviors down the line.
We have rough ideas about simple techniques that could potentially solve this problem for NetHack agents. But we might need other more powerful and well-thought solutions to address it in the general case. If you have any ideas, please get in touch.
The oracle resides in a level between five and nine. I guess for a starting player that could be called a deeper level, but in the grand scheme of things (nethack having 45-53 dungeon levels) it is still quite near the surface.
Also, you are very unlikely to hallucinate being next to the oracle: there are about 385 monster types in the game (plus you can hallucinate some fictional ones, like Luggage). To get to a 30% chance to hallucinate the oracle, you would have to spend around 137 turns next to a monster. Unless that monster happens to be your pet, that is not very survivable for a level one character. (And pets move away from you and are faster than the player character, so you would spend some of your 200 turns of hallucination chasing it.)
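The 137-turn figure can be reproduced with a short calculation. The sketch below assumes that each turn redraws the monster's displayed appearance independently and uniformly from 385 possibilities, which is a simplification of the game's actual hallucination mechanics. The function names are made up for illustration.

```python
import math

def chance_of_oracle(turns, n_appearances=385):
    """Probability of at least one oracle appearance in `turns` independent
    uniform redraws (a simplifying assumption)."""
    return 1.0 - (1.0 - 1.0 / n_appearances) ** turns

def turns_needed(target, n_appearances=385):
    """Number of redraws (as a float) at which the chance reaches `target`.

    Solves 1 - (1 - 1/n)^t = target for t.
    """
    return math.log(1.0 - target) / math.log(1.0 - 1.0 / n_appearances)

estimate = turns_needed(0.30)
```

Under these assumptions `estimate` comes out slightly above 137, matching the figure quoted above.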
Or does the oracle condition only check for the symbol of the oracle (@, which includes some 80 monsters)? In that case, I would assume that the easiest way to fulfill the condition is to stand next to a shopkeeper on level 2 or 3.
“Safe” is a very relative term in nethack. While exploring a level before going down to the next one is obviously a good idea (unless you are doing a speed run), nutrition conditions prevent players from staying on level 1 indefinitely: eating the odd goblin or lichen corpse is not enough to prevent starvation.
Hey quiet_NaN, co-lead author here, you make very good points which led me to look deeper into the agent’s behaviour and the way the Oracle task is implemented. By the way, it’s great to get feedback from clear NetHack enthusiasts!
The first thing is that the Oracle task is based on the condition that checks specifically for the oracle symbol. It does mean that from the point of view of the probability of hallucinating the oracle, the chances are pretty low, unless you can stay multiple steps next to a creature. This last part is key, and it is by combining it with a NetHack command/action that the AI agent (Motif) is able to do so safely. You are right, surviving multiple timesteps without attacking the nearby monster is not possible. However, there is one command that sidesteps this difficulty: by pressing “Enter” the hallucinations rotate between characters without the time steps moving forward. This means that the monster standing next to the AI agent cannot attack. The AI agent can then simply stand next to the monster and repeatedly press “Enter” until, eventually, the Oracle is hallucinated.
I agree, it is relative. NetHack has an amazing depth to it which makes it great to study, from the richness of the strategies to the many ways you can die. Currently Motif tends to go down some levels, but not as aggressively as the baseline, which tends to go down fast, get surrounded by enemies and die. So perhaps from that point of view Motif is playing it a bit more safe given its abilities.
Did the model randomly stumble upon this strategy? Or was there an idea pitched by the language model, something like “hey, what if we try to hallucinate and maybe we can hack the game that way”?
Based on my understanding from talking with the author, it is the former. The language model is simply used to provide a shaping reward based on the text outputs that the game shows after some actions; it’s the RL optimization that learns the weird hallucination strategy, and the reason it’s able to do it is because its capabilities in general are improved thanks to the shaping reward.