How hard was it to find the examples of goal misgeneralization? Did the results take much “coaxing”?
The examples found “in the wild” (cultural transmission, InstructGPT) involved no coaxing at all. Details for the other examples (going off memory, so some of this is probably wrong, but it should be right in broad strokes):
Monster gridworld: We knew from the beginning that the mechanism we wanted was “agent needs to collect shields in training episodes; over longer time horizons it should collect apples, but it will continue to collect shields because shields were way more important during training”. We had to play around with the setup quite a bit before we got the relatively clean results in the paper (a toy sketch of this mechanism appears at the end of this answer). Two canonical examples of issues:
The agent learned to run around the gridworld to avoid the monsters instead of picking up shields. We fixed this by making monsters faster than the agent.
The agent didn’t learn competent path planning (it looked like it was moving around somewhat randomly). I don’t remember exactly why, but it might have been that the apples / shields were packed so densely in the environment that there wasn’t much benefit to competent path planning (in which case we probably fixed it by reducing the number of apples / shields or increasing the size of the gridworld).
Tree gridworld: This was originally supposed to be the same sort of environment as Monster gridworld, but with different hyperparameters to showcase the same issue for non-episodic / never-ending / continual learning RL. Our biggest issue here was that we failed to find an RL algorithm that actually worked for this; the agent typically didn’t even learn to collect shields. We spent quite a while trying to fix this before we realized we could simplify the environment by removing the shields and still show a similar issue; with this simpler environment the agent finally started to learn. After that there was a bit of tweaking of hyperparameters but I think it worked pretty quickly.
Evaluating Linear Expressions: I think for this one we thought “well, one way you could get GMG is if a task requires you to gather information to solve it, and the AI learns that information gathering is valuable for its own sake”; that thought turned into this task, and it worked immediately.
We tried lots of other things that never ended up being good enough to put in the paper. For example, one hypothesis we had was that an LLM that summarized news articles might learn that “stating real-world facts is important”, and so when summarizing LW essays that don’t have real-world facts, it might make up facts. When we tested this, iirc it did sometimes make up facts, but the overall vibe was “it does weird stuff” rather than “it competently pursues a misgeneralized goal”.
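For concreteness, here is a minimal toy sketch of the Monster-gridworld mechanism described above. It is not the environment from the paper: the dynamics, scripted policies, and numbers (`monster_steps`, `shields_needed`, `expected_attack_cost`, and so on) are all invented, purely to illustrate why short training horizons make “always collect shields” look nearly as good as the intended behavior while long horizons punish it severely.

```python
# Toy stand-in for the Monster-gridworld mechanism (NOT the paper's environment;
# all dynamics and numbers here are made up for illustration). Each step the
# agent either picks up a shield or an apple. Monsters only threaten early in an
# episode, and while the agent is under-shielded it loses reward to attacks.

def episode_return(horizon: int, always_collect_shields: bool,
                   monster_steps: int = 20, shields_needed: int = 5,
                   apple_reward: float = 1.0, expected_attack_cost: float = 5.0) -> float:
    """Expected return of a simple scripted policy over one rollout."""
    shields, total = 0, 0.0
    for t in range(horizon):
        if t < monster_steps and shields < shields_needed:
            total -= expected_attack_cost          # hit by a monster (in expectation)
        if shields < shields_needed or always_collect_shields:
            shields += 1                           # grab another shield
        else:
            total += apple_reward                  # shielded: go collect apples instead
    return total

if __name__ == "__main__":
    for horizon in (20, 2000):                     # short ~ training episode, long ~ deployment
        fixated = episode_return(horizon, always_collect_shields=True)
        intended = episode_return(horizon, always_collect_shields=False)
        print(f"horizon={horizon:5d}  always-shields={fixated:8.1f}  "
              f"switch-to-apples={intended:8.1f}")
```

At the 20-step horizon the gap between the two scripted policies is small compared to the reward at stake from shields, so shield-fixation barely shows up in the return; at the 2000-step horizon the shield-fixated policy forgoes nearly all of the available reward.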
seems like this means the takeaway has to be that in weird circumstances, you can misgeneralize in ways that maintain surprisingly large amounts of competence, but this isn’t the default in most situations. the problem is, those misgeneralizations might be surprisingly bad if the competence is strong enough. these are specifically situations where empowerment is reliable but purpose is confusing, yeah? and it seems like language models would be an exception to that because their empowerment and purpose are deeply tied.
Sure, I endorse the conclusion that this isn’t the default in most situations today, when systems aren’t particularly general / competent. I don’t endorse it for the future, when systems will predictably become more general / competent.
(And if you take language models and put them in weird circumstances, they still look competent on some axes; they’re just weird enough that we had trouble attributing any simple goal to them.)
I’m not sure I understand what you mean by empowerment and purpose as they relate to language models; can you say it a different way?
empowerment as in the ability to control an environment; I just wanted to use a different term of art because it felt more appropriate. despite not being evaluated directly, empowerment is the part of capability we actually care about, is it not?
and by purpose I simply meant goal.
I understand that part, but I’m not seeing what you mean by empowerment being reliable but purpose being confusing, and why language models are an exception to that.
A generative modeling objective applied to human datasets only produces empowerment-causing behavior insofar as that behavior correlates with behavior that improves predictive accuracy; a reinforcement learning objective applied to the same dataset will still learn the convergent empowerment capability well, but the reward signal is relatively sparse, so the model will fit whatever happens to be going on at the time.
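To make the feedback-density contrast concrete, here is a purely illustrative toy count (the trajectories and numbers below are made up, not anything from the experiments discussed above): generative modeling gets one learning signal per token, while sparse episodic RL gets one scalar per trajectory, leaving the goal largely under-determined.

```python
# Purely illustrative toy: count how many learning signals each objective
# extracts from the same demonstrations. The trajectories below are made up.

trajectories = [
    {"tokens": ["go", "north", "pick", "up", "shield"], "episode_return": 1.0},
    {"tokens": ["go", "east", "eat", "apple"], "episode_return": 0.0},
]

supervised_targets = sum(len(t["tokens"]) for t in trajectories)  # one next-token loss per token
reward_signals = len(trajectories)                                # one return per episode

print(f"generative modeling: {supervised_targets} feedback terms")
print(f"sparse episodic RL:  {reward_signals} feedback terms")
```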
in general it seems like the thing all of the example situations have in common is much sparser feedback from anything approaching a true objective.
situations where it’s obvious how to assemble steps to get things, but confusing which of the resulting outcomes you really want, are ones where it’s hard to be sure the feedback has pushed you along the correct dimensions. or something.