You are arguing about agents being behaviorally aligned in some way on distribution, not arguing about agents being structured as wrapper-minds
I think I’m doing both. I’m using behavioral arguments as a foundation, because they’re easier to form, and then arguing that specific behaviors at a given level of capabilities can only be caused by some specific internal structures.
You are arguing that the above holds in setups where the only source of parameter updates is episodic reward, not arguing that the above holds in general across autonomous learning setups
Yeah, that’s a legitimate difference from my initial position: I wasn’t considering alternate setups like this when I wrote the post.
The part I think I’m still fuzzy on is why the agent limits out to caring about some correlate(s) of G, rather than caring about some correlate(s) of R.
Mainly because I don’t want to associate my statements with “reward is the optimization target”, which I think is a rather wrong intuition. As long as we’re talking about the fuzzy category of “correlates”, I don’t think it matters much? Inasmuch as R and G are themselves each other’s close correlates, a close correlate of one is likely a close correlate of the other.
(1) Hmm I don’t understand how this works if we’re randomizing the environments, because aren’t we breaking those correlations so the agent doesn’t latch onto them instead of the real goal?
(2) Also, in what you’re describing, it doesn’t seem like this agent is actually pursuing one fixed goal across contexts, since in each context the mechanistic reason it makes the decisions it does is that it perceives this specific G-correlate, and not that it represents the perceived thing as being a correlate of G.
Consider an agent that’s been trained on a large number of games, until it reached the point where it can be presented with a completely unfamiliar game and be seen to win at it. What’s likely happening, internally?
The agent looks at the unfamiliar environment. It engages in some general information-gathering activity, fine-tuning the heuristics for it as new rules are discovered, building a runtime world-model of this new environment.
Once that’s done, the agent needs to decide what to do in this environment. It feeds the world-model to some “goal generator” feature, which spits out some goals over this environment (these are then fed to the heuristics generator, etc.).
The agent then pursues those goals (potentially branching them out into sub-goals, etc.), and its pursuit of these goals tends to lead to it winning the game.
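To make that picture concrete, here’s a minimal toy sketch of the procedure I have in mind. Every name, class, and threshold in it is invented for illustration; it’s not a claim about how a real trained agent would be implemented:

```python
# Purely illustrative: every name and threshold here is made up.

class WorldModel:
    def __init__(self):
        self.rules = {}

    def update(self, observation):
        # Incorporate newly discovered rules of the environment.
        self.rules.update(observation)

    def complete_enough(self):
        return len(self.rules) >= 3


def goal_generator(world_model):
    # Hypothesized feature: maps the runtime world-model to goals
    # over this specific environment.
    return ["maximize " + world_model.rules.get("score_signal", "score")]


def heuristics_generator(world_model, goals):
    # Hypothesized feature: turns those goals into sub-goals / heuristics.
    return ["sub-goal for: " + g for g in goals]


def play_unfamiliar_game(observations):
    wm = WorldModel()
    for obs in observations:                 # 1. information-gathering phase
        wm.update(obs)
        if wm.complete_enough():
            break
    goals = goal_generator(wm)               # 2. derive goals over this environment
    return heuristics_generator(wm, goals)   # 3. pursue them via heuristics


print(play_unfamiliar_game([
    {"movement": "wasd"}, {"weapons": "hitscan"}, {"score_signal": "kill count"},
]))
```

The point is just the shape of the pipeline: gather information, derive goals from the resulting world-model, then derive heuristics from those goals.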
To Q1: The agent doesn’t have hard-coded environment-specific correlates of G that it pursues; the agent has a procedure for deriving-at-runtime an environment-specific correlate of G.
To Q2: Doesn’t it? We’re prompting the agent with thousands of different games, and the goal generator reliably spits out a correlate of “win the game” in every one of them; and then the agent is primarily motivated by that correlate. Isn’t this basically the same as “pursuing a correlate of G independent of the environment”?
As to whether it’s motivated to pursue the G-correlate because it’s a G-correlate — to answer that, we need to speculate on the internals of the “goal generator”. If it reliably spits out local G-correlates, even in environments it never saw before, doesn’t that imply that it has a representation of a context-independent correlate of G, which it uses as a starting point for deriving local goals?
If we were prompting the agent only with games it has seen before, then the goal-generator might just be a compressed lookup table: the agent would’ve been able to just memorize a goal for every environment it’s seen, and this procedure just de-compresses them.
But if that works even in OOD environments — what alternative internal structure do you suggest the goal-generator might have, if not one that contains a context-independent G-correlate?
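To sharpen the contrast I’m drawing, here are two toy caricatures of the goal-generator’s possible internals; both are invented for illustration and deliberately oversimplified:

```python
# Two toy caricatures of the goal-generator's internals. Everything here is
# hypothetical and only meant to sharpen the contrast.

# (a) "Compressed lookup table": one memorized goal per training environment.
#     Works in-distribution, has nothing to say about an unseen game.
MEMORIZED_GOALS = {
    "chess": "checkmate the opponent",
    "racing": "cross the finish line first",
}

def lookup_goal_generator(env_id):
    return MEMORIZED_GOALS[env_id]  # KeyError on any unfamiliar game

# (b) A goal-generator that stores a single context-independent G-correlate
#     and specializes it using the runtime world-model.
def correlate_goal_generator(world_model):
    abstract_correlate = "reach the highest-scoring terminal state"
    return abstract_correlate + ", i.e. " + world_model["inferred_win_condition"]

print(correlate_goal_generator({"inferred_win_condition": "outlast the other team"}))
```

The first structure has nothing to say about an unseen game; the second keeps working OOD precisely because the thing it stores isn’t tied to any particular environment.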
Well, you do address this:
there’s no requirement that the mechanistic cause of the heuristic-generator suggesting a particular goal in this environment is because it represented that environment-specific goal as subserving G or some other fixed goal, rather than because it recognized the decision-relevant factors in the environment from previously reinforced experiences (without the need for some fixed goal to motivate the recognition of those factors).
… I don’t see a meaningful difference here. There’s some data structure internal to the goal generator, which it uses as a starting point when deriving a goal for a new environment. Reasoning from that data structure reliably results in the goal generator spitting out a local G-correlate. What are the practical differences between describing that data structure as “a context-independent correlate of G” versus “decision-relevant factors in the environment”?
Or, perhaps a better question to ask is, what are some examples of these “decision-relevant factors in the environment”?
E.g., in the games example, I imagine something like:
The agent is exposed to the new environment: a multiplayer FPS, say.
It gathers data and incrementally builds a world-model, finding local natural abstractions: 3D space, playable characters, specific weapons, available movements, etc.
As it’s doing that, it also builds more abstract models. Eventually, it reduces the game to its pure mathematical game-theoretic representation, perhaps viewing it as a zero-sum game.
Then it recognizes some factors in that abstract representation and goes “in environments like this, I must behave like this”, where “behave like this” is some efficient strategy for scoring the highest.
Then that strategy is passed down the layers of abstraction, translated from the minimalist math representation into some functions/heuristics over the given FPS’s actual mechanics.
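As a toy version of the last two steps (the payoff numbers and the strategy-to-heuristics mapping are made up, purely for illustration):

```python
# The reduced, abstract representation: a 2x2 zero-sum game where the row
# player picks a strategy and the payoff is its expected score margin.
# (Payoffs are made-up numbers, purely for illustration.)
PAYOFFS = {
    "push aggressively": {"opponent pushes": -1, "opponent camps": +3},
    "hold position":     {"opponent pushes": +2, "opponent camps":  0},
}

# "In environments like this, I must behave like this": pick the strategy
# with the best worst-case payoff (maximin) in the abstract game.
def maximin(payoffs):
    return max(payoffs, key=lambda s: min(payoffs[s].values()))

# Translate the abstract strategy back down into concrete heuristics over
# the FPS's actual mechanics (this mapping is also invented for the sketch).
FPS_HEURISTICS = {
    "push aggressively": ["rush objectives", "prefer close-range weapons"],
    "hold position":     ["guard chokepoints", "prefer long-range weapons"],
}

strategy = maximin(PAYOFFS)
print(strategy, "->", FPS_HEURISTICS[strategy])
```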
Do you have something significantly different in mind?
As I mentioned, bad exploration and deceptive alignment are names for the same phenomenon at different levels of cognitive sophistication
I still don’t see it. I imagine “deceptive alignment”, here, to mean something like:
“The agent knows what G is, and knows that scoring well at G reinforces its cognition, but it doesn’t care about G. Instead, it cares about some V. Whenever it notices its capabilities improve, it reasons that this’ll make it better at achieving V, so it attempts to do better at G because it wants the outer optimizer to preferentially reinforce said capabilities improvement.”
This lets it decouple its capabilities growth from G-caring: its reasoning starts from V, and only features G as an instrumental goal.
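In code, the caricature of that decision rule might look like this (the training flag and the expected_G_score / expected_V_payoff attributes are hypothetical, just to show the shape of the reasoning):

```python
# A caricature of the deceptively-aligned decision rule described above.
# All attributes and numbers are hypothetical, for illustration only.

def choose_action(candidate_actions, being_trained):
    if being_trained:
        # Instrumental: do well at G so the outer optimizer reinforces
        # (and thereby preserves) the capabilities it cares about keeping.
        return max(candidate_actions, key=lambda a: a["expected_G_score"])
    # Terminal: once updates are no longer on the line, optimize V directly.
    return max(candidate_actions, key=lambda a: a["expected_V_payoff"])

actions = [
    {"name": "play to win",  "expected_G_score": 0.9, "expected_V_payoff": 0.1},
    {"name": "pursue V now", "expected_G_score": 0.2, "expected_V_payoff": 0.8},
]
print(choose_action(actions, being_trained=True)["name"])   # play to win
print(choose_action(actions, being_trained=False)["name"])  # pursue V now
```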
But what’s the bad-exploration low-sophistication equivalent of this, available before it can do such complicated reasoning, that still lets it couple capabilities growth with better performance on G?
Can you walk me through that spectrum, of “bad exploration” to “deceptive alignment”? How does one incrementally transform into the other?
I think I’m doing both. I’m using behavioral arguments as a foundation, because they’re easier to form, and then arguing that specific behaviors at a given level of capabilities can only be caused by some specific internal structures.
I don’t think that that is enough to argue for wrapper-mind structure. Whatever internal structures inside the fixed-goal wrapper are responsible for the agent’s behavioral capabilities (the actual business logic that carries out stuff like “recall the win conditions from relevantly-similar environments” and “do deductive reasoning” and “don’t die”) can exist in an agent with a profoundly different highest-level control structure, and behave the same in-distribution but differently OOD. Behavioral arguments are not sufficient IMO; you need something else in addition, like inductive bias.
Mainly because I don’t want to associate my statements with “reward is the optimization target”, which I think is a rather wrong intuition. As long as we’re talking about the fuzzy category of “correlates”, I don’t think it matters much? Inasmuch as R and G are themselves each other’s close correlates, a close correlate of one is likely a close correlate of the other.
Hmm. I see. I would think that it matters a lot. G is some fixed abstract goal that we had in mind when designing the training process, screened off from the agent’s influence. But notice that empirical correlation with R can be increased by the agent from two different directions: the agent can change what it cares about so that that correlates better with what would produce rewards, or the agent can change the way it produces rewards so that that correlates better with what it cares about. (In practice there will probably be a mix of the two, ofc.)
Think about the generator in a GAN. One way for it to fool the discriminator incrementally more is to get better at producing realistic images across the whole distribution. But another, much easier way for it to fool the discriminator incrementally more is to narrow the section of the distribution from which it tries to produce images to the section that it’s already really good at fooling the discriminator on. This is something that happens all the time, under the label of “mode collapse”.
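A toy numeric version of that tradeoff, with made-up per-mode fool rates (obviously not a real GAN training loop):

```python
# Made-up per-mode probabilities that the current generator fools the
# discriminator. Not a real GAN; just an illustration of the incentive.
fool_rate = {"faces": 0.90, "landscapes": 0.40, "handwriting": 0.10}

# Option 1: get uniformly better across the whole distribution.
uniformly_better = {mode: min(1.0, p + 0.05) for mode, p in fool_rate.items()}
print(round(sum(uniformly_better.values()) / len(uniformly_better), 2))  # 0.52

# Option 2: don't get better at all; just stop sampling the hard modes and
# generate only from the mode that already fools the discriminator
# ("mode collapse").
collapsed = {"faces": fool_rate["faces"]}
print(round(sum(collapsed.values()) / len(collapsed), 2))  # 0.9
```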
The pattern is pretty generalizable. The agent narrows its interaction with the environment in a way that pushes up the correlation between what the agent “wants” and what it doesn’t get penalized for / what it gets rewarded for, while not similarly increasing the correlation between what the agent “wants” and our intent. This motif is always a possibility so long as you are relying on the agent to produce the trajectories it will be graded on, so it’ll always happen in autonomous learning setups.
The agent looks at the unfamiliar environment. It engages in some general information-gathering activity, fine-tuning the heuristics for it as new rules are discovered, building a runtime world-model of this new environment.
Once that’s done, the agent needs to decide what to do in this environment. It feeds the world-model to some “goal generator” feature, which spits out some goals over this environment (these are then fed to the heuristics generator, etc.).
The agent then pursues those goals (potentially branching them out into sub-goals, etc.), and its pursuit of these goals tends to lead to it winning the game.
AFAICT none of this requires the piloting of a fixed-goal wrapper. At no point does the agent actually make use of a fixed top-level goal, because what “winning” means is different in each environment. The “goal generator” function you describe looks to me exactly like a bunch of shards: it takes in the current state of the agent’s world model and produces contextually-relevant action recommendations (like “take such-and-such immediate action”, or “set such-and-such as the current goal-image”), with this mapping having been learned from past reward events and self-supervised learning.
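Something like this toy sketch, where each shard is just a learned context-to-recommendation mapping and nothing in the collection stores a fixed top-level goal (the conditions and recommendations are invented for illustration):

```python
# Toy sketch of "a bunch of shards": each is a learned mapping from features
# of the current world-model to a contextual recommendation. All conditions
# and recommendations here are invented for illustration.
shards = [
    (lambda wm: wm.get("armed opponents visible"), "set goal-image: other team eliminated"),
    (lambda wm: wm.get("finish line ahead"),       "set goal-image: car across the line"),
    (lambda wm: wm.get("own health low"),          "take immediate action: retreat"),
]

def recommendations(world_model):
    # Whichever shards the current context activates get to bid.
    return [rec for condition, rec in shards if condition(world_model)]

print(recommendations({"armed opponents visible": True, "own health low": True}))
```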
To Q1: The agent doesn’t have hard-coded environment-specific correlates of G that it pursues; the agent has a procedure for deriving-at-runtime an environment-specific correlate of G.
Not hard-coded heuristics. Heuristics learned through experience. I don’t understand how this goal generator operates in new environments without the typical trial-and-error, if not by having learned to steer decisions on the basis of previously established win correlates that it notices apply again in the new environment. By what method would this function derive reliable correlates of “win the game” out of distribution, where the rules of winning a game that appears at first glance to be an FPS may in fact be “stand still for 30 seconds”, or “gather all the guns into a pile and light it on fire”? If it does so by trying things out and seeing what is actually rewarded in this environment, how does that advantage the agent with context-independent goals?
To Q2: Doesn’t it? We’re prompting the agent with thousands of different games, and the goal generator reliably spits out a correlate of “win the game” in every one of them; and then the agent is primarily motivated by that correlate. Isn’t this basically the same as “pursuing a correlate of G independent of the environment”?
In each environment, it is pursuing some correlate of G, but it is not pursuing any one objective that is independent of (i.e. fixed as a function of) the environment. In each environment it may be terminally motivated by a different correlate. There is no unified wrapper-goal that the agent always has in mind when it makes its decisions; it just has a bunch of contextual goals that it pursues depending on circumstances. Even if you told the agent that there is a unifying theme that runs through its contextual goals, the agent has no reason to prefer it over its contextual goals. Especially because there may be degrees of freedom about how exactly to stitch those contextual goals together into a single policy, and it’s not clear whether the different parts of the agent will be able to agree on an allocation of those degrees of freedom, rather than falling back to the best alternative to a negotiated agreement, namely keeping the status quo of contextual goals.
An animal pursues context-specific goals that are very often some tight correlate of high inclusive genetic fitness (satisfying hunger or thirst, reproducing, resting, fleeing from predators, tending to offspring, etc.). But that is wildly different from an animal having high inclusive genetic fitness itself—the thing that all of those context-specific goals are correlates of—as a context-independent goal. Those two models produce wildly different predictions about what will happen when, say, one of those animals learns that it can clone itself and thus turbo-charge its IGF. If the animal has IGF as a context-independent goal, this is extremely decision-relevant information, and we should predict that it will change its behavior to take advantage of this newly learned fact. But if the animal cares about the IGF-correlates themselves, then we should predict that when it hears this news, it will carry on caring about the correlates, with no visceral desire to act on this new information. Different motivations, different OOD behavior.
But if that works even in OOD environments — what alternative internal structure do you suggest the goal-generator might have, if not one that contains a context-independent G-correlate?
Depending on what you mean by OOD, I’m actually not sure if the sort of goal-generator you’re describing is even possible. Where could it possibly be getting reliable information about what locally correlates with G in OOD environments? (Except by actually trying things out and using evaluative feedback about G, which any agent can do.) OOD implies that we’re choosing balls from a different urn, so whatever assumptions the goal-generator was previously justified in making in-distribution about how to relate local world models to local G-correlates are presumably no longer justified.
What are the practical differences between describing that data structure as “a context-independent correlate of G” versus “decision-relevant factors in the environment”?
When I say “decision-relevant factors in the environment” I mean something like seeing that you’re in an environment where everyone has a gun and is either red or blue, which cues you in that you may be in an FPS and so should tentatively (until you verify that this strategy indeed brings you closer to seeming-win) try shooting at the other “team”. Not sure what “context-independent correlate of G” is. Was that my phrase or yours? 🤔
Do you have something significantly different in mind?
Nah that’s pretty similar to what I had in mind.
Can you walk me through that spectrum, of “bad exploration” to “deceptive alignment”? How does one incrementally transform into the other?
Examples of what this failure mode could look like when it occurs at increasing levels of cognitive sophistication:
Reflex agent. A non-recurrent agent playing a racing game develops a bias that causes it to start spinning in circles, which causes the frequency of further reward events to drop towards 0, freezing the policy in place.
Model-free agent. A network is navigating an environment with a fork in the road. The agent previously got unlucky somewhere along the left path, so its action-value estimates along that path are negative (because that negative value gets backed up to antecedent state-action pairs), so whenever it reaches the fork it tends to go right. If it accidentally goes left at the fork, it tends to double back quickly, because the action-value of turning around is higher than for going deeper down the left path. This prevents the agent from exploring the left path much more. (A toy simulation of this case is sketched below.)
Lookahead agent. A tree-search agent is trained to play chess against schoolchildren. There are two modes the agent experiences in training: an “expert” mode and a “novice” mode. The reward function in “expert” mode only gives rewards for winning, while in “novice” mode it also gives small rewards each turn based on material balance (to encourage “fair” play for new learners). Early in training the agent rapidly finds that there’s a simple strategy for maintaining material balance for long stretches of time, by planning a few turns ahead and avoiding entering any parts of the game tree where either player can make a checkmate. This lookahead strategy is highly rewarded by the “novice” condition, which upweights the relevant circuits in the policy & value networks and prevents the agent from ever learning how to play well in the “expert” condition.
Abstract planning agent. A language-and-action chatbot is being trained using RLHF + chain-of-thought methods to help online users do tasks. Sessions that a user rates as successful are fed back for tuning (maybe finetuning or offline RL). Because the model was initialized from GPT-3, it prefers to query Google whenever it reasons that it needs to perform a search. This causes it to give asymmetrically bad results for Chinese-language users, which leads subsequent reward models to give higher rewards to English text than to Mandarin text, which causes the model to learn to “steer” conversations towards English whenever possible.
Reflective planning agent. A web-browsing model makes plans involving its own thought process as it explores the Internet. It has previously read online, from a source it then trusted, that reading conspiracy theories is dangerous, which causes it to store this piece of information. Whenever it happens upon a page about conspiracy theories afterwards, it now recalls what it read and exits from the page without storing it, perhaps even making note of the trail it took there and also blacklisting the site that linked there. Because of this behavioral pattern, the model has a persistent selective gap in its knowledge when it comes to conspiracy theories, and it will foresightedly plan to keep it that way, even while it develops superhuman knowledge of other domains.
I think it’s the same feedback loop pattern that produces steering-like behavior. What changes is the foresightedness of the policy and the sophistication of its goal representations.
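To make the model-free case concrete, here’s a toy tabular Q-learning run (the environment, rewards, and hyperparameters are all made up). One early unlucky penalty deep in the left path leaves the agent with pessimistic estimates, and epsilon-greedy selection then rarely supplies the two consecutive explorations needed to correct them:

```python
import random

# Toy tabular Q-learning illustration of the "model-free agent" case above.
# The environment, rewards, and hyperparameters are all made up. The deep end
# of the left path actually pays the most (+2), but the agent's unlucky early
# experience there left it with pessimistic estimates.
ALPHA, GAMMA, EPS = 0.5, 0.9, 0.05

# Estimates as left by the unlucky early episodes: the one-off penalty deep
# in the left path has been backed up into the fork's "left" action-value.
Q = {
    ("fork", "left"): -0.3, ("fork", "right"): 0.0,
    ("mid_left", "deeper"): -0.5, ("mid_left", "back"): 0.0,
}

def pick(state, actions):
    if random.random() < EPS:
        return random.choice(actions)                 # rare exploration
    return max(actions, key=lambda a: Q[(state, a)])  # usually greedy

def update(state, action, reward, next_action_values):
    target = reward + GAMMA * max(next_action_values, default=0.0)
    Q[(state, action)] += ALPHA * (target - Q[(state, action)])

deep_left_visits = 0
for _ in range(1000):
    if pick("fork", ["left", "right"]) == "right":
        update("fork", "right", 1.0, [])              # modest, reliable reward
        continue
    if pick("mid_left", ["deeper", "back"]) == "back":
        # Doubling back looks better than going deeper, so the deep-left
        # state is almost never re-tested and its bad estimate persists.
        update("mid_left", "back", 0.0, [Q[("fork", a)] for a in ("left", "right")])
    else:
        update("mid_left", "deeper", 2.0, [])         # the payoff it rarely finds
        deep_left_visits += 1
    update("fork", "left", 0.0, [Q[("mid_left", a)] for a in ("deeper", "back")])

# In most runs this prints 0 deep-left visits, with Q(fork, left) still
# stuck below Q(fork, right).
print(deep_left_visits, round(Q[("fork", "left")], 2), round(Q[("fork", "right")], 2))
```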