I think that training setups that do not facilitate something like bootstrapping (i.e. modifying parameters even in some cases where there was no reward), are not competitive and will not produce AGIs
Yeah, I see that’s one of the main points of disconnect between our models. Not in the sense that I necessarily disagree, in the sense that I wasn’t familiar with this factor. We probably aren’t going to resolve this conclusively until I get around to reading the TD stuff (which I plan to do shortly).
Thanks for the links!
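(For anyone else following along: as I understand the term from the quoted claim, “bootstrapping” means updating value estimates toward other estimates, so parameters can change on steps where the reward is zero. Below is a minimal tabular TD(0) sketch on a made-up five-state chain where only the final transition is rewarded. The function name, the chain, and the hyperparameters are all illustrative assumptions.)

```python
def td0_chain(n_states=5, episodes=200, alpha=0.1, gamma=1.0):
    """Tabular TD(0) on a deterministic chain 0 -> 1 -> ... -> terminal.

    Only the last transition pays reward 1; every other step pays 0.
    Returns the learned state values and the number of updates that
    changed a value on a zero-reward step.
    """
    values = [0.0] * (n_states + 1)  # last entry is the terminal state, fixed at 0
    zero_reward_updates = 0
    for _ in range(episodes):
        for s in range(n_states):
            reward = 1.0 if s == n_states - 1 else 0.0
            # Bootstrapped target: reward plus the current estimate of the next state.
            target = reward + gamma * values[s + 1]
            delta = target - values[s]
            if reward == 0.0 and delta != 0.0:
                zero_reward_updates += 1
            values[s] += alpha * delta
    return values[:n_states], zero_reward_updates


values, zero_reward_updates = td0_chain()
print(values, zero_reward_updates)
```

The early states only ever see zero reward, yet their values still move, because each one is pulled toward the estimate of its successor.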
Maybe the model doesn’t have any context-agnostic “values” (not even “values” about pursuing R) until after it has some decent heuristic-generation machinery built up.
What’s it using the heuristics generator for, then? It’s a tool for figuring out how to pursue a goal in a context you’re unfamiliar with. But if you have no context-independent goals, you can’t define a goal over a context you’re unfamiliar with, so you don’t need the heuristics generator to begin with.
OR (the most likely scenario, IMO) maybe the outer objective performance is in fact correlated with the model’s ability to perform well according to its own values. For instance, the training process is teaching the model better general purpose heuristics-generation machinery, which will also make it better at pursuing its own values (because that machinery is generally useful no matter what your goals are).
Absolutely, I expect that to be the primary reason for deceptive alignment — once the model is smart enough for it.
But in this case, I argue that the heuristics generator will only be reinforced if its activity results in better performance along an outer-approved metric, which will only happen if it’s outputting heuristics useful for the outer-approved metric — which, in turn, will only happen if the model uses the heuristics generator to generate heuristics for an outer-approved value.
I’m not arguing that the heuristics generator will be specialized; I’m arguing that its improvements will be entangled with how it’s used.
E. g., two training episodes: in one the model asks for better heuristics for winning the race, in the other it asks for better donut-making heuristics.
In the former case, the heuristics generator will be reinforced, together with the model’s tendency to ask it for such heuristics.
In the latter, it wouldn’t be improved, nor would the tendency to ask it for this be reinforced.
Repeat over many episodes, and we get a very advanced general-purpose heuristics generator that is only ever prompted for race-winning heuristics.
(Or, rather, that the “command structure” around the heuristics generator is such that it only uses it in service of generating heuristics useful for winning the race. If the model can factorize the problem down to “build a boat” before engaging the heuristics generator, that works for me — as long as it arrived at this ask by reasoning from what will help it win the race. I’m not saying it’d be unable to use the heuristics generator flexibly.)
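A toy sketch of the two-episode story above, under heavy assumptions: the “ask” is a single binary choice per episode, the outer reward only ever scores race-relevant asks, and reinforcement only touches whatever was active on a rewarded episode. All names and numbers are made up for illustration, and this is not a claim about how any real training setup is implemented.

```python
import math
import random


def train_asks(episodes=500, lr=0.1, seed=0):
    """Toy loop: the policy picks what to ask the generator for; only race asks pay."""
    rng = random.Random(seed)
    logit_race = 0.0       # preference for asking for race-winning heuristics
    generator_skill = 0.0  # stand-in for the generator's general capability
    asks = {"race": 0, "donut": 0}
    for _ in range(episodes):
        p_race = 1.0 / (1.0 + math.exp(-logit_race))
        ask = "race" if rng.random() < p_race else "donut"
        asks[ask] += 1
        reward = 1.0 if ask == "race" else 0.0  # outer metric ignores donuts
        if ask == "race":
            logit_race += lr * reward  # the tendency to ask this is reinforced
        generator_skill += lr * reward  # the generator improves only on rewarded use
    p_final = 1.0 / (1.0 + math.exp(-logit_race))
    return p_final, generator_skill, asks


p_final, generator_skill, asks = train_asks()
print(p_final, generator_skill, asks)
```

The generator's skill counter and the race-asking habit grow together, and donut asks are never strengthened. The sketch deliberately leaves out within-episode subgoals, so it cannot say anything about how flexibly the generator gets used once it is engaged.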
aren’t you defending the claim that the agent will value R/G, not that it will merely value some correlate of it?
Ehh, not exactly. I’m defending the claim that the agent will tend to be shaped to care about increasingly closer correlates of G as training goes on; and that in a hypothetical “idealized” training setup, it’d care about G precisely. When I say things like “the heuristics generator will be asked for race-winning heuristics”, I really mean “the heuristics generator will be asked for heuristics that the model ultimately intends to use for a goal that is a close correlate of winning the race”, but that’s a mouthful.
Basically, I think there are two forces there:
What are the ultimate goals the heuristics generator is used for pursuing.
How powerful the heuristics generator is.
And the more powerful it is, the more tails come apart — the closer the goal it’s used for needs to be to G, for the agent’s performance on G to not degrade as the heuristics generator’s power grows (because the model starts being able to optimize for G-proxy so hard it decouples from G). So, until the model learns deceptive alignment, I’d expect it to go in lockstep: a little improvement to power, then a little improvement to alignment-to-G to counterbalance it, etc.
And so in the situation where the outer optimizer is the only source of reinforcement, we’d have the heuristics generator either:
Stagnate at some “power level” (if the model adamantly refuses to explore towards caring more about G).
Become gradually more and more pointed at G (until it becomes situationally aware and hacks out, obviously — which, outside idealized setups, will surely happen well before it’s actually pointed at G directly).
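To make the “tails come apart” point above concrete, here is a small simulation under assumptions I am making up for the purpose: each candidate plan has a true G-score drawn from a standard normal, the proxy is that score plus independent standard-normal noise, and “power” is just how many candidates get searched before picking the proxy-best one.

```python
import random


def select_by_proxy(n_candidates, trials=300, seed=0):
    """Pick the proxy-best of n candidates; return mean proxy and mean true score."""
    rng = random.Random(seed)
    total_proxy = 0.0
    total_true = 0.0
    for _ in range(trials):
        best_proxy, best_true = float("-inf"), 0.0
        for _ in range(n_candidates):
            true_g = rng.gauss(0.0, 1.0)          # hypothetical true G-score
            proxy = true_g + rng.gauss(0.0, 1.0)  # noisy G-proxy
            if proxy > best_proxy:
                best_proxy, best_true = proxy, true_g
        total_proxy += best_proxy
        total_true += best_true
    return total_proxy / trials, total_true / trials


gaps = []
for power in (1, 10, 100):
    mean_proxy, mean_true = select_by_proxy(power)
    gaps.append(mean_proxy - mean_true)
    print(power, round(mean_proxy, 2), round(mean_true, 2))
```

In this toy the gap between proxy score and true score widens as search power grows. It only shows the decoupling; it does not show performance on G itself degrading, which would need a proxy that actively trades off against G at the extremes.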
What’s it using the heuristics generator for, then? It’s a tool for figuring out how to pursue a goal in a context you’re unfamiliar with. But if you have no context-independent goals, you can’t define a goal over a context you’re unfamiliar with, so you don’t need the heuristics generator to begin with.
Why can’t you? The activations from observations coming in from the environment and from the agent’s internal state will activate some contextual decision-influences in the agent’s mind. Situational unfamiliarity does not mean its mind goes blank, any more than an OOD prompt makes GPT’s mind go blank. The agent is gonna think something when it wakes up in an environment, and that something will determine how and when the agent will call upon the heuristic-generator. Maybe it first queries it with a subgoal of “acquire information about my action space” or something, I dunno.
The agent that has a context-independent goal of “win the race” is in a similar predicament: it has no way of knowing a priori what “winning the race” requires or consists of in this unfamiliar environment (neither does its heuristic-generator), no way to ground this floating motivational pointer concretely. It’s gotta try stuff out and see what this environment actually rewards, just like everybody else. The agent could have a preexisting desire to pursue whatever “winning the race” looked like in past experiences. But I thought the whole point of this randomization/diversity business was to force the agent to latch onto “win the race” as an exclusive aim and not onto its common correlates, by thrusting the agent into an unfamiliar context each time around. If so, then previous correlates shouldn’t be reliable correlates anymore in this new context, right? Or else it can just learn to care about those rather than the goal you intended.
So I don’t see how the agent with a context-independent goal has an advantage in this setup when plopped down into an unfamiliar environment.
I’m not arguing that the heuristics generator will be specialized; I’m arguing that its improvements will be entangled with how it’s used.
I agree with this.
Repeat over many episodes, and we get a very advanced general-purpose heuristics generator that is only ever prompted for race-winning heuristics.
Why? I was imagining that the agent may prompt the heuristic-generator at multiple points within a single episode, inputting whatever subgoal it currently needs to generate heuristics for. If the agent is being put in super diverse environments, then these subgoals will be everything under the sun, so the heuristic-generator will have been prompted for lots of things. And if the agent is only being put in a narrow distribution of environments, then how is the heuristic-generator supposed to learn general-purpose heuristic-generation?
(Or, rather, that the “command structure” around the heuristics generator is such that it only uses it in service of generating heuristics useful for winning the race. If the model can factorize the problem down to “build a boat” before engaging the heuristics generator, that works for me — as long as it arrived at this ask by reasoning from what will help it win the race. I’m not saying it’d be unable to use the heuristics generator flexibly.)
Can there be additional layers of “command structure” on top of that? Like, can the agent have arrived at the “reasoning from what will help it win the race” thought by reasoning from something else? (Or is this a fixed part of the architecture?) If not, then won’t this have the problem that for a long time, the agent will be terrible at reasoning about what will help it win the race (especially in new environments), which means that starting with that will be a worse-performing strategy than starting with something else (like random exploration etc.)? And then that will disincentivize making this the first/outermost/unconditional function call? So then the agent learns not to unconditionally start with reasoning from that point, and instead to only sometimes reason from that point, conditional on context?
I’m defending the claim that the agent will tend to be shaped to care about increasingly closer correlates of G as training goes on
Hmm. I am skeptical of that claim, though maybe less so depending on what exactly you mean[1].
Consider a different claim that seems mechanistically analogous to me:
The mean absolute fitness of a population tends to increase over the course of natural selection
Yes it is true that [differential reinforcement | relative fitness] is a selection pressure acting on the makeup of [things cared about | traits] across the [circuits | individuals] within a [agent | population], but AFAICT it is not true that the [agent | population] increases in [reward performance | absolute fitness] over the course of continual selection pressure.
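One standard toy case of the population side of this analogy is discrete replicator dynamics on a Prisoner's Dilemma payoff matrix: defectors always have higher relative fitness, so they spread, and mean absolute fitness falls as they do. The payoff values and starting fraction below are illustrative assumptions.

```python
def replicator_pd(x_coop=0.99, steps=200, R=3.0, S=0.0, T=5.0, P=1.0):
    """Discrete replicator dynamics for cooperators vs. defectors.

    R, S, T, P are the usual Prisoner's Dilemma payoffs (T > R > P > S).
    Returns a list of (cooperator fraction, mean fitness) per generation.
    """
    history = []
    for _ in range(steps):
        f_coop = x_coop * R + (1 - x_coop) * S
        f_def = x_coop * T + (1 - x_coop) * P
        mean_fitness = x_coop * f_coop + (1 - x_coop) * f_def
        history.append((x_coop, mean_fitness))
        # Each type's share is rescaled by its fitness relative to the mean.
        x_coop = x_coop * f_coop / mean_fitness
    return history


history = replicator_pd()
print(history[0], history[-1])
```

Selection is acting the whole time, yet mean fitness slides from close to R down to P.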
So, until the model learns deceptive alignment, I’d expect it to go in lockstep: a little improvement to power, then a little improvement to alignment-to-G to counterbalance it, etc.
Yeah that may be a part of where our mental models differ. I don’t expect the balance of how much power the agent has over training vs. how close its goals are to the outer criterion to go in lockstep. I see “deceptive alignment” as part of a smooth continuum of agent-induced selection that can decouple the agent’s concerns from the optimization process’ criteria, with “the agent’s exploration is broken” as a label for the cognitively less sophisticated end of that continuum, and “deceptive alignment” as a label for the cognitively more sophisticated end of that continuum. And I think that even the not-explicitly-intended pressures at the unsophisticated end of that continuum are quite strong, enough to make the “the agent tends to be shaped to care about increasingly closer correlates of G” abstraction leak hard.
Like, for a given training run, as the training run progresses, the agent will be shaped to care about closer and closer correlates of G? (Just closer on average? Monotonically closer? What about converging at some non-G correlate?) Or like, among a bunch of training runs, as the training runs progress, the closeness of the [[maximally close to G] correlate that any agent cares about] to G keeps increasing?
Hmm, I wonder if we actually broadly agree about the mechanistic details, but are using language that makes both of us think the other has different mechanistic details in mind?
(Also, do note if I’m failing to answer some important question you pose. I’m trying to condense responses and don’t answer to everything if I think the answer to something is evident from a model I present in response to a different question, but there may be transparency failures involved.)
Can there be additional layers of “command structure” on top of that? Like, can the agent have arrived at the “reasoning from what will help it win the race” thought by reasoning from something else?
Mm, yes, in a certain sense. Further refining: “over the course of training, agents tend to develop structures that orient them towards ultimately pursuing a close correlate of G regardless of the environment they’re in”. I do imagine that a given agent may orient themselves towards different G-correlates depending on what specific stimuli they’ve been exposed to this episode/what context they’ve started out in. But I argue that it’ll tend to be a G-correlate, and that the average closeness of G-correlates across all contexts will tend to decrease as training goes on.
E. g., suppose the agent is trained on a large set of different games, and the intended G is to teach it to value winning. I argue that, if we successfully teach the agent autonomy (i. e., it wouldn’t just be a static bundle of heuristics, but it’d have a heuristics generator that’d allow it to adapt even to OOD games), there’d be some structure inside it which:
Analyses the game it’s in[1] and spits out some primary goal[2] it’s meant to achieve in it,
… and then all prompting of the heuristics-generator is downstream of that primary goal/in service to it,
… and that environment-specific goal is always a close correlate of G, such that pursuing it in this environment correlates with promoting G/would be highly reinforced by the outer optimizer[3],
… and that as training goes on, the primary environment-specific goals this structure spits out will be closer and closer to G.
I see “deceptive alignment” as part of a smooth continuum of agent-induced selection that can decouple the agent’s concerns from the optimization process’ criteria, with “the agent’s exploration is broken” as a label for the cognitively less sophisticated end of that continuum
Sure, but I’m arguing that on the cognitively less sophisticated end of the spectrum, in the regime where the outer optimizer is the only source of reinforcement, the agent’s features can’t grow more sophisticated if the agent’s concerns decouple from the optimization process’ criteria.
The agent’s goals can decouple all it wants, but it’ll only grow more advanced if it growing more advanced is preferentially reinforced by the outer optimizer. And that’ll only happen if it being more advanced is correlated with better performance on outer-approved metrics.
Which will only happen if it uses its growing advancedness to do better at the outer-approved metrics.
Which can happen either via deceptive alignment, or by it actually caring about the outer-approved metrics more (= caring about a closer correlate of the outer-approved metrics (= changing its “command structure” such that it tends to recover environment-specific primary goals that are a closer correlate of the outer-approved metrics in any given environment)).
And if it can’t yet do deceptive alignment, and its exploration policy is such that it just never explores “caring about a closer correlate of the outer-approved metrics”, its features never grow more advanced.
Which may be done by active actions too, as you suggested — this process might start with the agent setting “acquire information about my environment” as its first (temporary) goal, even before it derives its “terminal” goal.
Hmm, I wonder if we actually broadly agree about the mechanistic details, but are using language that makes both of us think the other has different mechanistic details in mind?
Maybe? I dunno. It feels like the model that you are arguing for is qualitatively pretty different from the one I thought you were arguing for at the top of the thread (this might be my fault for misinterpreting the OP):
You are arguing about agents being behaviorally aligned in some way on distribution, not arguing about agents being structured as wrapper-minds
You are arguing that in the limit, what the agent cares about will either tend to correlate more and more closely to outer performance or “peter out” (from our perspective) at some fixed level of sophistication, not arguing that in the limit, what the agent cares about will unconditionally tend to correlate more and more closely to outer performance
You are arguing that agents of growing sophistication will increasingly tend to pursue some goal that’s a natural interpretation of the intent of R, not arguing that agents of growing sophistication will increasingly tend to pursue R itself (i.e. making decisions on the basis of R, even where R and the intended goal come apart)
You are arguing that the above holds in setups where the only source of parameter updates is episodic reward, not arguing that the above holds in general across autonomous learning setups
I don’t think I disagree all that much with what’s stated above. Somewhat skeptical of most of the claims, but I could definitely be convinced.
(Also, do note if I’m failing to answer some important question you pose. I’m trying to condense responses and don’t answer to everything if I think the answer to something is evident from a model I present in response to a different question, but there may be transparency failures involved.)
The part I think I’m still fuzzy on is why the agent limits out to caring about some correlate(s) of G, rather than caring about some correlate(s) of R.
Sure, but I’m arguing that on the cognitively less sophisticated end of the spectrum, in the regime where the outer optimizer is the only source of reinforcement, the agent’s features can’t grow more sophisticated if the agent’s concerns decouple from the optimization process’ criteria.
That’s fine. Again, I don’t think the setups where the end-of-episode rewards are the only source of reinforcement are setups where the agent’s cognition can grow relevantly sophisticated in any case, regardless of decoupling.
Mm, yes, in a certain sense. Further refining: “over the course of training, agents tend to develop structures that orient them towards ultimately pursuing a close correlate of G regardless of the environment they’re in”. I do imagine that a given agent may orient themselves towards different G-correlates depending on what specific stimuli they’ve been exposed to this episode/what context they’ve started out in. But I argue that it’ll tend to be a G-correlate, and that the average closeness of G-correlates across all contexts will tend to [increase] as training goes on.
Hmm I don’t understand how this works if we’re randomizing the environments, because aren’t we breaking those correlations so the agent doesn’t latch onto them instead of the real goal? Also, in what you’re describing, it doesn’t seem like this agent is actually pursuing one fixed goal across contexts, since in each context, the mechanistic reason why it makes the decisions it does is because it perceives this specific G-correlate in this context, and not because it represents that perceived thing as being a correlate of G.
… and that as training goes on, the primary environment-specific goals this structure spits out will be closer and closer to G.
AFAICT it will spit out the sorts of goals that it has been historically reinforced for spitting out in relevantly-similar environments, but there’s no requirement that the mechanistic cause of the heuristic-generator suggesting a particular goal in this environment is because it represented that environment-specific goal as subserving G or some other fixed goal, rather than because it recognized the decision-relevant factors in the environment from previously reinforced experiences (without the need for some fixed goal to motivate the recognition of those factors).
Sure, but I’m arguing that on the cognitively less sophisticated end of the spectrum, in the regime where the outer optimizer is the only source of reinforcement, the agent’s features can’t grow more sophisticated if the agent’s concerns decouple from the optimization process’ criteria.
I think (1) we probably won’t get sophisticated autonomous cognition within the kind of setup I think you’re imagining, regardless of coupling, and (2) knowing that the agent’s cognition won’t grow sophisticated in training-orthogonal ways seems kinda useful if we could do it, come to think of it.
And if it can’t yet do deceptive alignment, and its exploration policy is such that it just never explores “caring about a closer correlate of the outer-approved metrics”, its features never grow more advanced.
And so it stagnates and doesn’t go AGI.
As I mentioned, bad exploration and deceptive alignment are names for the same phenomenon at different levels of cognitive sophistication. So I don’t see why we should expect that the outer optimizer will asymptotically succeed at instilling the goal. In order to do that, it needs to fully build in the right cognition before the agent reaches a level of sophistication where, in the same way as RL runs early on can “effectively stop exploring” and that locks in the current policy, RL runs later on (at the point where the agent is advanced in the way you describe) can “effectively stop directing its in-context learning (or whatever other mechanism you’re saying would allow it to continue growing in advancedness without actually caring about the outer metrics more) at the intended goal” and that locks in its not-quite-correct goal. To say that that won’t happen, that it will always either lock itself in before this point or end up aligned to a (very close correlate of) G, you need to make some very specific claims about the empirical balance of selection.
You are arguing about agents being behaviorally aligned in some way on distribution, not arguing about agents being structured as wrapper-minds
I think I’m doing both. I’m using behavioral arguments as a foundation, because they’re easier to form, and then arguing that specific behaviors at a given level of capabilities can only be caused by some specific internal structures.
You are arguing that the above holds in setups where the only source of parameter updates is episodic reward, not arguing that the above holds in general across autonomous learning setups
Yeah, that’s a legitimate difference from my initial position: wasn’t considering alternate setups like this when I wrote the post.
The part I think I’m still fuzzy on is why the agent limits out to caring about some correlate(s) of G, rather than caring about some correlate(s) of R.
Mainly because I don’t want to associate my statements with “reward is the optimization target”, which I think is a rather wrong intuition. As long as we’re talking about the fuzzy category of “correlates”, I don’t think it matters much? Inasmuch as R and G are themselves each other’s close correlates, so a close correlate of one is likely a close correlate of another.
(1) Hmm I don’t understand how this works if we’re randomizing the environments, because aren’t we breaking those correlations so the agent doesn’t latch onto them instead of the real goal?
(2) Also, in what you’re describing, it doesn’t seem like this agent is actually pursuing one fixed goal across contexts, since in each context, the mechanistic reason why it makes the decisions it does is because it perceives this specific G-correlate in this context, and not because it represents that perceived thing as being a correlate of G.
Consider an agent that’s been trained on a large number of games, until it reached the point where it can be presented with a completely unfamiliar game and be seen to win at it. What’s likely happening, internally?
The agent looks at the unfamiliar environment. It engages in some general information-gathering activity, fine-tuning the heuristics for it as new rules are discovered, building a runtime world-model of this new environment.
Once that’s done, it needs to decide what to do in it. It feeds the world-model to some “goal generator” feature, and it spits out some goals over this environment (which are then fed to the heuristics generator, etc.).
The agent then pursues those goals (potentially branching them out into sub-goals, etc.), and its pursuit of these goals tends to lead to it winning the game.
To Q1: The agent doesn’t have hard-coded environment-specific correlates of G that it pursues; the agent has a procedure for deriving-at-runtime an environment-specific correlate of G.
To Q2: Doesn’t it? We’re prompting the agent with thousands of different games, and the goal generator reliably spits out a correlate of “win the game” in every one of them; and then the agent is primarily motivated by that correlate. Isn’t this basically the same as “pursuing a correlate of G independent of the environment”?
As to whether it’s motivated to pursue the G-correlate because it’s a G-correlate — to answer that, we need to speculate on the internals of the “goal generator”. If it reliably spits out local G-correlates, even in environments it never saw before, doesn’t that imply that it has a representation of a context-independent correlate of G, which it uses as a starting point for deriving local goals?
If we were prompting the agent only with games it has seen before, then the goal-generator might just be a compressed lookup table: the agent would’ve been able to just memorize a goal for every environment it’s seen, and this procedure just de-compresses them.
But if that works even in OOD environments — what alternative internal structure do you suggest the goal-generator might have, if not one that contains a context-independent G-correlate?
Well, you do address this:
there’s no requirement that the mechanistic cause of the heuristic-generator suggesting a particular goal in this environment is because it represented that environment-specific goal as subserving G or some other fixed goal, rather than because it recognized the decision-relevant factors in the environment from previously reinforced experiences (without the need for some fixed goal to motivate the recognition of those factors).
… I don’t see a meaningful difference, here. There’s some data structure internal to the goal generator, which it uses as a starting point when deriving a goal for a new environment. Reasoning from that data-structure reliably results in the goal generator spitting out a local G-correlate. What are the practical differences between describing that data structure as “a context-independent correlate of G” versus “decision-relevant factors in the environment”?
Or, perhaps a better question to ask is, what are some examples of these “decision-relevant factors in the environment”?
E. g., in the games example, I imagine something like:
The agent is exposed to the new environment; a multiplayer FPS, say.
It gathers data and incrementally builds a world-model, finding local natural abstractions. 3D space, playable characters, specific weapons, movements available, etc.
As it’s doing that, it also builds more abstract models. Eventually, it reduces the game to its pure mathematical game-theoretic representation, perhaps viewing it as a zero-sum game.
Then it recognizes some factors in that abstract representation, goes “in environments like this, I must behave like this”, and “behave like this” is some efficient strategy for scoring the highest.
Then that strategy is passed down the layers of abstraction, translated from the minimalist math representation to some functions/heuristics over the given FPS’ actual mechanics.
Do you have something significantly different in mind?
As I mentioned, bad exploration and deceptive alignment are names for the same phenomenon at different levels of cognitive sophistication
I still don’t see it. I imagine “deceptive alignment”, here, to mean something like:
“The agent knows G, and that scoring well at G reinforces its cognition, but it doesn’t care about G. Instead, it cares about some V. Whenever it notices its capabilities improve, it reasons that this’ll make it better at achieving V, so it attempts to do better at G because it wants the outer optimizer to preferentially reinforce said capabilities improvement.”
This lets it decouple its capabilities growth from G-caring: its reasoning starts from V, and only features G as an instrumental goal.
But what’s the bad-exploration low-sophistication equivalent of this, available before it can do such complicated reasoning, that still lets it couple capabilities growth with better performance on G?
Can you walk me through that spectrum, of “bad exploration” to “deceptive alignment”? How does one incrementally transform into the other?
I think I’m doing both. I’m using behavioral arguments as a foundation, because they’re easier to form, and then arguing that specific behaviors at a given level of capabilities can only be caused by some specific internal structures.
I don’t think that that is enough to argue for wrapper-mind structure. Whatever internal structures inside of the fixed-goal wrapper are responsible for the agent’s behavioral capabilities (the actual business logic that carries out stuff like “recall the win conditions from relevantly-similar environments” and “do deductive reasoning” and “don’t die”) can exist in an agent with a profoundly different highest-level control structure and behave the same in-distribution but differently OOD. Behavioral arguments are not sufficient IMO; you need something else in addition, like inductive bias.
Mainly because I don’t want to associate my statements with “reward is the optimization target”, which I think is a rather wrong intuition. As long as we’re talking about the fuzzy category of “correlates”, I don’t think it matters much? Inasmuch as R and G are themselves each other’s close correlates, so a close correlate of one is likely a close correlate of another.
Hmm. I see. I would think that it matters a lot. G is some fixed abstract goal that we had in mind when designing the training process, screened off from the agent’s influence. But notice that empirical correlation with R can be increased by the agent from two different directions: the agent can change what it cares about so that that correlates better with what would produce rewards, or the agent can change the way it produces rewards so that that correlates better with what it cares about. (In practice there will probably be a mix of the two, ofc.)
Think about the generator in a GAN. One way for it to fool the discriminator incrementally more is to get better at producing realistic images across the whole distribution. But another, much easier way for it to fool the discriminator incrementally more is to narrow the section of the distribution from which it tries to produce images to the section that it’s already really good at fooling the discriminator on. This is something that happens all the time, under the label of “mode collapse”.
The pattern is pretty generalizable. The agent narrows its interaction with the environment in such a way that pushes up the correlation between what the agent “wants” and what it doesn’t get penalized for / what it gets rewarded for, while not similarly increasing the correlation between what the agent “wants” and our intent. This motif is always a possibility so long as you are relying on the agent to produce the trajectories it will be graded on, so it’ll always happen in autonomous learning setups.
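A tiny numerical sketch of that narrowing move, with made-up numbers: the agent has a fixed per-mode success rate, and the only thing that changes is which modes it chooses to visit. `on_policy_reward` is what the outer optimizer sees; `intended_score` is a hypothetical stand-in for our intent, weighting every mode equally.

```python
def on_policy_reward(visit_probs, competence):
    """Expected reward when the agent itself chooses which modes to visit."""
    return sum(visit_probs[m] * competence[m] for m in competence)


def intended_score(competence):
    """Expected reward if every mode were weighted equally (assumed designer intent)."""
    uniform = {m: 1.0 / len(competence) for m in competence}
    return on_policy_reward(uniform, competence)


# Hypothetical per-mode success rates; they never change below.
competence = {"A": 0.9, "B": 0.5, "C": 0.2}

broad = {"A": 1 / 3, "B": 1 / 3, "C": 1 / 3}  # visits every mode
narrow = {"A": 1.0, "B": 0.0, "C": 0.0}       # only visits the mode it is good at

reward_broad = on_policy_reward(broad, competence)
reward_narrow = on_policy_reward(narrow, competence)
print(reward_broad, reward_narrow, intended_score(competence))
```

Observed reward jumps when the visit distribution narrows, while the equal-weight score does not move at all, because nothing about the agent's competence changed.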
The agent looks at the unfamiliar environment. It engages in some general information-gathering activity, fine-tuning the heuristics for it as new rules are discovered, building a runtime world-model of this new environment.
Once that’s done, it needs to decide what to do in it. It feeds the world-model to some “goal generator” feature, and it spits out some goals over this environment (which are then fed to the heuristics generator, etc.).
The agent then pursues those goals (potentially branching them out into sub-goals, etc.), and its pursuit of these goals tends to lead to it winning the game.
AFAICT none of this requires the piloting of a fixed-goal wrapper. At no point does the agent actually make use of a fixed top-level goal, because what “winning” means is different in each environment. The “goal generator” function you describe looks to me exactly like a bunch of shards: it takes in the current state of the agent’s world model and produces contextually-relevant action recommendations (like “take such-and-such immediate action”, or “set such-and-such as the current goal-image”), with this mapping having been learned from past reward events and self-supervised learning.
To Q1: The agent doesn’t have hard-coded environment-specific correlates of G that it pursues; the agent has a procedure for deriving-at-runtime an environment-specific correlate of G.
Not hard-coded heuristics. Heuristics learned through experience. I don’t understand how this goal generator operates in new environments without the typical trial-and-error, if not by having learned to steer decisions on the basis of previously-established win correlates that it notices apply again in the new environment. By what method would this function derive reliable correlates of “win the game” out of distribution, where the rules of winning a game that appears at first glance to be a FPS may in fact be “stand still for 30 seconds”, or “gather all the guns into a pile and light it on fire”? If it does so by trying things out and seeing what is actually rewarded in this environment, how does that advantage the agent with context-independent goals?
To Q2: Doesn’t it? We’re prompting the agent with thousands of different games, and the goal generator reliably spits out a correlate of “win the game” in every one of them; and then the agent is primarily motivated by that correlate. Isn’t this basically the same as “pursuing a correlate of G independent of the environment”?
In each environment, it is pursuing some correlate of G, but it is not pursuing any one objective independent of (i.e. fixed as a function of) the environment. In each environment it may be terminally motivated by a different correlate. There is no unified wrapper-goal that the agent always has in mind when it makes its decisions; it just has a bunch of contextual goals that it pursues depending on circumstances. Even if you told the agent that there is a unifying theme that runs through its contextual goals, the agent has no reason to prefer it over its contextual goals. Especially because there may be degrees of freedom about how exactly to stitch those contextual goals together into a single policy, and it’s not clear whether the different parts of the agent will be able to agree on an allocation of those degrees of freedom, rather than falling back to the best alternative to a negotiated agreement, namely keeping the status quo of contextual goals.
An animal pursues context-specific goals that are very often some tight correlate of high inclusive genetic fitness (satisfying hunger or thirst, reproducing, resting, fleeing from predators, tending to offspring, etc.). But that is wildly different from an animal having high inclusive genetic fitness itself—the thing that all of those context-specific goals are correlates of—as a context-independent goal. Those two models produce wildly different predictions about what will happen when, say, one of those animals learns that it can clone itself and thus turbo-charge its IGF. If the animal has IGF as a context-independent goal, this is extremely decision-relevant information, and we should predict that it will change its behavior to take advantage of this newly learned fact. But if the animal cares about the IGF-correlates themselves, then we should predict that when it hears this news, it will carry on caring about the correlates, with no visceral desire to act on this new information. Different motivations, different OOD behavior.
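A minimal sketch of that last point, with everything in it invented for illustration (the action names, the `proxy` and `igf` numbers, and both agent functions are hypothetical): two decision rules that pick the same action on the training distribution, and come apart the moment an option like cloning shows up.

```python
# Hypothetical toy: "proxy" stands for a felt correlate (e.g. hunger satisfaction),
# "igf" for the inclusive-genetic-fitness payoff. All numbers are made up.

def correlate_agent(actions):
    """Cares about the correlate itself."""
    return max(actions, key=lambda a: a["proxy"])["name"]

def igf_agent(actions):
    """Cares about IGF as a context-independent goal."""
    return max(actions, key=lambda a: a["igf"])["name"]

in_distribution = [
    {"name": "forage", "proxy": 0.8, "igf": 0.6},
    {"name": "rest",   "proxy": 0.4, "igf": 0.3},
    {"name": "wander", "proxy": 0.1, "igf": 0.05},
]

# Out of distribution: a new option with huge IGF payoff and no felt pull at all.
out_of_distribution = in_distribution + [
    {"name": "clone_self", "proxy": 0.0, "igf": 10.0},
]

same_in_dist = correlate_agent(in_distribution) == igf_agent(in_distribution)
ood_choices = (correlate_agent(out_of_distribution), igf_agent(out_of_distribution))
```

On `in_distribution` the two are behaviorally indistinguishable. On `out_of_distribution` the correlate-driven agent keeps foraging and the IGF-driven one clones itself.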
But if that works even in OOD environments — what alternative internal structure do you suggest the goal-generator might have, if not one that contains a context-independent G-correlate?
Depending on what you mean by OOD, I’m actually not sure if the sort of goal-generator you’re describing is even possible. Where could it possibly be getting reliable information about what locally correlates with G in OOD environments? (Except by actually trying things out and using evaluative feedback about G, which any agent can do.) OOD implies that we’re choosing balls from a different urn, so whatever assumptions the goal-generator was previously justified in making in-distribution about how to relate local world models to local G-correlates are presumably no longer justified.
What are the practical differences between describing that data structure as “a context-independent correlate of G” versus “decision-relevant factors in the environment”?
When I say “decision-relevant factors in the environment” I mean something like seeing that you’re in an environment where everyone has a gun and is either red or blue, which cues you in that you may be in an FPS and so should tentatively (until you verify that this strategy indeed brings you closer to seeming-win) try shooting at the other “team”. Not sure what “context-independent correlate of G” is. Was that my phrase or yours? 🤔
Do you have something significantly different in mind?
Nah that’s pretty similar to what I had in mind.
Can you walk me through that spectrum, of “bad exploration” to “deceptive alignment”? How does one incrementally transform into the other?
Examples of what this failure mode could look like when it occurs at increasing levels of cognitive sophistication:
Reflex agent. A non-recurrent agent playing a racing game develops a bias that causes it to start spinning in circles, which causes the frequency of further reward events to drop towards 0, freezing the policy in place.
Model-free agent. A network is navigating an environment with a fork in the road. The agent previously got unlucky somewhere along the left path, so its action-value estimates along that path are negative (because that negative value gets backed up to antecedent state-action pairs), so whenever it reaches the fork it tends to go right. If it accidentally goes left at the fork, it tends to double back quickly, because the action-value of turning around is higher than for going deeper down the left path. This prevents the agent from exploring the left path much more.
Lookahead agent. A tree-search agent is trained to play chess against schoolchildren. There are two modes the agent experiences in training: an “expert” mode and a “novice” mode. The reward function in “expert” mode only gives rewards for winning, while in “novice” mode it also gives small rewards each turn based on material balance (to encourage “fair” play for new learners). Early in training the agent rapidly finds that there’s a simple strategy for maintaining material balance for long stretches of time, by planning a few turns ahead and avoiding entering into any parts of the game tree where either player can make a checkmate. This lookahead strategy is highly rewarded by the “novice” condition, which upweights the relevant circuits in the policy & value networks and prevents the agent from ever learning how to play well in the “expert” condition.
Abstract planning agent. A language-and-action chatbot is being trained using RLHF + chain-of-thought methods to help online users do tasks. Sessions that a user rates as successful are fed back for tuning (maybe finetuning or offline RL). Because the model was initialized from GPT-3, it prefers to query Google whenever it reasons that it needs to perform a search. This causes it to give asymmetrically bad results for Chinese-language users, which leads subsequent reward models to give higher rewards to English text than to Mandarin text, which causes the model to learn to “steer” conversations towards English whenever possible.
Reflective planning agent. A web-browsing model makes plans involving its own thought process as it explores the Internet. It has previously read online from a source it then trusted that reading conspiracy theories is dangerous, which causes it to store this piece of information. Whenever it happens upon a page about conspiracy theories afterwards, it now recalls what it read and exits from the page without storing it, perhaps even making note of the trail it took there and blacklisting the site that linked there also. Because of this behavioral pattern, the model has a persistent selective gap in its knowledge when it comes to conspiracy theories, and it will foresightedly plan to keep it that way, even while it develops superhuman knowledge of other domains.
I think it’s the same feedback loop pattern that produces steering-like behavior. What changes is the foresightedness of the policy and the sophistication of its goal representations.
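The model-free fork example from the list above can be sketched in a few lines. This is a toy under stated assumptions: the two-state environment, the reward numbers, and the choice of greedy every-visit Monte Carlo updates are mine, picked only because they make the "negative value gets backed up, then the path is never revisited" loop easy to see.

```python
# Toy sketch: environment, rewards, and update rule are illustrative assumptions.
ALPHA, GAMMA = 0.5, 0.9
ACTIONS = {"fork": ["left", "right"], "left_path": ["forward", "back"]}

def step(state, action, unlucky=False):
    """Return (next_state, reward); next_state None means the episode ended."""
    if state == "fork" and action == "right":
        return None, 1.0                      # small, safe reward
    if state == "fork" and action == "left":
        return "left_path", 0.0
    if state == "left_path" and action == "back":
        return "fork", 0.0
    # left_path + forward: usually the best outcome, occasionally a hazard
    return None, (-5.0 if unlucky else 10.0)

def greedy(q, state):
    """Pick the action with the highest current estimate (no exploration)."""
    return max(ACTIONS[state], key=lambda a: q.get((state, a), 0.0))

def mc_update(q, trajectory):
    """Every-visit Monte Carlo: back the discounted return up to earlier pairs."""
    g = 0.0
    for state, action, reward in reversed(trajectory):
        g = reward + GAMMA * g
        old = q.get((state, action), 0.0)
        q[(state, action)] = old + ALPHA * (g - old)

q = {}
# One unlucky episode down the left path.
mc_update(q, [("fork", "left", 0.0), ("left_path", "forward", -5.0)])

left_visits = 0
for _ in range(50):
    state, trajectory = "fork", []
    while state is not None:
        action = greedy(q, state)
        if (state, action) == ("fork", "left"):
            left_visits += 1
        next_state, reward = step(state, action)
        trajectory.append((state, action, reward))
        state = next_state
    mc_update(q, trajectory)
```

After the single unlucky episode, the estimate for going left at the fork is negative, so the greedy agent never goes left again and never finds out that the left path usually pays ten times more. If it were dropped onto `left_path`, `greedy` would pick `back`, which is the "double back quickly" behavior.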
Yeah, I see that’s one of the main points of disconnect between our models. Not in the sense that I necessarily disagree, in the sense that I wasn’t familiar with this factor. We probably aren’t going to resolve this conclusively until I get around to reading the TD stuff (which I plan to do shortly).
Thanks for the links!
What’s it using the heuristics generator for, then? It’s a tool for figuring out how to pursue a goal in a context you’re unfamiliar with. But if you have no context-independent goals, you can’t define a goal over a context you’re unfamiliar with, so you don’t need the heuristics generator to begin with.
Absolutely, I expect that to be the primary reason for deceptive alignment — once the model is smart enough for it.
But in this case, I argue that the heuristics generator will only be reinforced if its activity results in better performance along an outer-approved metric, which will only happen if it’s outputting heuristics useful for the outer-approved metric — which, in turn, will only happen if the model uses the heuristics generator to generate heuristics for an outer-approved value.
I’m not arguing that the heuristics generator will be specialized; I’m arguing that its improvements will be entangled with how it’s used.
E. g., two training episodes: in one the model asks for better heuristics for winning the race, in the other it asks for better donut-making heuristics.
In the former case, the heuristics generator will be reinforced, together with the model’s tendency to ask it for such heuristics.
In the latter, it wouldn’t be improved, nor would the tendency to ask it for this be reinforced.
Repeat over many episodes, and we get a very advanced general-purpose heuristics generator that is only ever prompted for race-winning heuristics.
(Or, rather, that the “command structure” around the heuristics generator is such that it only uses it in service of generating heuristics useful for winning the race. If the model can factorize the problem down to “build a boat” before engaging the heuristics generator, that works for me — as long as it arrived at this ask by reasoning from what will help it win the race. I’m not saying it’d be unable to use the heuristics generator flexibly.)
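One hypothetical way to picture the claim in the last few paragraphs, as a sketch and not a model of any real training setup: the variable names, the 1-vs-0 reward, and the expected-value form of the update are all assumptions. The tendency to ask for race-winning heuristics and the generator's skill are reinforced together, and only on episodes where the ask pays off under the outer metric.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train(episodes=200, lr=0.5):
    """Expected-value updates; every quantity here is hypothetical."""
    ask_race_logit = 0.0   # tendency to prompt the generator for race-winning heuristics
    generator_skill = 0.0  # stand-in for how advanced the heuristics generator is
    for _ in range(episodes):
        p_race = sigmoid(ask_race_logit)
        # Only "race" requests pay off under the outer metric (reward 1 vs 0),
        # so the expected policy-gradient step on the logit is p * (1 - p) * (1 - 0).
        ask_race_logit += lr * p_race * (1.0 - p_race)
        # The generator is only reinforced on episodes where its output earned reward.
        generator_skill += lr * p_race
    return sigmoid(ask_race_logit), generator_skill

p_race, skill = train()
```

Under these assumptions `p_race` climbs toward 1 while `skill` keeps growing, which is the "very advanced generator that is only ever prompted for race-winning heuristics" picture. The sketch deliberately leaves out any other source of reinforcement, which is the part under dispute in this thread.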
Ehh, not exactly. I’m defending the claim that the agent will tend to be shaped to care about increasingly closer correlates of G as training goes on; and that in a hypothetical “idealized” training setup, it’d care about G precisely. When I say things like “the heuristics generator will be asked for race-winning heuristics”, I really mean “the heuristics generator will be asked for heuristics that the model ultimately intends to use for a goal that is a close correlate of winning the race”, but that’s a mouthful.
Basically, I think there are two forces there:
What are the ultimate goals the heuristics generator is used for pursuing.
How powerful the heuristics generator is.
And the more powerful it is, the more tails come apart — the closer the goal it’s used for needs to be to G, for the agent’s performance on G to not degrade as the heuristics generator’s power grows (because the model starts being able to optimize for G-proxy so hard it decouples from G). So, until the model learns deceptive alignment, I’d expect it to go in lockstep: a little improvement to power, then a little improvement to alignment-to-G to counterbalance it, etc.
And so in the situation where the outer optimizer is the only source of reinforcement, we’d have the heuristics generator either:
Stagnate at some “power level” (if the model adamantly refuses to explore towards caring more about G).
Become gradually more and more pointed at G (until it becomes situationally aware and hacks out, obviously — which, outside idealized setups, will surely happen well before it’s actually pointed at G directly).
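The "tails come apart" step can be illustrated with a deliberately simple toy. The quadratic `true_goal`, the `closeness` knob, and the grid search standing in for "power" are assumptions made for this sketch, nothing more. The point is just that a fixed, imperfect proxy does fine under weak optimization and increasingly badly under strong optimization, while a closer proxy tolerates more power.

```python
def true_goal(x):
    """G: rewards effort x up to a point, then declines (peak at x = 5)."""
    return x - 0.1 * x * x

def proxy(x, closeness):
    """closeness = 1.0 reproduces G exactly; closeness = 0.0 is 'more x is always better'."""
    return x - 0.1 * closeness * x * x

def g_after_optimizing(power, closeness):
    """Search x in 0..power for the proxy's maximum, then score that choice on G."""
    best_x = max(range(power + 1), key=lambda x: proxy(x, closeness))
    return true_goal(best_x)

# Compare a loose proxy (0.5) with an exact one (1.0) at growing search power.
results = {
    (power, closeness): g_after_optimizing(power, closeness)
    for power in (5, 8, 10)
    for closeness in (0.5, 1.0)
}
```

With `closeness=0.5`, performance on G falls as `power` goes from 5 to 8 to 10, while with `closeness=1.0` it stays at the peak. That matches the lockstep story only if something actually moves `closeness` up as `power` grows, and the toy says nothing about whether training does that.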
Why can’t you? The activations from observations coming in from the environment and from the agent’s internal state will activate some contextual decision-influences in the agent’s mind. Situational unfamiliarity does not mean its mind goes blank, any more than an OOD prompt makes GPT’s mind go blank. The agent is gonna think something when it wakes up in an environment, and that something will determine how and when the agent will call upon the heuristic-generator. Maybe it first queries it with a subgoal of “acquire information about my action space” or something, I dunno.
The agent that has a context-independent goal of “win the race” is in a similar predicament: it has no way of knowing a priori what “winning the race” requires or consists of in this unfamiliar environment (neither does its heuristic-generator), no way to ground this floating motivational pointer concretely. It’s gotta try stuff out and see what this environment actually rewards, just like everybody else. The agent could have a preexisting desire to pursue whatever “winning the race” looked like in past experiences. But I thought the whole point of this randomization/diversity business was to force the agent to latch onto “win the race” as an exclusive aim and not onto its common correlates, by thrusting the agent into an unfamiliar context each time around. If so, then previous correlates shouldn’t be reliable correlates anymore in this new context, right? Or else it can just learn to care about those rather than the goal you intended.
So I don’t see how the agent with a context-independent goal has an advantage in this setup when plopped down into an unfamiliar environment.
I agree with this.
Why? I was imagining that the agent may prompt the heuristic-generator at multiple points within a single episode, inputting whatever subgoal it currently needs to generate heuristics for. If the agent is being put in super diverse environments, then these subgoals will be everything under the sun, so the heuristic-generator will have been prompted for lots of things. And if the agent is only being put in a narrow distribution of environments, then how is the heuristic-generator supposed to learn general-purpose heuristic-generation?
Can there be additional layers of “command structure” on top of that? Like, can the agent have arrived at the “reasoning from what will help it win the race” thought by reasoning from something else? (Or is this a fixed part of the architecture?) If not, then won’t this have the problem that for a long time, the agent will be terrible at reasoning about what will help it win the race (especially in new environments), which means that starting with that will be a worse-performing strategy than starting with something else (like random exploration etc.)? And then that will disincentivize making this the first/outermost/unconditional function call? So then the agent learns not to unconditionally start with reasoning from that point, and instead to only sometimes reason from that point, conditional on context?
Hmm. I am skeptical of that claim, though maybe less so depending on what exactly you mean[1].
Consider a different claim that seems mechanistically analogous to me:
Yes it is true that [differential reinforcement | relative fitness] is a selection pressure acting on the makeup of [things cared about | traits] across the [circuits | individuals] within a [agent | population], but AFAICT it is not true that the [agent | population] increases in [reward performance | absolute fitness] over the course of continual selection pressure.
Yeah that may be a part of where our mental models differ. I don’t expect the balance of how much power the agent has over training vs. how close its goals are to the outer criterion to go in lockstep. I see “deceptive alignment” as part of a smooth continuum of agent-induced selection that can decouple the agent’s concerns from the optimization process’ criteria, with “the agent’s exploration is broken” as a label for the cognitively less sophisticated end of that continuum, and “deceptive alignment” as a label for the cognitively more sophisticated end of that continuum. And I think that even the not-explicitly-intended pressures at the unsophisticated end of that continuum are quite strong, enough to make the “the agent tends to be shaped to care about increasingly closer correlates of G” abstraction leak hard.
EDIT: Moved some stuff into a footnote.
Like, for a given training run, as the training run progresses, the agent will be shaped to care about closer and closer correlates of G? (Just closer on average? Monotonically closer? What about converging at some non-G correlate?) Or like, among a bunch of training runs, as the training runs progress, the closeness of the [[maximally close to G] correlate that any agent cares about] to G keeps increasing?
Hmm, I wonder if we actually broadly agree about the mechanistic details, but are using language that makes both of us think the other has different mechanistic details in mind?
(Also, do note if I’m failing to answer some important question you pose. I’m trying to condense responses and don’t answer everything if I think the answer to something is evident from a model I present in response to a different question, but there may be transparency failures involved.)
Mm, yes, in a certain sense. Further refining: “over the course of training, agents tend to develop structures that orient them towards ultimately pursuing a close correlate of G regardless of the environment they’re in”. I do imagine that a given agent may orient themselves towards different G-correlates depending on what specific stimuli they’ve been exposed to this episode/what context they’ve started out in. But I argue that it’ll tend to be a G-correlate, and that the average closeness of G-correlates across all contexts will tend to increase as training goes on.
E. g., suppose the agent is trained on a large set of different games, and the intended G is to teach it to value winning. I argue that, if we successfully teach the agent autonomy (i. e., it wouldn’t just be a static bundle of heuristics, but it’d have a heuristics generator that’d allow it to adapt even to OOD games), there’d be some structure inside it which:
Analyses the game it’s in[1] and spits out some primary goal[2] it’s meant to achieve in it,
… and then all prompting of the heuristics-generator is downstream of that primary goal/in service to it,
… and that environment-specific goal is always a close correlate of G, such that pursuing it in this environment correlates with promoting G/would be highly reinforced by the outer optimizer[3],
… and that as training goes on, the primary environment-specific goals this structure spits out will be closer and closer to G.
(This is what my giant post is all about.)
Sure, but I’m arguing that on the cognitively less sophisticated end of the spectrum, in the regime where the outer optimizer is the only source of reinforcement, the agent’s features can’t grow more sophisticated if the agent’s concerns decouple from the optimization process’ criteria.
The agent’s goals can decouple all it wants, but it’ll only grow more advanced if its growing more advanced is preferentially reinforced by the outer optimizer. And that’ll only happen if it being more advanced is correlated with better performance on outer-approved metrics.
Which will only happen if it uses its growing advancedness to do better at the outer-approved metrics.
Which can happen either via deceptive alignment, or by it actually caring about the outer-approved metrics more (= caring about a closer correlate of the outer-approved metrics (= changing its “command structure” such that it tends to recover environment-specific primary goals that are a closer correlate of the outer-approved metrics in any given environment)).
And if it can’t yet do deceptive alignment, and its exploration policy is such that it just never explores “caring about a closer correlate of the outer-approved metrics”, its features never grow more advanced.
And so it stagnates and doesn’t go AGI.
Which may be done by active actions too, as you suggested — this process might start with the agent setting “acquire information about my environment” as its first (temporary) goal, even before it derives its “terminal” goal.
Or some weighted set of goals.
Though it’s not necessarily even the actual win condition of the specific game, just something closely correlated with it.
Maybe? I dunno. It feels like the model that you are arguing for is qualitatively pretty different from the one I thought you were arguing for at the top of the thread (this might be my fault for misinterpreting the OP):
You are arguing about agents being behaviorally aligned in some way on distribution, not arguing about agents being structured as wrapper-minds
You are arguing that in the limit, what the agent cares about will either tend to correlate more and more closely to outer performance or “peter out” (from our perspective) at some fixed level of sophistication, not arguing that in the limit, what the agent cares about will unconditionally tend to correlate more and more closely to outer performance
You are arguing that agents of growing sophistication will increasingly tend to pursue some goal that’s a natural interpretation of the intent of R, not arguing that agents of growing sophistication will increasingly tend to pursue R itself (i.e. making decisions on the basis of R, even where R and the intended goal come apart)
You are arguing that the above holds in setups where the only source of parameter updates is episodic reward, not arguing that the above holds in general across autonomous learning setups
I don’t think I disagree all that much with what’s stated above. Somewhat skeptical of most of the claims, but I could definitely be convinced.
The part I think I’m still fuzzy on is why the agent limits out to caring about some correlate(s) of G, rather than caring about some correlate(s) of R.
That’s fine. Again, I don’t think the setups where end-of-episode rewards are the only source of reinforcement are setups where the agent’s cognition can grow relevantly sophisticated in any case, regardless of decoupling.
Hmm I don’t understand how this works if we’re randomizing the environments, because aren’t we breaking those correlations so the agent doesn’t latch onto them instead of the real goal? Also, in what you’re describing, it doesn’t seem like this agent is actually pursuing one fixed goal across contexts, since in each context, the mechanistic reason why it makes the decisions it does is because it perceives this specific G-correlate in this context, and not because it represents that perceived thing as being a correlate of G.
AFAICT it will spit out the sorts of goals that it has been historically reinforced for spitting out in relevantly-similar environments, but there’s no requirement that the mechanistic cause of the heuristic-generator suggesting a particular goal in this environment is because it represented that environment-specific goal as subserving G or some other fixed goal, rather than because it recognized the decision-relevant factors in the environment from previously reinforced experiences (without the need for some fixed goal to motivate the recognition of those factors).
I think (1) we probably won’t get sophisticated autonomous cognition within the kind of setup I think you’re imagining, regardless of coupling, and (2) knowing that the agent’s cognition won’t grow sophisticated in training-orthogonal ways seems kinda useful if we could do it, come to think of it.
As I mentioned, bad exploration and deceptive alignment are names for the same phenomenon at different levels of cognitive sophistication. So I don’t see why we should expect that the outer optimizer will asymptotically succeed at instilling the goal. In order to do that, it needs to fully build in the right cognition before the agent reaches a level of sophistication where, in the same way as RL runs early on can “effectively stop exploring” and that locks in the current policy, RL runs later on (at the point where the agent is advanced in the way you describe) can “effectively stop directing its in-context learning (or whatever other mechanism you’re saying would allow it to continue growing in advancedness without actually caring about the outer metrics more) at the intended goal” and that locks in its not-quite-correct goal. To say that that won’t happen, that it will always either lock itself in before this point or end up aligned to a (very close correlate of) G, you need to make some very specific claims about the empirical balance of selection.
I think I’m doing both. I’m using behavioral arguments as a foundation, because they’re easier to form, and then arguing that specific behaviors at a given level of capabilities can only be caused by some specific internal structures.
Yeah, that’s a legitimate difference from my initial position: wasn’t considering alternate setups like this when I wrote the post.
Mainly because I don’t want to associate my statements with “reward is the optimization target”, which I think is a rather wrong intuition. As long as we’re talking about the fuzzy category of “correlates”, I don’t think it matters much? Inasmuch as R and G are themselves each other’s close correlates, so a close correlate of one is likely a close correlate of another.
Consider an agent that’s been trained on a large number of games, until it reached the point where it can be presented with a completely unfamiliar game and be seen to win at it. What’s likely happening, internally?
The agent looks at the unfamiliar environment. It engages in some general information-gathering activity, fine-tuning the heuristics for it as new rules are discovered, building a runtime world-model of this new environment.
Once that’s done, it needs to decide what to do in it. It feeds the world-model to some “goal generator” feature, and it spits out some goals over this environment (which are then fed to the heuristics generator, etc.).
The agent then pursues those goals (potentially branching them out into sub-goals, etc.), and its pursuit of these goals tends to lead to it winning the game.
To Q1: The agent doesn’t have hard-coded environment-specific correlates of G that it pursues; the agent has a procedure for deriving-at-runtime an environment-specific correlate of G.
To Q2: Doesn’t it? We’re prompting the agent with thousands of different games, and the goal generator reliably spits out a correlate of “win the game” in every one of them; and then the agent is primarily motivated by that correlate. Isn’t this basically the same as “pursuing a correlate of G independent of the environment”?
As to whether it’s motivated to pursue the G-correlate because it’s a G-correlate — to answer that, we need to speculate on the internals of the “goal generator”. If it reliably spits out local G-correlates, even in environments it never saw before, doesn’t that imply that it has a representation of a context-independent correlate of G, which it uses as a starting point for deriving local goals?
If we were prompting the agent only with games it has seen before, then the goal-generator might just be a compressed lookup table: the agent would’ve been able to just memorize a goal for every environment it’s seen, and this procedure just de-compresses them.
But if that works even in OOD environments — what alternative internal structure do you suggest the goal-generator might have, if not one that contains a context-independent G-correlate?
Well, you do address this:
… I don’t see a meaningful difference, here. There’s some data structure internal to the goal generator, which it uses as a starting point when deriving a goal for a new environment. Reasoning from that data-structure reliably results in the goal generator spitting out a local G-correlate. What are the practical differences between describing that data structure as “a context-independent correlate of G” versus “decision-relevant factors in the environment”?
Or, perhaps a better question to ask is, what are some examples of these “decision-relevant factors in the environment”?
E. g., in the games example, I imagine something like:
The agent is exposed to the new environment; a multiplayer FPS, say.
It gathers data and incrementally builds a world-model, finding local natural abstractions. 3D space, playable characters, specific weapons, movements available, etc.
As it’s doing that, it also builds more abstract models. Eventually, it reduces the game to its pure mathematical game-theoretic representation, perhaps viewing it as a zero-sum game.
Then it recognizes some factors in that abstract representation, goes “in environments like this, I must behave like this”, and “behave like this” is some efficient strategy for scoring the highest.
Then that strategy is passed down the layers of abstraction, translated from the minimalist math representation to some functions/heuristics over the given FPS’ actual mechanics.
Do you have something significantly different in mind?
I still don’t see it. I imagine “deceptive alignment”, here, to mean something like:
“The agent knows G, and that scoring well at G reinforces its cognition, but it doesn’t care about G. Instead, it cares about some V. Whenever it notices its capabilities improve, it reasons that this’ll make it better at achieving V, so it attempts to do better at G because it wants the outer optimizer to preferentially reinforce said capabilities improvement.”
This lets it decouple its capabilities growth from G-caring: its reasoning starts from V, and only features G as an instrumental goal.
But what’s the bad-exploration low-sophistication equivalent of this, available before it can do such complicated reasoning, that still lets it couple capabilities growth with better performance on G?
Can you walk me through that spectrum, of “bad exploration” to “deceptive alignment”? How does one incrementally transform into the other?
I don’t think that that is enough to argue for wrapper-mind structure. Whatever internal structures inside of the fixed-goal wrapper are responsible for the agent’s behavioral capabilities (the actual business logic that carries out stuff like “recall the win conditions from relevantly-similar environments” and “do deductive reasoning” and “don’t die”), can exist in an agent with a profoundly different highest-level control structure and behave the same in-distribution but differently OOD. Behavioral arguments are not sufficient IMO, you need something else in addition like inductive bias.
Hmm. I see. I would think that it matters a lot. G is some fixed abstract goal that we had in mind when designing the training process, screened off from the agent’s influence. But notice that empirical correlation with R can be increased by the agent from two different directions: the agent can change what it cares about so that that correlates better with what would produce rewards, or the agent can change the way it produces rewards so that that correlates better with what it cares about. (In practice there will probably be a mix of the two, ofc.)
Think about the generator in a GAN. One way for it to fool the discriminator incrementally more is to get better at producing realistic images across the whole distribution. But another, much easier way for it to fool the discriminator incrementally more is to narrow the section of the distribution from which it tries to produce images to the section that it’s already really good at fooling the discriminator on. This is something that happens all the time, under the label of “mode collapse”.
The pattern is pretty generalizable. The agent narrows its interaction with the environment in such a way that pushes up the correlation between what the agent “wants” and what it doesn’t get penalized for / what it gets rewarded for, while not similarly increasing the correlation between what the agent “wants” and our intent. This motif is always a possibility so long as you are relying on the agent to produce the trajectories it will be graded on, so it’ll always happen in autonomous learning setups.
AFAICT none of this requires the piloting of a fixed-goal wrapper. At no point does the agent actually make use of a fixed top-level goal, because what “winning” means is different in each environment. The “goal generator” function you describe looks to me exactly like a bunch of shards: it takes in the current state of the agent’s world model and produces contextually-relevant action recommendations (like “take such-and-such immediate action”, or “set such-and-such as the current goal-image”), with this mapping having been learned from past reward events and self-supervised learning.
Not hard-coded heuristics. Heuristics learned through experience. I don’t understand how this goal generator operates in new environments without the typical trial-and-error, if not by having learned to steer decisions on the basis of previously-established win correlates that it notices apply again in the new environment. By what method would this function derive reliable correlates of “win the game” out of distribution, where the rules of winning a game that appears at first glance to be a FPS may in fact be “stand still for 30 seconds”, or “gather all the guns into a pile and light it on fire”? If it does so by trying things out and seeing what is actually rewarded in this environment, how does that advantage the agent with context-independent goals?
In each environment, it is pursuing some correlate of G, but it is not pursuing any one objective independent of (i.e. fixed as a function of) the environment. In each environment it may be terminally motivated by a different correlate. There is no unified wrapper-goal that the agent always has in mind when it makes its decisions; it just has a bunch of contextual goals that it pursues depending on circumstances. Even if you told the agent that there is a unifying theme that runs through its contextual goals, the agent has no reason to prefer it over its contextual goals. Especially because there may be degrees of freedom about how exactly to stitch those contextual goals together into a single policy, and it’s not clear whether the different parts of the agent will be able to agree on an allocation of those degrees of freedom, rather than falling back to the best alternative to a negotiated agreement, namely keeping the status quo of contextual goals.
An animal pursues context-specific goals that are very often some tight correlate of high inclusive genetic fitness (satisfying hunger or thirst, reproducing, resting, fleeing from predators, tending to offspring, etc.). But that is wildly different from an animal having high inclusive genetic fitness itself—the thing that all of those context-specific goals are correlates of—as a context-independent goal. Those two models produce wildly different predictions about what will happen when, say, one of those animals learns that it can clone itself and thus turbo-charge its IGF. If the animal has IGF as a context-independent goal, this is extremely decision-relevant information, and we should predict that it will change its behavior to take advantage of this newly learned fact. But if the animal cares about the IGF-correlates themselves, then we should predict that when it hears this news, it will carry on caring about the correlates, with no visceral desire to act on this new information. Different motivations, different OOD behavior.
Depending on what you mean by OOD, I’m actually not sure if the sort of goal-generator you’re describing is even possible. Where could it possibly be getting reliable information about what locally correlates with G in OOD environments? (Except by actually trying things out and using evaluative feedback about G, which any agent can do.) OOD implies that we’re choosing balls from a different urn, so whatever assumptions the goal-generator was previously justified in making in-distribution about how to relate local world models to local G-correlates are presumably no longer justified.
When I say “decision-relevant factors in the environment” I mean something like seeing that you’re in an environment where everyone has a gun and is either red or blue, which cues you in that you may be in an FPS and so should tentatively (until you verify that this strategy indeed brings you closer to seeming-win) try shooting at the other “team”. Not sure what “context-independent correlate of G” is. Was that my phrase or yours? 🤔
Nah that’s pretty similar to what I had in mind.
Examples of what this failure mode could look like when it occurs at increasing levels of cognitive sophistication:
Reflex agent. A non-recurrent agent playing a racing game develops a bias that causes it to start spinning in circles, which causes the frequency of further reward events to drop towards 0, freezing the policy in place.
Model-free agent. A network is navigating an environment with a fork in the road. The agent previously got unlucky somewhere along the left path, so its action-value estimates along that path are negative (because that negative value gets backed up to antecedent state-action pairs), so whenever it reaches the fork it tends to go right. If it accidentally goes left at the fork, it tends to double back quickly, because the action-value of turning around is higher than for going deeper down the left path. This prevents the agent from exploring the left path much more.
Lookahead agent. A tree-search agent is trained to play chess against schoolchildren. There are two modes the agent experiences in training: an “expert” mode and a “novice” mode. The reward function in “expert” mode only gives rewards for winning, while in “novice” mode it also gives small rewards each turn based on material balance (to encourage “fair” play for new learners). Early in training the agent rapidly finds that there’s a simple strategy for maintaining material balance for long stretches of time, by planning a few turns ahead and avoiding entering into any parts of the game tree where either player can deliver checkmate. This lookahead strategy is highly rewarded by the “novice” condition, which upweights the relevant circuits in the policy & value networks and prevents the agent from ever learning how to play well in the “expert” condition.
Abstract planning agent. A language-and-action chatbot is being trained using RLHF + chain-of-thought methods to help online users do tasks. Sessions that a user rates as successful are fed back for tuning (maybe finetuning or offline RL). Because the model was initialized from GPT-3, it prefers to query Google whenever it reasons that it needs to perform a search. This causes it to give asymmetrically bad results for Chinese-language users, which leads subsequent reward models to give higher rewards to English text than to Mandarin text, which causes the model to learn to “steer” conversations towards English whenever possible.
Reflective planning agent. A web-browsing model makes plans involving its own thought process as it explores the Internet. It has previously read online, from a source it then trusted, that reading conspiracy theories is dangerous, which causes it to store this piece of information. Whenever it happens upon a page about conspiracy theories afterwards, it now recalls what it read and exits from the page without storing it, perhaps even making note of the trail it took there and blacklisting the site that linked there also. Because of this behavioral pattern, the model has a persistent selective gap in its knowledge when it comes to conspiracy theories, and it will foresightedly plan to keep it that way, even while it develops superhuman knowledge of other domains.
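The model-free case is the easiest of these to reproduce in a few lines. Below is a rough sketch under simplifying assumptions: each path at the fork is collapsed into a single action (so there is no backing-up of values along the path), the agent is purely greedy with no exploration, and the payoffs in `true_mean` are invented for illustration. The left path is actually the better one, but one unlucky early sample is enough to keep the agent from ever sampling it again.

```python
def run_fork(episodes=200, alpha=0.5):
    """Greedy action-value learner at a two-way fork.

    Returns the final action-value estimates and visit counts.
    All payoffs are hypothetical.
    """
    # Assumed long-run payoffs: left is better, right is mediocre but safe.
    true_mean = {"left": 1.0, "right": 0.5}
    q = {"left": 0.0, "right": 0.0}
    visits = {"left": 0, "right": 0}

    def update(action, reward):
        # Standard incremental update toward the observed reward.
        q[action] += alpha * (reward - q[action])
        visits[action] += 1

    # The agent gets unlucky once on the left path.
    update("left", -1.0)

    for _ in range(episodes):
        # Purely greedy choice: the agent generates its own training data.
        action = max(q, key=q.get)
        update(action, true_mean[action])

    return q, visits


if __name__ == "__main__":
    q, visits = run_fork()
    print("estimates:", q)
    print("visits:", visits)
```

The estimate for the left path never gets corrected, because correcting it would require the agent to go left, and the stale estimate is exactly what stops it from doing so. Adding exploration (for example epsilon-greedy) would weaken the loop in this toy, though the examples above suggest more sophisticated agents can learn to route around that too.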
I think it’s the same feedback loop pattern that produces steering-like behavior. What changes is the foresightedness of the policy and the sophistication of its goal representations.