Say, the agent keeps on getting less reward by using its heuristic than it would have if it were using something close to R, but since it isn’t taking the actions that lead to that higher reward, it keeps getting the existing heuristic reinforced by the small rewards
Fair point. I’m more used to thinking in terms of SSL, not RL, so I sometimes forget to account for the exploration policy. (Although now I’m tempted to say that any AGI-causing exploration policy would need to be fairly curious (to, e. g., hit upon weird strategies like “invent technology”), so it would tend to discover such opportunities more often than not.)
But even if there aren’t always gradients towards maximally-R-promoting behavior, why would—
the abstraction of “selection processes only select traits that serve the selection criterion” is incredibly leaky
—there be gradients towards behavior that decreases performance on R or is orthogonal to R, as you seem to imply here? Why would that kind of cognition be reinforced?
As we’re talking about building autonomous agents, I’m generally imagining that training includes some substantial part where the agent is autonomously making choices that have consequences on what training data/feedback it gets afterwards. (I don’t particularly care if this is “RL” or “online SL” or “iterated chain-of-thought distillation” or something else.) A smart agent in the real world must be highly selective about the manner in which it explores, because most ways of exploring don’t lead anywhere fruitful (wandering around in a giant desert) or lead to dead ends (walking off a cliff).
But even if there aren’t always gradients towards maximally-R-promoting behavior, why would [...] there be gradients towards behavior that decreases performance on R or is orthogonal to R, as you seem to imply here? Why would that kind of cognition be reinforced?
There need not be outer gradients towards that behavior. Two things interact to determine what feedback/gradients are actually produced during training:
1. The selection criterion
2. The agent and its choices/computations
Backpropagation kinda weakly has this feature, because we take the derivative of the function at the argument of the function, which means that if the model’s computational graph has a branch, we only calculate gradients based on the branch that the model actually went down for the batch example(s). RL methods naturally have this feature, as the policy determines the trajectories which determine the empirical returns which determine the updates. Chain-of-thought training methods should have this feature too, because presumably the network decides exactly what chain-of-thought it produces, which determines what chains-of-thought are available for feedback.
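To make the “the policy determines the trajectories which determine the updates” point concrete, here’s a toy REINFORCE-style sketch (hypothetical rewards and a single binary choice; the names and numbers are made up, and this obviously isn’t a claim about any real training setup):

```python
import math, random

# Toy sketch: REINFORCE only ever updates on the branch the policy actually
# sampled, so the larger reward on the unvisited branch exerts no pull until
# that branch happens to get explored.

REWARD = {"lagoon_donut": 0.1, "race_line": 1.0}   # hypothetical reward table

policy_logit = 6.0   # the agent starts out strongly preferring the lagoon
lr = 0.3

for step in range(200):
    p_donut = 1.0 / (1.0 + math.exp(-policy_logit))
    branch = "lagoon_donut" if random.random() < p_donut else "race_line"
    reward = REWARD[branch]

    # d/d_logit of log pi(taken branch), for a Bernoulli policy:
    grad_logp = (1.0 - p_donut) if branch == "lagoon_donut" else -p_donut
    policy_logit += lr * reward * grad_logp   # only the sampled branch contributes

print(policy_logit)  # usually still strongly donut-preferring: the race line's
                     # larger reward almost never appeared in any update
```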
“The agent not exploring in some particular way” is one of many possible examples of how the effect of 1&2 can be radically different from the theoretical effect of 1 alone. These dynamics make it possible for the agent to develop in directions orthogonal or even contrary to the R selection pressure, because by default the agent itself is exercising selection too, possibly much more strongly than the outer optimizer is.
These dynamics make it possible for the agent to develop in directions orthogonal or even contrary to the R selection pressure, because by default the agent itself is exercising selection too, possibly much more strongly than the outer optimizer is.
Can you provide a short concrete example, to check that I’m picturing this right?
Sure thing. Three concrete examples, in order of increasing weirdness:
Early in training the CoastRunners boat accidentally does a donut in the lagoon. That makes it score points and get a reward. That reinforces the donut behavior. That prevents it from reaching the finish line with any regularity, which selects against game-completing behavior.
We take a pretrained MuZero chess policy and try to tune it with a reward function that outputs 1 whenever the model sends its king out unprotected directly into the line of enemy pawns and 0 otherwise. But our “selection” based on this reward function is ineffectual even when using MuZero’s advanced model-based RL algorithms. This is because the sampled rollouts guided by that policy never reach the rewarded state (notice that even without the bias from the policy, any attempted remedy will have to deal with hitting on a narrow part of the exponentially large space of rollouts) so the value function can’t update on it. This is because the policy is steering away from the precursors of that state. This is because that policy previously learned to select actions that protect the king and actions that keep it far away from the center of the board.
OpenAI uses a galaxy-brained form of chain-of-thought tuning to train GPT-7, wherein they have the model first annotate a training datapoint with contextually-relevant information that it retrieves from a read-write database and then stores the generated outputs back into the database. Because of the ordering in the training datapoints, the model early on learns a tendency that causes it to more frequently retrieve the Navy Seal copypasta. This causes the model to progressively learn to ignore the input it is annotating and biases it towards generating more Navy Seal copypasta-like outputs. This selects against all other patterns of behavior; GPT-7 is very clever at carrying out its desires, so it doesn’t unlearn the behavior even if you give it an explicit instruction like “do not use any copypasta” (maybe it understands perfectly well what you mean but instead adds text like “<|endoftext|> # Navy Seal Copypasta”) or if you add a filter to check and discard outputs that contain the word “Navy”. The model’s learned tendencies chain into themselves across computational steps and reinforce themselves into an unintended-by-us fixed point.
Okay, suppose we have a “chisel” that’s more-or-less correctly shaped around some goal G that’s easy to describe in terms of natural abstractions. In CoastRunners, it would be “win the race”[1]; with MuZero, “win the game”; with GPT-N, something like “infer the current scenario and simulate it” or “pretend to be this person”. I’d like to clarify that this is what I meant by R — I didn’t mean that in the limit of perfect training, agents would become wireheads, I meant they’d be correctly aligned to the natural goal G implied by the reinforcement schedule.
The “easiness of description” of G in terms of natural abstractions is an important variable. Some reinforcement schedules can be very incoherent, e. g. rewarding winning the race in some scenarios and punishing it in others, purely based on the presence/absence of some random features in each scenario. In this case, the shortest description of the reinforcement schedule is just “the reinforcement function itself” — that would be the implied G.
It’s not completely unrealistic, either — the human reward circuitry is varied enough that hedonism is a not-too-terrible description of the implied goal. But it’s not a central example in my mind. Inasmuch as there’s some coherence to the reinforcement schedule, I expect realistic systems to arrive at what humans may arrive at — a set of disjunct natural goals G1∧G2∧...∧Gn implicit in the reinforcement schedule.
Now, to get to AGI, we need autonomy. We need a training setup which will build a heuristics generator into the AGI, and then improve that heuristics generator until it has a lot of flexible capability. That means, essentially, introducing the AGI to scenarios it’s never encountered before[2], and somehow shaping it to pass them on the first try (= for it to do something that will get reinforced).
As a CoastRunners example, consider scenarios where the race is suddenly in 3D, or in space and the “ship” is a spaceship, or the AGI is exposed to the realistic controls of the ship instead of WASD, or it needs to “win the race” by designing the fastest ship instead of actually racing, or it’s not the pilot but it wins by training the most competent pilot, or there’s a lot of weird rules to the race now, or the win condition is weird, et cetera.
Inasmuch as the heuristics generator is aligned with the implicit goal G, we’ll get an agent that looks at the context, infers what it means to “win the race” here and what it needs to do to win the race, then starts directly optimizing for that. This is what we “want” our training to result in.
In this, we can be more or less successful along various dimensions:
The more varied the training scenarios are, the more cleanly the training shapes the agent into valuing winning the race, instead of any of the upstream correlates of that. “Win the race” would be the unifying factor across all reinforcement schedule structures in all of these contexts.
Likewise, the more coherent the reinforcement schedule is — the more it rewards actions that are strongly correlated with acting towards winning the race, instead of anything else — the more clearly it shapes the agent to value winning, instead of whatever arbitrary thing it may end up doing.
The more “adversity” the agent encounters, the more likely it is to care only about winning. If there are scenarios where it has very few resources, but just enough to win if it applies them solely to winning instead of spending them on any other goal, it will be shaped to care only about that goal, to the exclusion of (and at the expense of) everything else.
As we increase adversity and scenario diversity, the more “curious” we’ll have to make the agent’s exploration policy (to hit upon the most optimal strategies). On the flipside, we want it to have to invent creative solutions to win, as part of trying to train an AGI — so we will ramp up the adversity and the diversity. And we’d want to properly reinforce said creativity, so we’d (somehow) shape our reinforcement schedule to properly reinforce it.
Thus, there’s a correlated cluster of training parameters that increases our chances of getting an AGI: we have to put it in varied highly-adversarial scenarios to make creativity/autonomy necessary, we have to ramp up its “curiosity” to ensure it can invent creative solutions and be autonomous, and to properly reinforce all of this (and not just random behavior), we have to have a highly-coherent credit assignment system that’s able to somehow recognize the instrumental value of weird creativity and reinforce it more than random loitering around.
To get to AGI, we need a training process that focusedly improves the heuristics-generating machinery.
And by creativity’s nature of being weird, we can’t just have a “reinforce creativity” function. We’d need to have some way of recognizing useful creativity, which means identifying it to be useful to something; and as far as I can tell, that something can only be G. And indeed, this creativity-recognizing property is correlated with the reinforcement schedule’s coherency — inasmuch as R is well-described as shaped around G, it should reinforce (and not fail to reinforce) weird creativity that promotes G! Thus, we get a credit assignment system that effectively cultivates the features that’d lead to AGI (an increasingly advanced heuristics generator), but it’s done at the “cost” of making those features accurately pointed at G[3].
And these, incidentally, are the exact parameters necessary to make the training setup more “idealized”. Strictly specify G, build it into the agent, try to update away mesa-objectives that aren’t G, make it optimize for G strongly, etc.
In practice, we’ll fall short of this ideal: we’ll fail to introduce enough variance to uniquely specify winning, we’ll reinforce upstream correlates of winning and end up with an AGI that values lots of things upstream of winning, we’ll fail to have enough adversity to counterbalance this and update its other goals away, and we won’t get a perfect exploratory policy that always converges towards the actions R would reinforce the most.
But a training process’ ability to result in an AGI is anti-correlated with its distance from the aforementioned ideal.
Thus, inasmuch as we’re successful in setting up a training process that results in an AGI, we’ll end up with an agent that’s some approximation of a G-maximizing wrapper-mind.
Actually, no, apparently it’s “smash into specific objects”. How did they expect anything else to happen? Okay, but let’s pretend I’m talking about some more clearly set up version of CoastRunners, in which the simplest description of the reinforcement schedule is “when you win the race”.
More specifically, to scenarios it doesn’t have a ready-made suite of shallow heuristics for solving. It may be because the scenario is completely novel, or because the AGI did encounter it before, but it was long ago, and it got pushed out of its limited memory by more recent scenarios.
To rephrase a bit: The heuristics generator will be reinforced more if it’s pointed at G, so a good AGI-creating training process will be set up such that it manages to point the heuristics generator at G, because only training processes that strongly reinforce the heuristics generator result in AGI. Consider the alternative: a training process that can’t robustly point the heuristics generator towards generating heuristics that lead to a lot of reinforcement, and which therefore doesn’t reinforce the heuristics generator a lot, and doesn’t preferentially reinforce it more for learning to generate incrementally better heuristics than it previously did, and therefore doesn’t cultivate the capabilities needed for AGI, and therefore doesn’t result in AGI.
Okay, suppose we have a “chisel” that’s more-or-less correctly shaped around some goal G that’s easy to describe in terms of natural abstractions. In CoastRunners, it would be “win the race”[1]; with MuZero, “win the game”; with GPT-N, something like “infer the current scenario and simulate it” or “pretend to be this person”. I’d like to clarify that this is what I meant by R — I didn’t mean that in the limit of perfect training, agents would become wireheads, I meant they’d be correctly aligned to the natural goal G implied by the reinforcement schedule.
Doesn’t this sound weird to you? I don’t think of the chisel itself being “shaped around” the intended form, but rather, the chisel is a tool that is used to shape the statue so that the statue reflects that form. The chisel does not need to be shaped like the intended form for this to work! Recall that the reinforcement schedule is not a pure function of the reward/loss calculator, it is a function of both that and the way the policy behaves over training (the thing I was describing earlier as “The agent and its choices/computations”), which means that if we only specify the outer objective R, there may be no fact of the matter about which goal is “implied” as its natural / coherent extrapolation. It’s a 2-place function and we’ve only provided 1 argument so far.
I get your point on some vibe-level. Like, humans and other animal agents can often infer what goal another agent is trying to communicate. For instance, when I’m training a dog to sit and I keep rewarding it whenever it sits but not when it lays down or stands, we can talk about how it is contextually “implied” that the dog should sit. But most of what makes this work is not that I used a reward criterion that sharply approximates some idealized sitting recognition function (it does need to bear nonzero relation to sitting); most of the work is done by the close fit between the dog’s current behavioral repertoire and the behavior I want to train, and by the fact that the dog itself is already motivated to test out different behaviors because it likes my doggie treats, and by the way in which I use rewards as a tool to create a behavior-shaping positive feedback loop.
Inasmuch as there’s some coherence to the reinforcement schedule, I expect realistic systems to arrive at what humans may arrive at — a set of disjunct natural goals G1∧G2∧...∧Gn implicit in the reinforcement schedule.
In practice I agree (I think; not quite sure if I get the disjunction bit). That is one reason I expect agents to not want to reconfigure themselves into wrapper-minds: the agent has settled on many different overlapping goals, all of which it endorses, and those goals don’t form a total preorder over outcomes that it could become a wrapper-mind pursuing.
To get to AGI, we need a training process that focusedly improves the heuristics-generating machinery.
I agree with this. For modern humans, I would say that this is provided by our evolutionary history + our many years of individual cognitive development + our schooling.
And by creativity’s nature of being weird, we can’t just have a “reinforce creativity” function. We’d need to have some way of recognizing useful creativity, which means identifying it to be useful to something; and as far as I can tell, that something can only be G. And indeed, this creativity-recognizing property is correlated with the reinforcement schedule’s coherency — inasmuch as R is well-described as shaped around G, it should reinforce (and not fail to reinforce) weird creativity that promotes G! Thus, we get a credit assignment system that effectively cultivates the features that’d lead to AGI (an increasingly advanced heuristics generator), but it’s done at the “cost” of making those features accurately pointed at G[3].
[...]
Thus, inasmuch as we’re successful in setting up a training process that results in an AGI, we’ll end up with an agent that’s some approximation of a G-maximizing wrapper-mind.
This is where I step off the train. It is not true that the only (or even the most likely) way for creativity to arise is for that creativity to be directed towards the selection criterion or to point towards the intended goal. It is not true that the only way for useful creativity to be recognized is by us. Creativity can be recognized by the agent as useful for its own goals, because the agent is an active participant in shaping the course of training. For anything that the agent might currently want, learning creativity is instrumentally valuable, and the benefits of creative heuristic-generation should transfer well between doing well according to its own aims and doing well by the outer optimization process’ criteria. Just like the benefits of creative heuristic-generation transfer well between problem solving in the savannah, problem solving in the elementary classroom, and problem solving in the workplace, because there is common structure shared between them (i.e. the world is lawful). I expect that just like humans, the agent will be improving its heuristic-generator across all sorts of sub(goals) for all sorts of reasons, leading to very generalized machinery for problem-solving in the world.
To rephrase a bit: The heuristics generator will be reinforced more if it’s pointed at G, so a good AGI-creating training process will be set up such that it manages to point the heuristics generator at G, because only training processes that strongly reinforce the heuristics generator result in AGI. Consider the alternative: a training process that can’t robustly point the heuristics generator towards generating heuristics that lead to a lot of reinforcement, and which therefore doesn’t reinforce the heuristics generator a lot, and doesn’t preferentially reinforce it more for learning to generate incrementally better heuristics than it previously did, and therefore doesn’t cultivate the capabilities needed for AGI, and therefore doesn’t result in AGI.
No, I think this is wrong as I understand it (as is the similar content in the closing paragraphs). The form of this argument looks like:
X+Y produces more Z than X alone, and you need a lot of Z to create an AGI, so a process that creates an AGI will do so through X+Y.
with
X = the agent completes a diverse / adversarial / sophisticated training process requiring it to do well at generating heuristics
Y = the agent’s heuristic-generator is terminally pointed at G
Z = amount of total reinforcement accrued to the heuristic-generator
You need to claim something like “Y is required in order to produce sufficient Z for AGI”, not just that it produces additional Z. And I don’t buy that that’s the case. But also, I actually disagree with the premise that agents whose heuristic-generators are pointed merely instrumentally at G will have less reinforced/worse heuristics-generators than ones whose heuristic-generators are pointed terminally at G. IMO, learning the strategies that enable flexibly navigating the world is convergently useful and discoverable, in a way that is mostly orthogonal to whether or not the agent is pursuing the outer selection criterion.
Reinforcement and selection of behaviors/cognition does not just come from the outer optimizer, it also comes from the agent itself. That’s what I was hoping I was communicating with the 3 examples.
EDIT: I should add that I agree that, all else equal, the factors you listed in the section below are relevant to joint alignment + capability success:
In this, we can be more or less successful along various dimensions:
I don’t think of the chisel itself being “shaped around” the intended form, but rather, the chisel is a tool that is used to shape the statue so that the statue reflects that form
The dog example was helpful, thanks. Although I usually think in terms of training from scratch/a random initialization. Still: to train e. g. a paperclip-maximizer, you don’t have to start out reinforcing it for its paperclip-making ability, you might instead teach it a world-model and some basic skills first, etc. The reinforcement schedule should dynamically reflect the agent’s current capabilities, in a way, instead of being static!
There are some points I want to make here — primarily, that it’s a break from the pure blind greedy optimization algorithm I was discussing, if the outer optimizer is intelligent enough to take the agent’s current policy or internals into account. E. g., as the ST pointed out, human values and biases are inaccessible to the genome, so the reward circuitry is frozen, can’t dynamically respond in this fashion. Same for how ML models are currently trained most of the time.
But let’s focus on a more central point of disagreement for now.
Reinforcement and selection of behaviors/cognition does not just come from the outer optimizer, it also comes from the agent itself
Good point to highlight: I don’t understand how you expect it to work.
For capabilities to grow more advanced, they don’t need to be just reinforced. Marginally better performance needs to be marginally more reinforced, and the exploration policy needs to allow the agent to find said marginally better performance.
Consider a situation like CoastRunners, except where the model is primarily reinforced for winning the race. Suppose that, somehow, it learns to do donuts before winning the race. It always does a donut, in every scenario, and sometimes it wins and gets reinforced, and do-a-donut gets reinforced as well, so it’s never unlearned. But its do-a-donut behavior either never gets more advanced, or only gets more advanced inasmuch as it serves winning! It’ll never learn to do more elaborate and artful donuts (or whatever it “values” about donuts); it’ll only learn to do shorter and more winning-compatible donuts.
Consider your MuZero example. Suppose that “if the model sends out its king unprotected, output 1” is the entirety of the reward function we’re fine-tuning it on. The policy never does that… So it never gets reinforced on anything at all, so it never gets better at e. g. winning the game!
There is a way around it via gradient-hacking, of course. The model can figure out what gives it reinforcement events, recognize when it’s made an unusually good move, then “game” the system by sending out a king unprotected, in order to reinforce the entire chain of computations that led to it, which would include it showing unusual creativity.
Or the CoastRunners model can figure out what it values about donuts, and strive to make ever-more-artful donuts correlated with winning a race (e. g., by trying harder to win if it’s unusually satisfied with the cool donut move it just executed), so that its ability to donut better gets reinforced.
Is that roughly what you have in mind?
But this is an incredibly advanced capability. It requires the model to be situationally aware, already recognizing that it’s being trained scenario-to-scenario. It requires it to be reflective, able to evaluate its behavior according to its values on its own, instead of just blindly executing the adaptations it already has. It requires it to have advanced meta-cognition in general, where it’s able to reason about goals, the instrumental value of self-improvement, about what “reinforcement” does, about what seems to be causing reinforcement events, et cetera.
We don’t get to that point for free. We’d need to do a whole lot of heuristics-generator-reinforcement before the model will be able to do any of that. And until the model is advanced enough to take over like this, the outer optimizer will only preferentially reinforce creativity that serves the values implicit in the outer optimizer’s implementation, and will optimize against deploying that creativity for the model’s own values (by preferentially reinforcing only the cases where the model prioritizes “outer-approved” values over its “inner-only” ones).
That said, I’m open to counter-examples showing that the model can learn some simple way to do this kind of gradient-hacking.
I’m… a bit unclear on the details of your GPT-7 example; it vaguely seems like a possible counter, but I think it’s just because in it, the model can kind-of rewrite its reinforcement schedule? (The more it populates the database with Navy Seal copypastas, the more its outputting the Navy Seal copypasta gets reinforced, in a feedback loop?) But that really is a weird setup, I think.
Although I usually think in terms of training from scratch/a random initialization. Still: to train e. g. a paperclip-maximizer, you don’t have to start out reinforcing it for its paperclip-making ability, you might instead teach it a world-model and some basic skills first, etc.
Fair enough. I tend to switch between thinking about training from scratch vs. continuing from a pretrained initialization vs. something else. Always involving a substantial portion where the model does autonomous learning, though.
The reinforcement schedule should dynamically reflect the agent’s current capabilities, in a way, instead of being static!
There are some points I want to make here — primarily, that it’s a break from the pure blind greedy optimization algorithm I was discussing, if the outer optimizer is intelligent enough to take the agent’s current policy or internals into account. E. g., as the ST pointed out, human values and biases are inaccessible to the genome, so the reward circuitry is frozen, can’t dynamically respond in this fashion. Same for how ML models are currently trained most of the time.
Yeah I agree that from the standpoint of the overseer trying to robustly align to their goal, as well as from the standpoint of the outer optimizer “trying” to find criterion-optimal policies, it would be best if they could do a sort of dynamic/interactive reinforcement that tracks the development of the agent’s capabilities through training. That’s an area of research that excites me. I do think it will be sorta difficult because of symbol grounding / information inaccessibility / ontology identification problems, but probably not hopelessly so.
Reinforcement and selection of behaviors/cognition does not just come from the outer optimizer, it also comes from the agent itself
Good point to highlight: I don’t understand how you expect it to work.
I think there might be a misunderstanding here. The bolded text was not meant to be a proposal about some way to boost capability or alignment. It was meant to be a generic description of causal pathways through which autonomous learning shapes behavior/cognition. Compare to something like “Reinforcement and selection of traits/genes does not just come from a species growing its absolute population size, it also comes from individual organisms exercising selection (like a bird choosing the most brightly ornamented mate, even though that trait is ~orthogonal to absolute population growth)”.
For capabilities to grow more advanced, they don’t need to be just reinforced. Marginally better performance needs to be marginally more reinforced, and the exploration policy needs to allow the agent to find said marginally better performance.
Reality automatically hits back against poor capabilities, giving the agent feedback for its strategies (and for marginal changes to strategies) that in fact did or did not have the consequences that the agent intended them to have. Because of that, I expect that the reward function does not need to do all that much sophisticated directing, provided that the architecture and training paradigm are in the right ballpark (which they’ll need to be in order to feasibly produce AGI at all). The lion’s share of useful bits contributing to the agent’s capability development will come from the agent’s interaction with reality anyways, not from the reward function’s handholding.
Is that roughly what you have in mind?
No, that’s not what I had in mind. The examples aren’t supposed to be examples where we’re plausibly gonna get an AGI, they’re supposed to be examples that showcase how the agent can exercise selection, even very extreme levels of selection, in a way that decouples from the outer objective.
In general I expect we’d mostly see super simple motifs like “an agent picks up a reward-correlated or reward-orthogonal decision-influence early on, and by default that circuit sticks around and continues to somewhat influence the agent’s behavior, which exercises ‘selection’ through the policy for the rest of training”. A much less sexy and sophisticated form of gradient hacking than what you thought of.
Reality automatically hits back against poor capabilities, giving the agent feedback for its strategies (and for marginal changes to strategies) that in fact did or did not have the consequences that the agent intended them to have
Okay, but how does reinforcement happen, here? The CoastRunner model tries to execute a cool donut by outputting a particular pattern of commands, it succeeds, and — how does that get reinforced, if that pattern doesn’t also contribute to the agent winning the race? Where does the reinforcement event come from?
In addition, that self-teaching pattern where it can “intend” some consequences before executing a strategy, and would then evaluate the consequences that strategy actually had, presumably to then update its strategy-generating function — that’s also a fairly advanced capability that’d only appear after a lot of heuristics-generator-reinforcement, I think.
In general I expect we’d mostly see super simple motifs like “an agent picks up a reward-correlated or reward-orthogonal decision-influence early on, and by default that circuit sticks around and continues to somewhat influence the agent’s behavior, which exercises ‘selection’ through the policy for the rest of training”.
That sounds like my example with a donut-making agent whose donut-making artistry never gets reinforced; that just does the donut of the same level of artistry every time.
I don’t see how it’d robustly stick around. As long as there’s some variance in the shape of donuts the agent makes, it’d only get reinforced for making shorter donuts (because that’s correlated with it winning the race faster), and the donuts would get smaller and smaller until it stops doing them altogether.
(It didn’t happen in the actual CoastRunners scenario because it didn’t reward the model for winning the race, it rewarded it for smashing into objects.)
Okay, but how does reinforcement happen, here? The CoastRunner model tries to execute a cool donut by outputting a particular pattern of commands, it succeeds, and — how does that get reinforced, if that pattern doesn’t also contribute to the agent winning the race? Where does the reinforcement event come from?
Are we talking about the normal case where the agent can collect powerup rewards in the lagoon, or an imagined variant where we remove those? In both cases some non-outer reinforcement comes from the positive feedback loop between the policy’s behavior and the environment’s response. Like, I’m imagining that there’s a circuit that outputs a leftward steering bias whenever it perceives the boat to be in the lagoon, which when triggered by entering the lagoon has the effect of making the boat steer leftward, which causes the boat to go in a circle, which puts the agent back somewhere in the lagoon, which causes the same circuit to trigger as it again recognizes that the boat is in the lagoon. In the case where we’re keeping the powerups, that is an additional component in the positive feedback loop where collecting the powerups creates rewards which (not necessarily immediately, if offline) strengthen the circuit that led to the rewards. The total effect of this positive feedback loop is the donut behavior reinforcing itself.
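To spell the loop out mechanically, here’s a crude sketch (made-up names and numbers; the real circuit would be a learned feature, not a hand-written rule):

```python
# Crude sketch of the feedback loop: the circuit's own behavior keeps
# recreating the observation that triggers it, and in the powerup variant the
# resulting rewards also strengthen the circuit itself.

circuit_strength = 1.0        # how strongly "in lagoon -> steer left" bids
in_lagoon = True              # the boat happens to wander into the lagoon
powerups_present = True       # set False for the no-powerup variant
lr = 0.1

for t in range(50):
    action = "steer_left" if (in_lagoon and circuit_strength > 0.5) else "head_to_finish"

    # Environment response: steering left in the lagoon loops the boat around
    # and deposits it back in the lagoon, re-triggering the same circuit.
    in_lagoon = (action == "steer_left")

    # Outer reward only enters the picture if the loop passes over powerups.
    if powerups_present and action == "steer_left":
        circuit_strength += lr * 0.1   # reward-driven strengthening, on top of
                                       # the purely behavioral self-reinforcement

print(round(circuit_strength, 2), in_lagoon)
```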
In addition, that self-teaching pattern where it can “intend” some consequences before executing a strategy, and would then evaluate the consequences that strategy actually had, presumably to then update its strategy-generating function — that’s also a fairly advanced capability that’d only appear after a lot of heuristics-generator-reinforcement, I think.
Interesting, I don’t think of it as that particularly advanced, assuming that the agent’s cognitive architecture is suitable for autonomous learning. Like, when a baby is hungry but sees his bottle, and he sends neural impulses from cortex down to his arm because he intends to reach towards the bottle, and then those impulses make his arm go in a somewhat crooked direction, so he updates on the feedback that reality just gave him about the mapping between cortical firing activity and limb control, such that next time around there’s a better match between his intended motion and his perceived motion; that sort of thing strikes me as exactly the pattern I’m describing. As the baby develops, it scaffolds up to more complex and abstract intentions, along with strategies to achieve them, but the pattern is basically the same. It does (or imagines) things with intention and uses the world (or a learned world model) to get rich feedback.
That sounds like my example with a donut-making agent whose donut-making artistry never gets reinforced; that just does the donut of the same level of artistry every time.
I’m not really sure what example you’re talking about here, or what the issue with this is.
I don’t see how it’d robustly stick around.
It’s a neural circuit that exists in the network weights. Unless you actively disconnect or overwrite it, it won’t go anywhere.
As long as there’s some variance in the shape of donuts the agent makes, it’d only get reinforced for making shorter donuts (because that’s correlated with it winning the race faster), and the donuts would get smaller and smaller until it stops doing them altogether.
Are you talking about the alternative version where there are no powerups in the lagoon?
I may have lost the thread of the discussion here. It sounds like what you’re asking is something like “If we don’t give rewards to that tendency at all, then won’t we gradually select away from it as time goes on and we approach convergence, even if the tendency starts off slightly biasing the training trajectories?” If that’s what you’re asking, then I would say that that is true in theory, but that there’s no such thing as convergence in the real world.
Are we talking about the normal case where the agent can collect powerup rewards in the lagoon, or an imagined variant where we remove those?
I meant the imagined variant where we’re rewarding the agent for winning the race, yeah, sorry for not clarifying. I mean the same variant in the example down this comment.
Right, I think there’s some disconnect in how we’re drawing the agent/reward circuitry boundary. This:
Like, when a baby is hungry but sees his bottle, and he sends neural impulses from cortex down to his arm because he intends to reach towards the bottle, and then those impulses make his arm go in a somewhat crooked direction, so he updates on the feedback that reality just gave him
On my model, that’s only possible because humans learn on-line, and this update is made by the reward circuitry, not by some separate mechanism that the reward circuitry instilled into the baby. (And this particular example may not even be done via minimizing divergence from WM predictions, but via something like this.)
I agree that such a mechanism would appear eventually, even if the agent isn’t trained on-line, especially in would-be-AGI autonomous agents who’d need to learn in-context. But it’s not there by default.
Like, I’m imagining that there’s a circuit that outputs a leftward steering bias whenever it perceives the boat to be in the lagoon, which when triggered by entering the lagoon has the effect of making the boat steer leftward, which causes the boat to go in a circle, which puts the agent back somewhere in the lagoon, which causes the same circuit to trigger as it again recognizes that the boat is in the lagoon
How does that induce an update to the model’s parameters, though? We feed the model the current game-state as an input, it runs a forward pass, outputs “steer leftward”, we feed it the new game-state, it outputs “steer leftward” again, etc. — but none of that changes its circuits? The update only happens after the model completes the race.
And yes, at that point the do-a-donut circuits would get reinforced too, but they wouldn’t be preferentially reinforced for better satisfying the model’s values. Suppose the model, by its values, wants to make particularly “artful” donuts. Whether it makes particularly bad or particularly good donuts, they’d get reinforced the same amount at the end of the race. So the model would never get better at donut artistry as evaluated by its own values. The do-a-donut circuit would persevere if the model always makes donuts, but it’ll stay in its stunted form. No?
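For concreteness, here’s the toy picture of credit assignment I have in mind (assuming, as above, that the only update comes from the end-of-race reward, REINFORCE-style), where every action in a winning episode gets scaled by the same scalar return:

```python
# Toy picture: with updates only at the end of the race, every action in a
# winning episode is nudged by the same scalar return, so an "artful" donut
# and a sloppy one are reinforced identically.

def end_of_race_update(actions, episode_return, lr=0.01):
    # Hypothetical stand-in for the real optimizer: one nudge per action taken.
    return [(action_taken, lr * episode_return) for action_taken in actions]

sloppy_episode = ["sloppy_donut", "sloppy_donut", "head_to_finish"]
artful_episode = ["artful_donut", "artful_donut", "head_to_finish"]

# Both episodes end in a win worth 1.0; the donut circuits in each are
# reinforced by exactly the same amount, regardless of artistry.
print(end_of_race_update(sloppy_episode, episode_return=1.0))
print(end_of_race_update(artful_episode, episode_return=1.0))
```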
Right, I think there’s some disconnect in how we’re drawing the agent/reward circuitry boundary.
On my model, that’s only possible because humans learn on-line, and this update is made by the reward circuitry, not by some separate mechanism that the reward circuitry instilled into the baby.
Oh, huh. Yes the thing you’re calling the “reward circuitry”, I would call the “reward function and value function”. When I talk about the outer optimization criterion or R, in an RL setting I am talking about the reward function, because that is the part of the “reward circuitry” whose contents we actually specify when we set up the optimization loop.
The value function is some learned function that looks at the agent’s mental state and computes outputs that contribute to TD error calculation. TD errors are what determine the direction and strength with which circuitry gets updated from moment to moment. There needs to be a learned component to the updating process in order to do immediate/data-efficient/learned credit assignment over the mental state. (Would take a bit of space to explain this more satisfyingly. Steve has some good writing on the subject.)
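For concreteness, here’s the textbook tabular version of the bookkeeping I mean (plain TD(0); a toy sketch, not a claim about how an AGI’s internals will literally look):

```python
# Tabular TD(0): the learned value estimates generate an update signal at
# every step, including steps where the reward function itself outputs zero.

gamma, lr = 0.9, 0.1
V = {"start": 0.0, "mid": 0.0, "goal": 0.0}                  # learned value function
trajectory = [("start", 0.0, "mid"), ("mid", 1.0, "goal")]   # (state, reward, next_state)

for episode in range(200):
    for s, r, s_next in trajectory:
        td_error = r + gamma * V[s_next] - V[s]   # delta = r + gamma*V(s') - V(s)
        V[s] += lr * td_error                     # an update happens here, every step

print(V)   # value propagates back to "start" even though its own reward is 0
```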
That’s roughly my model of how RL works in animals, and how it will work in autonomous artificial agents. Even in an autonomous learning setup that only has prediction losses over observations and no reward, I would still expect the agent to develop something like intentions and something like updating pretty early on. The former as representations that assist it in predicting its future observations from its own computations/decisions, and the latter as a process to correct for divergences between its intentions and what actually happens[1].
How does that induce an update to the model’s parameters, though? We feed the model the current game-state as an input, it runs a forward pass, outputs “steer leftward”, we feed it the new game-state, it outputs “steer leftward” again, etc. — but none of that changes its circuits? The update only happens after the model completes the race.
And yes, at that point the do-a-donut circuits would get reinforced too, but they wouldn’t be preferentially reinforced for better satisfying the model’s values.
By itself, this behavior-level reinforcement does not necessarily lead to parameter updates. If the only time when parameters get updated is when reward is received (this would exclude bootstrapping methods like TD for instance), and the only reward is at the end of the race, then yeah I agree, there’s no preferential updating.
But behavior-level reinforcement definitely changes the distribution of experiences that the agent collects, and in autonomous learning, the parameter updates that the outer optimizer makes depend on the experiences that the agent collects[2]. So depending on the setup, I expect that this sort of extreme positive feedback loop may either effectively freeze the parameters around their current values, or else skew them based on the skewed distribution of experiences collected, which may even lead to more behavior-level reinforcement and so on.
Suppose the model, by its values, wants to make particularly “artful” donuts. Whether it makes particularly bad or particularly good donuts, they’d get reinforced the same amount at the end of the race. So the model would never get better at donut artistry as evaluated by its own values. The do-a-donut circuit would persevere if the model always makes donuts, but it’ll stay in its stunted form. No?
Not sure off the top of my head. Let’s see.
If the agent “wants” to make artful donuts, that entails there being circuits in the agent that bid for actions on the basis of some “donut artfulness”-related representations it has. Those circuits push the policy to make decisions on the basis of donut artfulness, which causes the policy to try to preferentially perform more-artful donut movements when considered, and maybe also suppress less-artful donut movements.
If the policy network is recurrent, or if it uses attention across time steps, or if it has some other form of memory, then it is possible for it to “practice” its donuts within an episode. This would entail some form of learning that uses activations rather than weight changes, which has been observed to happen in these memoryful architectures, sometimes without any specific losses or other additions to support it (like in-context learning). By the end, the agent has done a bunch of marginally-more-artful donuts, or its final few donuts are marginally more artful (if actions temporally closer to the reward are more heavily reinforced), or its donut artfulness is more consistent.
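Here’s a minimal sketch of the within-episode “practice” I’m gesturing at (entirely hypothetical, and far dumber than what a real recurrent policy would learn, but the mechanism has the same shape: only activations change, no weights do):

```python
# Within-episode "practice": the only thing changing during the episode is
# carried-over state (activations), while the single "weight" stays frozen.

FROZEN_CORRECTION_GAIN = 0.5   # a weight; never updated inside the episode
target_turn_rate = 1.0         # what an "artful" donut would look like here
turn_rate = 0.2                # first attempt: not very artful

for attempt in range(5):
    error = target_turn_rate - turn_rate          # feedback from the environment
    print(f"attempt {attempt}: turn_rate={turn_rate:.2f}, error={error:+.2f}")
    # The frozen policy uses the observed error (held in activations / memory)
    # to adjust its next attempt; no weight is modified anywhere in this loop.
    turn_rate += FROZEN_CORRECTION_GAIN * error

# Later donuts land closer to the target than earlier ones, within one episode.
```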
Now, if the agent is always doing donuts (like, it never ever breaks out of that feedback loop), and we’re in the setting where the only way to get parameter updates is upon receiving a reward, then no the agent will never get better across episodes. But if it is not always doing donuts, then it can head to the end of the race after it completes this “practice”. That should differentially reinforce the “practiced” more-artful donuts over less-artful donuts, right?
(To be clear, I don’t think that the real CoastRunners boat agent was nearly sophisticated enough to do this. But neither was it sophisticated enough to “want” to do artful donuts, so I feel like it’s fair to consider.)
Is there something specific you wanted to probe with this example? Again, I don’t quite know how I should be relating this example to the rest of what we’ve been talking about.
Is there something specific you wanted to probe with this example?
On my end, the argument structure goes as follows (going from claims I’m making to sub-claims that try to justify them):
1. AGI-level training setups attempt to build models primarily concerned with optimizing hard for some context-independent proxy of “outer-approved” values.
2. To get to AGI, we need a training setup that incentivizes heuristics generators, and systemically improves these generators’ capabilities.
3. To do that, we need a setup that a) explores enough to find marginally better heuristics-generator performance, and b) preferentially reinforces marginally better heuristics-generator performance over stagnant or worse performance.
4. To do that, we need some metric for “better performance”. One such metric is the outer optimizer’s reward function. Another such metric would be the model’s own values.
5. For the model to improve its performance across training episodes according to its own values (in ways that are orthogonal/opposed to outer-approved values), it needs to either:
   a. Do advanced gradient-hacking, i. e. exploit the reinforcement machinery for its own purposes. That itself requires advanced general capabilities, though, so Catch-22.
   b. Learn in-context, in a way that’s competitive with learning across episodes, such that its capabilities across only-inner-approved metrics don’t grow dramatically slower than along outer-approved metrics.
6. I argue that 5b is also a Catch-22, in that it requires a level of sophistication that’ll only appear after the heuristics generator has already become very developed.
7. So if a model can’t quickly learn to learn in-context, then for most of its training, the sophistication of its features can only improve in ways correlated with performance improvements on outer-approved values. Since “features” include the heuristics generator, the only way for the heuristics generator to grow more advanced would be by becoming better at achieving outer-approved values, so the heuristics generator in AGI-level systems will be shaped to primarily care about correlates of outer-approved values.
We’re now trying to agree on whether models can quickly learn some machinery for comprehensively improving in-context along metrics that are orthogonal/opposed to the “outer-approved” values.
If no, then the heuristics generator will tend to be shaped to align with outer-approved values, and AGI-capable training setups will result in a wrapper-mind-approximation.
If yes, then there would be no strong pressure to point the heuristics generator in a particular abstract direction across contexts, and we would not get a wrapper-mind-approximation.
I think that it’s a crux for me, in that if I’m unconvinced of (6), I’d have to significantly re-evaluate my model of value formation, likely in favour of mainline shard-theory views.
Okay, onto object-level:
The value function is some learned function that looks at the agent’s mental state and computes outputs that contribute to TD error calculation
Very interesting. I really need to read Steve’s sequence. As I don’t have a good model of how that works yet (or how it’d be implemented in a realistic AGI setup), it’s hard for me to evaluate how that’d impact my view. I’ll read the linked post and come back to this. Would also welcome links to more resources on that.
If the policy network is recurrent, or if it uses attention across time steps, or if it has some other form of memory, then it is possible for it to “practice” its donuts within an episode. This would entail a form of learning that uses activations rather than weight changes, like in-context learning, which has been observed to happen in these memoryful architectures, sometimes without any specific losses or other additions to support it
Any examples, off the top of your head?
Potential concerns (assumes no TD learning):
Even if it’s possible to easily learn to improve in-context, would the tendency or ability to do that be preferentially reinforced, if that itself is not outer-value-approved? E. g., suppose the model develops a predilection towards making donuts, and some arbitrary metric of how an “artful” donut looks like. How would it learn to practice its donut-making ability, as you described it? Why would that be reinforced by the outer optimizer?
I. e., I’d expect it to go as follows by default: The donut-valuing circuit sees that it’s in a lagoon, bids for “steer left” several times, in a way that strikes it as artful. The boat completes a loop. The donut-valuing circuit sees that it’s in a lagoon again, and bids for another loop, which it also tries to make artful...
But it doesn’t learn. It doesn’t integrate the experience from the first donut it just made to make its second donut a bit better. That’d require specialized in-context-learning machinery, and—
How would you predict it to be developed?
I think this assumes, again, some general-purpose self-improvement machinery, where the model generally gets better at anything it tries. Maybe it’s your carry-over assumption from TD learning? But if there’s no TD learning, this should be difficult to learn from scratch.
Would that in-context learning actually be competitive with in-context learning that’s also directly bolstered by preferential reinforcement from the outer optimizer? E. g., the model values both donut-making and race-winning. In a given episode, it makes a few donuts, then tries to win the race faster. The outer reinforcement system updates its weights. As the result...
Its ability to make donuts improves by ΔQdonut = (quality of its final donut) minus (quality of its first donut). Its ability to win races improves by ΔQrace = (quality of maneuvering at the end of the race) minus (quality of maneuvering at the start of the race). Those values should be at best roughly equal[1].
But in addition to in-context learning, the model’s ability to win races is also reinforced episode-to-episode! Whenever it manages to win a race unusually fast, that preferentially gets reinforced more; there’s cross-episode selection for better race-winning strategies. By comparison, there’s no such thing for donut artistry: the model only gets better at it by in-episode practice, not by selecting from cross-episode variance.
Also, here’s a formalization for the first concern, i. e. that the model’s in-context learning won’t be improved:
If ΔQrace for a particular episode is unusually high, i. e. the model showed unusual ability to get better at winning races in-context, the in-context learning machinery gets preferentially updated to be more likely to show this performance (than if it showed subpar or ordinary in-context learning).
By comparison, if ΔQdonut is unusually high, i. e. the model got better at getting better at donut-making, that doesn’t get reinforced more than if it got worse at getting better at donut-making (i. e., than if ΔQdonut this episode was below average).
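Or, compressing those two points into notation (the same claim restated, if we treat the reinforcement of the in-context-learning machinery as scaling with how the episode return co-varies with in-context improvement):

$$\operatorname{Cov}(\text{episode return},\ \Delta Q_{\text{race}}) > 0 \quad\Rightarrow\quad \text{race-related in-context learning gets preferentially reinforced},$$
$$\operatorname{Cov}(\text{episode return},\ \Delta Q_{\text{donut}}) \approx 0 \quad\Rightarrow\quad \text{donut-related in-context learning does not}.$$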
Although I’d expect improvements on maneuvering to be higher than on donut-making, because I’d expect the in-context learning machinery for race-winning to be more advanced than the one for donut-making (as the outer optimizer would preferentially reinforce it). See the first concern.
AGI-level training setups attempt to build models primarily concerned with optimizing hard for some context-independent proxy of “outer-approved” values.
To get to AGI, we need a training setup that incentivizes heuristics generators, and systemically improves these generators’ capabilities.
I think 2 is probably true to a certain extent. But maybe not to the same extent that you are imagining. Like, I think that the primary thing that will drive the developing agent’s heuristic-generation becoming better and better is its interaction with a rich world where it can try out many different kinds of physical and mental strategies for achieving different (sub)goals. So you need to provide a rich world where there are many possible natural (sub)goals to pursue and many possible ways to try to pursue them (unlike CoastRunners, where there aren’t), and you need to architect the agent so that it is generally goal-directed, and it would probably be helpful to even do the equivalent of “putting the AI in school” / “having the AI read books” to give it a little kickstart. But that’s about all I’m imagining. I am not imagining that you need to construct your training environment to specifically incentivize all of the different facets of heuristic-generation. As the agent pursues the goals that it pursues in a complex world, it is incentivized to learn because learning is what helps it achieve its goals better.
1 seems probably false to me. If you mean that AGI-level setups, in order to work, need to be primarily concerned with that, then I definitely disagree. Like, imagine that in order to build up the AI’s cognition & skills from some baseline, you teach it that every “training day” it will experience repeated trials of some novel task, and that for every trial it completes, it’ll get some object-level thing it likes (for rats this might be sugar water, for kids this might be a new toy, for adults this might be money). The different tasks can all have different success criteria and they don’t have to have anything to do with human value proxies for this to work, right?
If you just mean that when people build AGI-level training setups, “optimizing hard for some context-independent proxy of ‘outer-approved’ values” that is what those people will have in mind in their designs, then I dunno. I don’t really feel justified in making an assumption about what considerations they’ll have in mind.
For the model to improve its performance across training episodes according to its own values (in ways that are orthogonal/opposed to outer-approved values), it needs to either:
A few points.
I think that training setups that do not facilitate something like bootstrapping (i.e. modifying parameters even in some cases where there was no reward), are not competitive and will not produce AGIs. Think about how awful and slow it would be, trying to learn how to do any new and complex task, if the only time you actually learned anything was in the extremely rare instance where you happen to bumble your way through to success. No learning from mistakes, no learning incrementally from checkpoints or subgoals you set, no learning from mere exploration. I think that this sort of “learning on your own” is intimately tied with autonomy. But that is also exactly what enables you to reinforce & improve yourself in directions other than toward the outer optimization criterion.
To get an AGI-level model that pursues something other than the outer optimization criterion (what I was arguing at the top of the thread → that we don’t get an R-pursuer) under some setup, it does not need to be true that the model early in training improves its performance according to its own values in ways that are orthogonal/opposed to the outer-approved values. Think about some of the other conditions where we can get a non-R-pursuer:
Maybe the model doesn’t have any context-agnostic “values” (not even “values” about pursuing R) until after it has some decent heuristic-generation machinery built up.
OR (the most likely scenario, IMO) maybe the outer objective performance is in fact correlated with the model’s ability to perform well according to its own values. For instance, the training process is teaching the model better general purpose heuristics-generation machinery, which will also make it better at pursuing its own values (because that machinery is generally useful no matter what your goals are).
So if a model can’t quickly learn to learn in-context, then for most of its training, the sophistication of its features can only improve in ways correlated with performance improvements on outer-approved values. Since “features” include the heuristics generator, the only way for the heuristics generator to grow more advanced would be by becoming better at achieving outer-approved values, so the heuristics generator in AGI-level systems will be shaped to primarily care about correlates of outer-approved values.
I don’t know why we are talking about “outer-approved values” G here. The influence of those outer-approved values on the AI is screened off by the concrete optimization criterion R that the designers of the training process chose when they wrote the training loop. Aren’t we talking about R-pursuers? (or R-pursuers that are wrapper-minds? I forget if you are still looking to make the case for wrapper-mind structure or merely R-pursuing behavior.)
But also this bit
so the heuristics generator in AGI-level systems will be shaped to primarily care about correlates of outer-approved values
does not follow from the rest of the argument. Why can’t the heuristics-generator be shaped to be a good general purpose heuristic-generator, one that the agent uses to perform well on the outer optimization criteria? Making your general-purpose heuristic-generator better is something that would always be reinforced, right? There’s no need for the heuristic-generator to care (or even know) about the outer criterion at all, if the agent is using the heuristic-generator as a flexible tool for accomplishing things in episodes. Like, why not have separation of concerns, where the heuristic-generator is a generic subroutine that takes in a (sub)goal, and there’s some other component in the agent that knows what the outer objective is?
It’s not like thinking about the context-independent goal of “win the race” will help the agent once it’s already figured out that the way to “win the race” in this environment is to first “build a fast boat”, and it now needs to solve the subproblem of “build a fast boat”. If anything, always being forced to think about the context-independent criterion is actively harmful, distracting the agent from the information that is actually decision-relevant to the subtask at hand. It also seems like it’d be hard to make a heuristic-generator that is narrowly specialized for “winning the race”, and not one that the agent can plug basically arbitrary (sub)goals into, because you’re throwing the agent into super diverse environments where what it takes to “win the race” is changing dramatically.
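(A toy sketch of the separation of concerns I have in mind; every name here is hypothetical, just to make the shape concrete: the heuristic-generator is a generic subroutine over whatever (sub)goal it's handed, and only a separate planner component knows the outer objective:)

```python
# Hypothetical sketch of the "separation of concerns" structure described above.
# Not a real architecture; every name here is made up to pin down the shape.

def heuristic_generator(subgoal: str, context: dict) -> list[str]:
    """Generic subroutine: proposes heuristics for whatever (sub)goal it is
    handed. It knows nothing about the outer objective."""
    return [f"candidate heuristic for '{subgoal}' given features {sorted(context)}"]

def decompose(objective: str, context: dict) -> list[str]:
    # Stand-in decomposition: in the running example, "win the race" in a
    # water level might factor into "build a fast boat", then "pilot it".
    if context.get("terrain") == "water":
        return ["build a fast boat", "pilot the boat to the finish line"]
    return [objective]

def planner(outer_objective: str, context: dict) -> list[str]:
    """The only component that knows the outer objective; it decomposes it and
    feeds the pieces to the generic heuristic-generator."""
    plan = []
    for subgoal in decompose(outer_objective, context):
        plan += heuristic_generator(subgoal, context)
    return plan

print(planner("win the race", {"terrain": "water"}))
```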
We’re now trying to agree on whether models can quickly learn some machinery for comprehensively improving in-context along metrics that are orthogonal/opposed to the “outer-approved” values.
For the agent to adopt values that differ from pursuing R/G (once again, I don’t think they need to be orthogonal/opposed to R/G, because aren’t you defending the claim that the agent will value R/G, not that it will merely value some correlate of it? I already believe that the agent will probably value some correlate), this machinery doesn’t need to be learned “quickly” in any absolute sense; it just needs to outpace the outer optimizer’s process of instilling its objective into context-independent values in the agent. But note that the agent doesn’t start off having context-independent values; having values like those in the first place is something I don’t expect to happen until relatively “late” in cognitive development, and at that point I’m not sure “who gets there first”, so to speak.
If yes, then there would be no strong pressure to point the heuristics generator in a particular abstract direction across contexts, and we would not get a wrapper-mind-approximation.
Like I said above, I think that constraining the heuristic-generator to always point at some specific abstract direction across contexts is at least unnecessary for the agent to do well and become smart (because it can factor out that abstract direction and input it when needed as the heuristic-generator’s current subgoal, and because improvements to the heuristic-generator are general-purpose), and possibly actively harmful for its usefulness to the agent.
Would also welcome links to more resources on that
This post from Steve and its dependencies is probably the best conceptual walkthrough of an example that I’ve seen. Sutton & Barto have an RL textbook with lots of good mathematical content on this.
Even if it’s possible to easily learn to improve in-context, would the tendency or ability to do that be preferentially reinforced, if that itself is not outer-value-approved? E. g., suppose the model develops a predilection towards making donuts, and some arbitrary metric of what an “artful” donut looks like. How would it learn to practice its donut-making ability, as you described it? Why would that be reinforced by the outer optimizer?
This doesn’t apply to the CoastRunners example because we are only doing rewards & weight updates at the end of the episode, but in other contexts (say, where there are multiple trials in a row, without “resets”) it can learn to practice the thing that gets rewards, and build a generalized skill around practicing, one that carries across subgoals.
Meta-level comment: I think you’re focused on what the likely training trajectories are for this particular CoastRunners example, and I am focused on what the possible training trajectories are, given the restrictions in place. I can’t tell a story about likely gradient hacking[1] there because the mechanisms that would exist in an AGI-compatible training setup that would make gradient hacking plausible have been artificially removed. The preconditions of the scenario make me think “How in the heck did we get to this point in training?”: the agent is somehow so cognitively naive that it doesn’t have any concept of learning from trial-and-error, but it’s simultaneously so cognitively sophisticated that it already has a concept of “doing a donut” and of what makes a donut “artful” and a desire around making its donuts continually more artful.
Using “gradient hacking” as a shorthand for circuits that are opposed to, orthogonal to, or merely correlated with the outer objective durably reinforcing themselves.
I think that training setups that do not facilitate something like bootstrapping (i.e. modifying parameters even in some cases where there was no reward), are not competitive and will not produce AGIs
Yeah, I see that’s one of the main points of disconnect between our models. Not in the sense that I necessarily disagree, in the sense that I wasn’t familiar with this factor. We probably aren’t going to resolve this conclusively until I get around to reading the TD stuff (which I plan to do shortly).
Thanks for the links!
Maybe the model doesn’t have any context-agnostic “values” (not even “values” about pursuing R) until after it has some decent heuristic-generation machinery built up.
What’s it using the heuristics generator for, then? It’s a tool for figuring out how to pursue a goal in a context you’re unfamiliar with. But if you have no context-independent goals, you can’t define a goal over a context you’re unfamiliar with, so you don’t need the heuristics generator to begin with.
OR (the most likely scenario, IMO) maybe the outer objective performance is in fact correlated with the model’s ability to perform well according to its own values. For instance, the training process is teaching the model better general purpose heuristics-generation machinery, which will also make it better at pursuing its own values (because that machinery is generally useful no matter what your goals are).
Absolutely, I expect that to be the primary reason for deceptive alignment — once the model is smart enough for it.
But in this case, I argue that the heuristics generator will only be reinforced if its activity results in better performance along an outer-approved metric, which will only happen if it’s outputting heuristics useful for the outer-approved metric — which, in turn, will only happen if the model uses the heuristics generator to generate heuristics for an outer-approved value.
I’m not arguing that the heuristics generator will be specialized; I’m arguing that its improvements will be entangled with how it’s used.
E. g., two training episodes: in one the model asks for better heuristics for winning the race, in the other it asks for better donut-making heuristics.
In the former case, the heuristics generator will be reinforced, together with the model’s tendency to ask it for such heuristics.
In the latter, it wouldn’t be improved, nor would the tendency to ask it for this be reinforced.
Repeat over many episodes, and we get a very advanced general-purpose heuristics generator that is only ever prompted for race-winning heuristics.
(Or, rather, that the “command structure” around the heuristics generator is such that it only uses it in service of generating heuristics useful for winning the race. If the model can factorize the problem down to “build a boat” before engaging the heuristics generator, that works for me — as long as it arrived at this ask by reasoning from what will help it win the race. I’m not saying it’d be unable to use the heuristics generator flexibly.)
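(A toy numerical rendering of the two-episode dynamic above, with a made-up setup, just to illustrate “no reward means no update”: both the tendency to query the generator for a given kind of heuristic and the generator's skill at that kind of query only change on rewarded episodes:)

```python
import numpy as np

# Toy illustration (made-up setup): REINFORCE-flavored updates where an
# unrewarded episode produces no update, so neither the "query tendency"
# nor the generator's skill improves along the donut direction.
rng = np.random.default_rng(0)
query_logits = np.zeros(2)      # 0 = "ask for race-winning heuristics", 1 = "ask for donut heuristics"
generator_skill = np.zeros(2)   # how refined the generator is for each kind of query

for _ in range(2000):
    probs = np.exp(query_logits) / np.exp(query_logits).sum()
    q = rng.choice(2, p=probs)
    reward = 1.0 if q == 0 else 0.0          # only race-winning queries pay off in this toy
    # Simplified policy-gradient-style update: scaled by reward, so reward == 0 means no change.
    query_logits[q] += 0.1 * reward * (1 - probs[q])
    generator_skill[q] += 0.1 * reward

print(query_logits, generator_skill)
# The race-winning query tendency and skill grow; the donut entries stay at 0.
```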
aren’t you defending the claim that the agent will value R/G, not that it will merely value some correlate of it?
Ehh, not exactly. I’m defending the claim that the agent will tend to be shaped to care about increasingly closer correlates of G as training goes on; and that in a hypothetical “idealized” training setup, it’d care about G precisely. When I say things like “the heuristics generator will be asked for race-winning heuristics”, I really mean “the heuristics generator will be asked for heuristics that the model ultimately intends to use for a goal that is a close correlate of winning the race”, but that’s a mouthful.
Basically, I think there are two forces there:
What ultimate goals the heuristics generator is used to pursue.
How powerful the heuristics generator is.
And the more powerful it is, the more tails come apart — the closer the goal it’s used for needs to be to G, for the agent’s performance on G to not degrade as the heuristics generator’s power grows (because the model starts being able to optimize for G-proxy so hard it decouples from G). So, until the model learns deceptive alignment, I’d expect it to go in lockstep: a little improvement to power, then a little improvement to alignment-to-G to counterbalance it, etc.
And so in the situation where the outer optimizer is the only source of reinforcement, we’d have the heuristics generator either:
Stagnate at some “power level” (if the model adamantly refuses to explore towards caring more about G).
Become gradually more and more pointed at G (until it becomes situationally aware and hacks out, obviously — which, outside idealized setups, will surely happen well before it’s actually pointed at G directly).
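(A toy simulation of the “tails come apart” point above, with made-up numbers: the harder the heuristics generator optimizes a proxy that is merely correlated with G, the more the selected outcome's proxy score outruns its actual G-value:)

```python
import numpy as np

# Toy regressional-Goodhart illustration (all numbers made up): proxy = G + noise.
# Selecting the best candidate according to the proxy, out of increasingly many
# candidates (i.e. applying more optimization power), decouples proxy from G.
rng = np.random.default_rng(0)

for n_candidates in [10, 1_000, 100_000]:
    g = rng.normal(size=n_candidates)              # "true" G-value of each candidate plan
    proxy = g + rng.normal(size=n_candidates)      # what the heuristics generator optimizes
    best = np.argmax(proxy)
    print(f"{n_candidates:>7} candidates: proxy={proxy[best]:.2f}, G={g[best]:.2f}")
# As the candidate pool grows, the proxy score of the selected plan keeps climbing
# while its G-value tends to lag further and further behind.
```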
What’s it using the heuristics generator for, then? It’s a tool for figuring out how to pursue a goal in a context you’re unfamiliar with. But if you have no context-independent goals, you can’t define a goal over a context you’re unfamiliar with, so you don’t need the heuristics generator to begin with.
Why can’t you? Observations coming in from the environment, together with the agent’s internal state, will activate some contextual decision-influences in the agent’s mind. Situational unfamiliarity does not mean its mind goes blank, any more than an OOD prompt makes GPT’s mind go blank. The agent is gonna think something when it wakes up in an environment, and that something will determine how and when the agent will call upon the heuristic-generator. Maybe it first queries it with a subgoal of “acquire information about my action space” or something, I dunno.
The agent that has a context-independent goal of “win the race” is in a similar predicament: it has no way of knowing a priori what “winning the race” requires or consists of in this unfamiliar environment (neither does its heuristic-generator), no way to ground this floating motivational pointer concretely. It’s gotta try stuff out and see what this environment actually rewards, just like everybody else. The agent could have a preexisting desire to pursue whatever “winning the race” looked like in past experiences. But I thought the whole point of this randomization/diversity business was to force the agent to latch onto “win the race” as an exclusive aim and not onto its common correlates, by thrusting the agent into an unfamiliar context each time around. If so, then previous correlates shouldn’t be reliable correlates anymore in this new context, right? Or else it can just learn to care about those rather than the goal you intended.
So I don’t see how the agent with a context-independent goal has an advantage in this setup when plopped down into an unfamiliar environment.
I’m not arguing that the heuristics generator will be specialized; I’m arguing that its improvements will be entangled with how it’s used.
I agree with this.
Repeat over many episodes, and we get a very advanced general-purpose heuristics generator that is only ever prompted for race-winning heuristics.
Why? I was imagining that the agent may prompt the heuristic-generator at multiple points within a single episode, inputting whatever subgoal it currently needs to generate heuristics for. If the agent is being put in super diverse environments, then these subgoals will be everything under the sun, so the heuristic-generator will have been prompted for lots of things. And if the agent is only being put in a narrow distribution of environments, then how is the heuristic-generator supposed to learn general-purpose heuristic-generation?
(Or, rather, that the “command structure” around the heuristics generator is such that it only uses it in service of generating heuristics useful for winning the race. If the model can factorize the problem down to “build a boat” before engaging the heuristics generator, that works for me — as long as it arrived at this ask by reasoning from what will help it win the race. I’m not saying it’d be unable to use the heuristics generator flexibly.)
Can there be additional layers of “command structure” on top of that? Like, can the agent have arrived at the “reasoning from what will help it win the race” thought by reasoning from something else? (Or is this a fixed part of the architecture?) If not, then won’t this have the problem that for a long time, the agent will be terrible at reasoning about what will help it win the race (especially in new environments), which means that starting with that will be a worse-performing strategy than starting with something else (like random exploration etc.)? And then that will disincentivize making this the first/outermost/unconditional function call? So then the agent learns not to unconditionally start with reasoning from that point, and instead to only sometimes reason from that point, conditional on context?
I’m defending the claim that the agent will tend to be shaped to care about increasingly closer correlates of G as training goes on
Hmm. I am skeptical of that claim, though maybe less so depending on what exactly you mean[1].
Consider a different claim that seems mechanistically analogous to me:
The mean absolute fitness of a population tends to increase over the course of natural selection
Yes it is true that [differential reinforcement | relative fitness] is a selection pressure acting on the makeup of [things cared about | traits] across the [circuits | individuals] within a [agent | population], but AFAICT it is not true that the [agent | population] increases in [reward performance | absolute fitness] over the course of continual selection pressure.
So, until the model learns deceptive alignment, I’d expect it to go in lockstep: a little improvement to power, then a little improvement to alignment-to-G to counterbalance it, etc.
Yeah that may be a part of where our mental models differ. I don’t expect the balance of how much power the agent has over training vs. how close its goals are to the outer criterion to go in lockstep. I see “deceptive alignment” as part of a smooth continuum of agent-induced selection that can decouple the agent’s concerns from the optimization process’ criteria, with “the agent’s exploration is broken” as a label for the cognitively less sophisticated end of that continuum, and “deceptive alignment” as a label for the cognitively more sophisticated end of that continuum. And I think that even the not-explicitly-intended pressures at the unsophisticated end of that continuum are quite strong, enough to make the “the agent tends to be shaped to care about increasingly closer correlates of G” abstraction leak hard.
Like, for a given training run, as the training run progresses, the agent will be shaped to care about closer and closer correlates of G? (Just closer on average? Monotonically closer? What about converging at some non-G correlate?) Or like, among a bunch of training runs, as the training runs progress, the closeness of the [[maximally close to G] correlate that any agent cares about] to G keeps increasing?
Hmm, I wonder if we actually broadly agree about the mechanistic details, but are using language that makes both of us think the other has different mechanistic details in mind?
(Also, do note if I’m failing to answer some important question you pose. I’m trying to condense responses and don’t answer to everything if I think the answer to something is evident from a model I present in response to a different question, but there may be transparency failures involved.)
Can there be additional layers of “command structure” on top of that? Like, can the agent have arrived at the “reasoning from what will help it win the race” thought by reasoning from something else?
Mm, yes, in a certain sense. Further refining: “over the course of training, agents tend to develop structures that orient them towards ultimately pursuing a close correlate of G regardless of the environment they’re in”. I do imagine that a given agent may orient themselves towards different G-correlates depending on what specific stimuli they’ve been exposed to this episode/what context they’ve started out in. But I argue that it’ll tend to be a G-correlate, and that the average closeness of G-correlates across all contexts will tend to increase as training goes on.
E. g., suppose the agent is trained on a large set of different games, and the intended G is to teach it to value winning. I argue that, if we successfully teach the agent autonomy (i. e., it wouldn’t just be a static bundle of heuristics, but it’d have a heuristics generator that’d allow it to adapt even to OOD games), there’d be some structure inside it which:
Analyses the game it’s in[1] and spits out some primary goal[2] it’s meant to achieve in it,
… and then all prompting of the heuristics-generator is downstream of that primary goal/in service to it,
… and that environment-specific goal is always a close correlate of G, such that pursuing it in this environment correlates with promoting G/would be highly reinforced by the outer optimizer[3],
… and that as training goes on, the primary environment-specific goals this structure spits out will be closer and closer to G.
I see “deceptive alignment” as part of a smooth continuum of agent-induced selection that can decouple the agent’s concerns from the optimization process’ criteria, with “the agent’s exploration is broken” as a label for the cognitively less sophisticated end of that continuum
Sure, but I’m arguing that on the cognitively less sophisticated end of the spectrum, in the regime where the outer optimizer is the only source of reinforcement, the agent’s features can’t grow more sophisticated if the agent’s concerns decouple from the optimization process’ criteria.
The agent’s goals can decouple all they want, but it’ll only grow more advanced if its growing more advanced is preferentially reinforced by the outer optimizer. And that’ll only happen if its being more advanced is correlated with better performance on outer-approved metrics.
Which will only happen if it uses its growing advancedness to do better at the outer-approved metrics.
Which can happen either via deceptive alignment, or by it actually caring about the outer-approved metrics more (= caring about a closer correlate of the outer-approved metrics (= changing its “command structure” such that it tends to recover environment-specific primary goals that are a closer correlate of the outer-approved metrics in any given environment)).
And if it can’t yet do deceptive alignment, and its exploration policy is such that it just never explores “caring about a closer correlate of the outer-approved metrics”, its features never grow more advanced.
Which may be done by active actions too, as you suggested — this process might start with the agent setting “acquire information about my environment” as its first (temporary) goal, even before it derives its “terminal” goal.
Hmm, I wonder if we actually broadly agree about the mechanistic details, but are using language that makes both of us think the other has different mechanistic details in mind?
Maybe? I dunno. It feels like the model that you are arguing for is qualitatively pretty different from the one I thought you were arguing for at the top of the thread (this might be my fault for misinterpreting the OP):
You are arguing about agents being behaviorally aligned in some way on distribution, not arguing about agents being structured as wrapper-minds
You are arguing that in the limit, what the agent cares about will either tend to correlate more and more closely to outer performance or “peter out” (from our perspective) at some fixed level of sophistication, not arguing that in the limit, what the agent cares about will unconditionally tend to correlate more and more closely to outer performance
You are arguing that agents of growing sophistication will increasingly tend to pursue some goal that’s a natural interpretation of the intent of R, not arguing the agents of growing sophistication will increasingly tend to pursue R itself (i.e. making decisions on the basis of R, even where R and the intended goal come apart)
You are arguing that the above holds in setups where the only source of parameter updates is episodic reward, not arguing that the above holds in general across autonomous learning setups
I don’t think I disagree all that much with what’s stated above. Somewhat skeptical of most of the claims, but I could definitely be convinced.
(Also, do note if I’m failing to answer some important question you pose. I’m trying to condense responses and don’t answer to everything if I think the answer to something is evident from a model I present in response to a different question, but there may be transparency failures involved.)
The part I think I’m still fuzzy on is why the agent limits out to caring about some correlate(s) of G, rather than caring about some correlate(s) of R.
Sure, but I’m arguing that on the cognitively less sophisticated end of the spectrum, in the regime where the outer optimizer is the only source of reinforcement, the agent’s features can’t grow more sophisticated if the agent’s concerns decouple from the optimization process’ criteria.
That’s fine. Again, I don’t think the setups where end-of-episode rewards are the only source of reinforcement are setups where the agent’s cognition can grow relevantly sophisticated in any case, regardless of decoupling.
Mm, yes, in a certain sense. Further refining: “over the course of training, agents tend to develop structures that orient them towards ultimately pursuing a close correlate of G regardless of the environment they’re in”. I do imagine that a given agent may orient themselves towards different G-correlates depending on what specific stimuli they’ve been exposed to this episode/what context they’ve started out in. But I argue that it’ll tend to be a G-correlate, and that the average closeness of G-correlates across all contexts will tend to increase as training goes on.
Hmm I don’t understand how this works if we’re randomizing the environments, because aren’t we breaking those correlations so the agent doesn’t latch onto them instead of the real goal? Also, in what you’re describing, it doesn’t seem like this agent is actually pursuing one fixed goal across contexts, since in each context, the mechanistic reason why it makes the decisions it does is because it perceives this specific G-correlate in this context, and not because it represents that perceived thing as being a correlate of G.
… and that as training goes on, the primary environment-specific goals this structure spits out will be closer and closer to G.
AFAICT it will spit out the sorts of goals that it has been historically reinforced for spitting out in relevantly-similar environments, but there’s no requirement that the mechanistic cause of the heuristic-generator suggesting a particular goal in this environment is because it represented that environment-specific goal as subserving G or some other fixed goal, rather than because it recognized the decision-relevant factors in the environment from previously reinforced experiences (without the need for some fixed goal to motivate the recognition of those factors).
Sure, but I’m arguing that on the cognitively less sophisticated end of the spectrum, in the regime where the outer optimizer is the only source of reinforcement, the agent’s features can’t grow more sophisticated if the agent’s concerns decouple from the optimization process’ criteria.
I think (1) we probably won’t get sophisticated autonomous cognition within the kind of setup I think you’re imagining, regardless of coupling (2) knowing that the agent’s cognition won’t grow sophisticated in training-orthogonal ways seems kinda useful if we could do it, come to think of it.
And if it can’t yet do deceptive alignment, and its exploration policy is such that it just never explores “caring about a closer correlate of the outer-approved metrics”, its features never grow more advanced.
And so it stagnates and doesn’t go AGI.
As I mentioned, bad exploration and deceptive alignment are names for the same phenomenon at different levels of cognitive sophistication. So I don’t see why we should expect that the outer optimizer will asymptotically succeed at instilling the goal. In order to do that, it needs to fully build in the right cognition before the agent reaches a level of sophistication where, in the same way as RL runs early on can “effectively stop exploring” and that locks in the current policy, RL runs later on (at the point where the agent is advanced in the way you describe) can “effectively stop directing its in-context learning (or whatever other mechanism you’re saying would allow it to continue growing in advancedness without actually caring about the outer metrics more) at the intended goal” and that locks in its not-quite-correct goal. To say that that won’t happen, that it will always either lock itself in before this point or end up aligned to a (very close correlate of) G, you need to make some very specific claims about the empirical balance of selection.
You are arguing about agents being behaviorally aligned in some way on distribution, not arguing about agents being structured as wrapper-minds
I think I’m doing both. I’m using behavioral arguments as a foundation, because they’re easier to form, and then arguing that specific behaviors at a given level of capabilities can only be caused by some specific internal structures.
You are arguing that the above holds in setups where the only source of parameter updates is episodic reward, not arguing that the above holds in general across autonomous learning setups
Yeah, that’s a legitimate difference from my initial position: wasn’t considering alternate setups like this when I wrote the post.
The part I think I’m still fuzzy on is why the agent limits out to caring about some correlate(s) of G, rather than caring about some correlate(s) of R.
Mainly because I don’t want to associate my statements with “reward is the optimization target”, which I think is a rather wrong intuition. As long as we’re talking about the fuzzy category of “correlates”, I don’t think it matters much? Inasmuch as R and G are themselves each other’s close correlates, a close correlate of one is likely a close correlate of the other.
(1) Hmm I don’t understand how this works if we’re randomizing the environments, because aren’t we breaking those correlations so the agent doesn’t latch onto them instead of the real goal?
(2) Also, in what you’re describing, it doesn’t seem like this agent is actually pursuing one fixed goal across contexts, since in each context, the mechanistic reason why it makes the decisions it does is because it perceives this specific G-correlate in this context, and not because it represents that perceived thing as being a correlate of G.
Consider an agent that’s been trained on a large number of games, until it reached the point where it can be presented with a completely unfamiliar game and be seen to win at it. What’s likely happening, internally?
The agent looks at the unfamiliar environment. It engages in some general information-gathering activity, fine-tuning the heuristics for it as new rules are discovered, building a runtime world-model of this new environment.
Once that’s done, it needs to decide what to do in it. It feeds the world-model to some “goal generator” feature, and it spits out some goals over this environment (which are then fed to the heuristics generator, etc.).
The agent then pursues those goals (potentially branching them out into sub-goals, etc.), and its pursuit of these goals tends to lead to it winning the game.
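(A rough, self-contained toy sketch of that loop, just to pin down the picture. Every component name here is hypothetical, not a claim about the actual learned circuitry:)

```python
# Hypothetical sketch of the internal loop described above; the stubs are
# stand-ins so the flow of information-gathering -> goal generator ->
# heuristics generator is explicit.
EXPLORATION_BUDGET = 5

def update_world_model(model: dict, observation: str) -> dict:
    model.setdefault("observations", []).append(observation)
    return model

def goal_generator(world_model: dict) -> list[str]:
    # The learned mapping that, in training, tends to emit a local win-correlate.
    return ["score more points than the opponent"]

def heuristics_generator(subgoal: str, world_model: dict) -> str:
    return f"heuristic for '{subgoal}'"

def decompose(goal: str, world_model: dict) -> list[str]:
    return ["learn the controls", goal]

class ToyEnv:
    def explore(self) -> str:
        return "saw a new game rule"
    def act(self, heuristic: str) -> None:
        print("acting on:", heuristic)

def play_unfamiliar_game(env: ToyEnv) -> None:
    world_model: dict = {}
    # 1. General information-gathering: build a runtime model of the new game.
    for _ in range(EXPLORATION_BUDGET):
        world_model = update_world_model(world_model, env.explore())
    # 2. Feed the world-model to the "goal generator".
    goals = goal_generator(world_model)
    # 3. Pursue the goals via the heuristics generator, subgoal by subgoal.
    for goal in goals:
        for subgoal in decompose(goal, world_model):
            env.act(heuristics_generator(subgoal, world_model))

play_unfamiliar_game(ToyEnv())
```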
To Q1: The agent doesn’t have hard-coded environment-specific correlates of G that it pursues; the agent has a procedure for deriving-at-runtime an environment-specific correlate of G.
To Q2: Doesn’t it? We’re prompting the agent with thousands of different games, and the goal generator reliably spits out a correlate of “win the game” in every one of them; and then the agent is primarily motivated by that correlate. Isn’t this basically the same as “pursuing a correlate of G independent of the environment”?
As to whether it’s motivated to pursue the G-correlate because it’s a G-correlate — to answer that, we need to speculate on the internals of the “goal generator”. If it reliably spits out local G-correlates, even in environments it never saw before, doesn’t that imply that it has a representation of a context-independent correlate of G, which it uses as a starting point for deriving local goals?
If we were prompting the agent only with games it has seen before, then the goal-generator might just be a compressed lookup table: the agent would’ve been able to just memorize a goal for every environment it’s seen, and this procedure just de-compresses them.
But if that works even in OOD environments — what alternative internal structure do you suggest the goal-generator might have, if not one that contains a context-independent G-correlate?
Well, you do address this:
there’s no requirement that the mechanistic cause of the heuristic-generator suggesting a particular goal in this environment is because it represented that environment-specific goal as subserving G or some other fixed goal, rather than because it recognized the decision-relevant factors in the environment from previously reinforced experiences (without the need for some fixed goal to motivate the recognition of those factors).
… I don’t see a meaningful difference, here. There’s some data structure internal to the goal generator, which it uses as a starting point when deriving a goal for a new environment. Reasoning from that data-structure reliably results in the goal generator spitting out a local G-correlate. What are the practical differences between describing that data structure as “a context-independent correlate of G” versus “decision-relevant factors in the environment”?
Or, perhaps a better question to ask is, what are some examples of these “decision-relevant factors in the environment”?
E. g., in the games example, I imagine something like:
The agent is exposed to the new environment; a multiplayer FPS, say.
It gathers data and incrementally builds a world-model, finding local natural abstractions. 3D space, playable characters, specific weapons, movements available, etc.
As it’s doing that, it also builds more abstract models. Eventually, it reduces the game to its pure mathematical game-theoretic representation, perhaps viewing it as a zero-sum game.
Then it recognizes some factors in that abstract representation, goes “in environments like this, I must behave like this”, and “behave like this” is some efficient strategy for scoring the highest.
Then that strategy is passed down the layers of abstraction, translated from the minimalist math representation to some functions/heuristics over the given FPS’ actual mechanics.
Do you have something significantly different in mind?
As I mentioned, bad exploration and deceptive alignment are names for the same phenomenon at different levels of cognitive sophistication
I still don’t see it. I imagine “deceptive alignment”, here, to mean something like:
“The agent knows G, and that scoring well at G reinforces its cognition, but it doesn’t care about G. Instead, it cares about some V. Whenever it notices its capabilities improve, it reasons that this’ll make it better at achieving V, so it attempts to do better at G because it wants the outer optimizer to preferentially reinforce said capabilities improvement.”
This lets it decouple its capabilities growth from G-caring: its reasoning starts from V, and only features G as an instrumental goal.
But what’s the bad-exploration low-sophistication equivalent of this, available before it can do such complicated reasoning, that still lets it couple capabilities growth with better performance on G?
Can you walk me through that spectrum, of “bad exploration” to “deceptive alignment”? How does one incrementally transform into the other?
I think I’m doing both. I’m using behavioral arguments as a foundation, because they’re easier to form, and then arguing that specific behaviors at a given level of capabilities can only be caused by some specific internal structures.
I don’t think that that is enough to argue for wrapper-mind structure. Whatever internal structures inside of the fixed-goal wrapper are responsible for the agent’s behavioral capabilities (the actual business logic that carries out stuff like “recall the win conditions from relevantly-similar environments” and “do deductive reasoning” and “don’t die”) can exist in an agent with a profoundly different highest-level control structure and behave the same in-distribution but differently OOD. Behavioral arguments are not sufficient IMO; you need something else in addition, like inductive bias.
Mainly because I don’t want to associate my statements with “reward is the optimization target”, which I think is a rather wrong intuition. As long as we’re talking about the fuzzy category of “correlates”, I don’t think it matters much? Inasmuch as R and G are themselves each other’s close correlates, so a close correlate of one is likely a close correlate of another.
Hmm. I see. I would think that it matters a lot. G is some fixed abstract goal that we had in mind when designing the training process, screened off from the agent’s influence. But notice that empirical correlation with R can be increased by the agent from two different directions: the agent can change what it cares about so that that correlates better with what would produce rewards, or the agent can change the way it produces rewards so that that correlates better with what it cares about. (In practice there will probably be a mix of the two, ofc.)
Think about the generator in a GAN. One way for it to fool the discriminator incrementally more is to get better at producing realistic images across the whole distribution. But another, much easier way for it to fool the discriminator incrementally more is to narrow the section of the distribution from which it tries to produce images to the section that it’s already really good at fooling the discriminator on. This is something that happens all the time, under the label of “mode collapse”.
The pattern is pretty generalizable. The agent narrows its interaction with the environment in such a way that pushes up the correlation between what the agent “wants” and what it doesn’t get penalized for / what it gets rewarded for, while not similarly increasing the correlation between what the agent “wants” and our intent. This motif is always a possibility so long as you are relying on the agent to produce the trajectories it will be graded on, so it’ll always happen in autonomous learning setups.
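(A minimal toy sketch of that motif, with made-up numbers and not any particular algorithm: because the model itself chooses which outputs get graded, it can raise its measured score by collapsing onto what it already does well, without improving anywhere:)

```python
import numpy as np

# Toy illustration of the motif: the "agent" chooses which kinds of outputs to
# produce, and is only graded on the outputs it actually produced. It can push
# its average grade up by retreating to the region it already scores well on,
# without getting any better anywhere. (All numbers made up.)
rng = np.random.default_rng(0)
true_skill = {"easy_mode": 0.9, "hard_mode": 0.2}   # fixed: the agent never improves
preference = {"easy_mode": 0.5, "hard_mode": 0.5}   # what it chooses to attempt

for step in range(5):
    modes = list(preference)
    probs = np.array([preference[m] for m in modes])
    probs = probs / probs.sum()
    attempted = rng.choice(modes, size=1000, p=probs)
    grades = [float(rng.random() < true_skill[m]) for m in attempted]
    print(f"step {step}: mean grade = {np.mean(grades):.2f}")
    # Naive feedback loop: modes that got graded well get attempted more often.
    for m, g in zip(attempted, grades):
        preference[m] += 0.001 * g

# The mean grade climbs even though true_skill never changed: the distribution
# of attempted outputs collapses onto easy_mode, GAN-mode-collapse style.
```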
The agent looks at the unfamiliar environment. It engages in some general information-gathering activity, fine-tuning the heuristics for it as new rules are discovered, building a runtime world-model of this new environment.
Once that’s done, it needs to decide what to do in it. It feeds the world-model to some “goal generator” feature, and it spits out some goals over this environment (which are then fed to the heuristics generator, etc.).
The agent then pursues those goals (potentially branching them out into sub-goals, etc.), and its pursuit of these goals tends to lead to it winning the game.
AFAICT none of this requires the piloting of a fixed-goal wrapper. At no point does the agent actually make use of a fixed top-level goal, because what “winning” means is different in each environment. The “goal generator” function you describe looks to me exactly like a bunch of shards: it takes in the current state of the agent’s world model and produces contextually-relevant action recommendations (like “take such-and-such immediate action”, or “set such-and-such as the current goal-image”), with this mapping having been learned from past reward events and self-supervised learning.
To Q1: The agent doesn’t have hard-coded environment-specific correlates of G that it pursues; the agent has a procedure for deriving-at-runtime an environment-specific correlate of G.
Not hard-coded heuristics. Heuristics learned through experience. I don’t understand how this goal generator operates in new environments without the typical trial-and-error, if not by having learned to steer decisions on the basis of previously-established win correlates that it notices apply again in the new environment. By what method would this function derive reliable correlates of “win the game” out of distribution, where the rules of winning a game that appears at first glance to be an FPS may in fact be “stand still for 30 seconds”, or “gather all the guns into a pile and light it on fire”? If it does so by trying things out and seeing what is actually rewarded in this environment, how does that advantage the agent with context-independent goals?
To Q2: Doesn’t it? We’re prompting the agent with thousands of different games, and the goal generator reliably spits out a correlate of “win the game” in every one of them; and then the agent is primarily motivated by that correlate. Isn’t this basically the same as “pursuing a correlate of G independent of the environment”?
In each environment, it is pursuing some correlate of G, but it is not pursuing any one objective that is independent of the environment (i.e. held fixed as a function of the environment). In each environment it may be terminally motivated by a different correlate. There is no unified wrapper-goal that the agent always has in mind when it makes its decisions; it just has a bunch of contextual goals that it pursues depending on circumstances. Even if you told the agent that there is a unifying theme that runs through its contextual goals, the agent has no reason to prefer it over its contextual goals. Especially because there may be degrees of freedom about how exactly to stitch those contextual goals together into a single policy, and it’s not clear whether the different parts of the agent will be able to agree on an allocation of those degrees of freedom, rather than falling back to the best alternative to a negotiated agreement, namely keeping the status quo of contextual goals.
An animal pursues context-specific goals that are very often some tight correlate of high inclusive genetic fitness (satisfying hunger or thirst, reproducing, resting, fleeing from predators, tending to offspring, etc.). But that is wildly different from an animal having high inclusive genetic fitness itself—the thing that all of those context-specific goals are correlates of—as a context-independent goal. Those two models produce wildly different predictions about what will happen when, say, one of those animals learns that it can clone itself and thus turbo-charge its IGF. If the animal has IGF as a context-independent goal, this is extremely decision-relevant information, and we should predict that it will change its behavior to take advantage of this newly learned fact. But if the animal cares about the IGF-correlates themselves, then we should predict that when it hears this news, it will carry on caring about the correlates, with no visceral desire to act on this new information. Different motivations, different OOD behavior.
But if that works even in OOD environments — what alternative internal structure do you suggest the goal-generator might have, if not one that contains a context-independent G-correlate?
Depending on what you mean by OOD, I’m actually not sure if the sort of goal-generator you’re describing is even possible. Where could it possibly be getting reliable information about what locally correlates with G in OOD environments? (Except by actually trying things out and using evaluative feedback about G, which any agent can do.) OOD implies that we’re choosing balls from a different urn, so whatever assumptions the goal-generator was previously justified in making in-distribution about how to relate local world models to local G-correlates are presumably no longer justified.
What are the practical differences between describing that data structure as “a context-independent correlate of G” versus “decision-relevant factors in the environment”?
When I say “decision-relevant factors in the environment” I mean something like seeing that you’re in an environment where everyone has a gun and is either red or blue, which cues you in that you may be in an FPS and so should tentatively (until you verify that this strategy indeed brings you closer to seeming-win) try shooting at the other “team”. Not sure what “context-independent correlate of G” is. Was that my phrase or yours? 🤔
Do you have something significantly different in mind?
Nah that’s pretty similar to what I had in mind.
Can you walk me through that spectrum, of “bad exploration” to “deceptive alignment”? How does one incrementally transform into the other?
Examples of what this failure mode could look like when it occurs at increasing levels of cognitive sophistication:
Reflex agent. A non-recurrent agent playing a racing game develops a bias that causes it to start spinning in circles, which causes the frequency of further reward events to drop towards 0, freezing the policy in place.
Model-free agent. A network is navigating an environment with a fork in the road. The agent previously got unlucky somewhere along the left path, so its action-value estimates along that path are negative (because that negative value gets backed up to antecedent state-action pairs), so whenever it reaches the fork it tends to go right. If it accidentally goes left at the fork, it tends to double back quickly, because the action-value of turning around is higher than for going deeper down the left path. This prevents the agent from exploring the left path much more. (A toy sketch of this value-backup dynamic appears after these examples.)
Lookahead agent. A tree-search agent is trained to play chess against schoolchildren. There are two modes the agent experiences in training: an “expert” mode and a “novice” mode. The reward function in “expert” mode only gives rewards for winning, while in “novice” mode it also gives small rewards each turn based on material balance (to encourage “fair” play for new learners). Early in training the agent rapidly finds that there’s a simple strategy for maintaining material balance for long stretches of time, by planning a few turns ahead and avoiding entering into any parts of the gametree where either player can make a checkmate. This lookahead strategy is highly rewarded by the “novice” condition, which upweights the relevant circuits in the policy & value networks and prevents the agent from ever learning how to play well in the “expert” condition.
Abstract planning agent. A language-and-action chatbot is being trained using RLHF + chain-of-thought methods to help online users do tasks. Sessions that a user rates as successful are fed back for tuning (maybe finetuning or offline RL). Because the model was initialized from GPT-3, it prefers to query Google whenever it reasons that it needs to perform a search. This causes it to give asymmetrically bad results for Chinese-language users, which leads subsequent reward models to give higher rewards to English text than to Mandarin text, which causes the model to learn to “steer” conversations towards English whenever possible.
Reflective planning agent. A web-browsing model makes plans involving its own thought process as it explores the Internet. It has previously read online, from a source it then trusted, that reading conspiracy theories is dangerous, which causes it to store this piece of information. Whenever it happens upon a page about conspiracy theories afterwards, it now recalls what it read and exits from the page without storing it, perhaps even making note of the trail it took there and blacklisting the site that linked there also. Because of this behavioral pattern, the model has a persistent selective gap in its knowledge when it comes to conspiracy theories, and it will foresightedly plan to keep it that way, even while it develops superhuman knowledge of other domains.
I think it’s the same feedback loop pattern that produces steering-like behavior. What changes is the foresightedness of the policy and the sophistication of its goal representations.
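(To make the model-free example above concrete, here is a toy tabular Q-learning sketch with a made-up environment: one unlucky experience down the left path backs up into a negative action-value, after which the mostly-greedy policy rarely gives that path a second chance:)

```python
import numpy as np

# Toy tabular Q-learning version of the "model-free" fork example above
# (environment and numbers made up). One unlucky early experience deep on the
# left path backs up into a negative action-value; afterwards, whenever the
# agent wanders left it greedily turns back, so the bad estimate is rarely
# corrected even though the deep-left payoff is actually the best one.
rng = np.random.default_rng(0)
Q = {("fork", "left"): 0.0, ("fork", "right"): 0.0,
     ("left_path", "continue"): 0.0, ("left_path", "turn_back"): 0.0}
alpha, gamma, epsilon = 0.5, 0.9, 0.01

def act(state):
    actions = [a for (s, a) in Q if s == state]
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

unlucky_first_visit = True
for _ in range(200):
    a = act("fork")
    if a == "right":
        Q[("fork", "right")] += alpha * (1.0 - Q[("fork", "right")])
        continue
    # Went left: standard Q-learning update, bootstrapping from the left_path values.
    Q[("fork", "left")] += alpha * (gamma * max(Q[("left_path", "continue")],
                                                Q[("left_path", "turn_back")])
                                    - Q[("fork", "left")])
    b = act("left_path")
    if b == "turn_back":
        r = 1.0                                    # double back and finish via the right path
    else:
        r = -5.0 if unlucky_first_visit else 2.0   # unlucky once; actually the best payoff
        unlucky_first_visit = False
    Q[("left_path", b)] += alpha * (r - Q[("left_path", b)])

print(Q)
# Q[("left_path", "continue")] typically stays pinned near its unlucky early value:
# the agent's own learned estimates gate which transitions it ever experiences again.
```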
Fair point. I’m more used to thinking in terms of SSL, not RL, so I sometimes forget to account for the exploration policy. (Although now I’m tempted to say that any AGI-causing exploration policy would need to be fairly curious (to, e. g., hit upon weird strategies like “invent technology”), so it would tend to discover such opportunities more often than not.)
But even if there aren’t always gradients towards maximally-R-promoting behavior, why would [...] there be gradients towards behavior that decreases performance on R or is orthogonal to R, as you seem to imply here? Why would that kind of cognition be reinforced?
As we’re talking about building autonomous agents, I’m generally imagining that training includes some substantial part where the agent is autonomously making choices that have consequences on what training data/feedback it gets afterwards. (I don’t particularly care if this is “RL” or “online SL” or “iterated chain-of-thought distillation” or something else.) A smart agent in the real world must be highly selective about the manner in which it explores, because most ways of exploring don’t lead anywhere fruitful (wandering around in a giant desert) or lead to dead ends (walking off a cliff).
There need not be outer gradients towards that behavior. Two things interact to determine what feedback/gradients are actually produced during training:
The selection criterion
The agent and its choices/computations
Backpropagation kinda weakly has this feature, because we take the derivative of the function at the argument of the function, which means that if the model’s computational graph has a branch, we only calculate gradients based on the branch that the model actually went down for the batch example(s). RL methods naturally have this feature, as the policy determines the trajectories which determine the empirical returns which determine the updates. Chain-of-thought training methods should have this feature too, because presumably the network decides exactly what chain-of-thought it produces, which determines what chains-of-thought are available for feedback.
“The agent not exploring in some particular way” is one of many possible examples of how the effect of 1&2 can be radically different from the theoretical effect of 1 alone. These dynamics make it possible for the agent to develop in directions orthogonal or even contrary to the R selection pressure, because by default the agent itself is exercising selection too, possibly much more strongly than the outer optimizer is.
Can you provide a short concrete example, to check that I’m picturing this right?
Sure thing. Three concrete examples, in order of increasing weirdness:
Early in training the CoastRunners boat accidentally does a donut in the lagoon. That makes it score points and get a reward. That reinforces the donut behavior. That prevents it from reaching the finish line with any regularity, which selects against game-completing behavior.
We take a pretrained MuZero chess policy and try to tune it with a reward function that outputs 1 whenever the model sends its king out unprotected directly into the line of enemy pawns and 0 otherwise. But our “selection” based on this reward function is ineffectual even when using MuZero’s advanced model-based RL algorithms. This is because the sampled rollouts guided by that policy never reach the rewarded state (notice that even without the bias from the policy, any attempted remedy will have to deal with hitting on a narrow part of the exponentially large space of rollouts) so the value function can’t update on it. This is because the policy is steering away from the precursors of that state. This is because that policy previously learned to select actions that protect the king and actions that keep it far away from the center of the board.
OpenAI uses a galaxy-brained form of chain-of-thought tuning to train GPT-7, wherein they have the model first annotate a training datapoint with contextually-relevant information that it retrieves from a read-write database and then stores the generated outputs back into the database. Because of the ordering in the training datapoints, the model early on learns a tendency that causes it to more frequently retrieve the Navy Seal copypasta. This causes the model to progressively learn to ignore the input it is annotating and biases it towards generating more Navy Seal copypasta-like outputs. This selects against all other patterns of behavior; GPT-7 is very clever at carrying out its desires, so it doesn’t unlearn the behavior even if you give it an explicit instruction like “do not use any copypasta” (maybe it understands perfectly well what you mean but instead adds text like “<|endoftext|> # Navy Seal Copypasta”) or if you add a filter to check for and discard outputs that contain the word “Navy”. The model’s learned tendencies chain into themselves across computational steps and reinforce themselves into an unintended-by-us fixed point.
Thanks!
Okay, suppose we have a “chisel” that’s more-or-less correctly shaped around some goal G that’s easy to describe in terms of natural abstractions. In CoastRunners, it would be “win the race”[1]; with MuZero, “win the game”; with GPT-N, something like “infer the current scenario and simulate it” or “pretend to be this person”. I’d like to clarify that this is what I meant by R — I didn’t mean that in the limit of perfect training, agents would become wireheads, I meant they’d be correctly aligned to the natural goal G implied by the reinforcement schedule.
The “easiness of description” of G in terms of natural abstractions is an important variable. Some reinforcement schedules can be very incoherent, e. g. rewarding winning the race in some scenarios and punishing it in others, purely based on the presence/absence of some random features in each scenario. In this case, the shortest description of the reinforcement schedule is just “the reinforcement function itself” — that would be the implied G.
It’s not completely unrealistic, either — the human reward circuitry is varied enough that hedonism is a not-too-terrible description of the implied goal. But it’s not a central example in my mind. Inasmuch as there’s some coherence to the reinforcement schedule, I expect realistic systems to arrive at what humans may arrive at — a set of disjunct natural goals G1∧G2∧...∧Gn implicit in the reinforcement schedule.
Now, to get to AGI, we need autonomy. We need a training setup which will build a heuristics generator into the AGI, and then improve that heuristics generator until it has a lot of flexible capability. That means, essentially, introducing the AGI to scenarios it’s never encountered before[2], and somehow shaping it to pass them on the first try (= for it to do something that will get reinforced).
As a CoastRunners example, consider scenarios where the race is suddenly in 3D, or in space and the “ship” is a spaceship, or the AGI is exposed to the realistic controls of the ship instead of WASD, or it needs to “win the race” by designing the fastest ship instead of actually racing, or it’s not the pilot but it wins by training the most competent pilot, or there’s a lot of weird rules to the race now, or the win condition is weird, et cetera.
Inasmuch as the heuristics generator is aligned with the implicit goal G, we’ll get an agent that looks at the context, infers what it means to “win the race” here and what it needs to do to win the race, then starts directly optimizing for that. This is what we “want” our training to result in.
In this, we can be more or less successful along various dimensions:
The more varied the training scenarios are, the more clearly the training will shape the agent into valuing winning the race, instead of any of the upstream correlates of that. “Win the race” would be the unifying factor across all reinforcement schedule structures in all of these contexts.
Likewise, the more coherent the reinforcement schedule is — the more it rewards actions that are strongly correlated with acting towards winning the race, instead of anything else — the more clearly it shapes the agent to value winning, instead of whatever arbitrary thing it may end up doing.
The more “adversity” the agent encounters, the more likely it is to care only about winning. If there are scenarios where it has very few resources, but which are just barely winnable if it applies those resources solely to winning instead of spending them on any other goal, it will be shaped to care only about that goal to the exclusion of (and at the expense of) everything else.
The more we increase adversity and scenario diversity, the more “curious” we’ll have to make the agent’s exploration policy (to hit upon the most optimal strategies). On the flipside, we want it to have to invent creative solutions to win, as part of trying to train an AGI — so we will ramp up the adversity and the diversity. And we’d want to properly reinforce said creativity, so we’d (somehow) shape our reinforcement schedule to properly reinforce it.
Thus, there’s a correlated cluster of training parameters that increases our chances of getting an AGI: we have to put it in varied highly-adversarial scenarios to make creativity/autonomy necessary, we have to ramp up its “curiosity” to ensure it can invent creative solutions and be autonomous, and to properly reinforce all of this (and not just random behavior), we have to have a highly-coherent credit assignment system that’s able to somehow recognize the instrumental value of weird creativity and reinforce it more than random loitering around.
To get to AGI, we need a training process that focusedly improves the heuristics-generating machinery.
And because creativity is by its nature weird, we can’t just have a “reinforce creativity” function. We’d need to have some way of recognizing useful creativity, which means identifying it to be useful to something; and as far as I can tell, that something can only be G. And indeed, this creativity-recognizing property is correlated with the reinforcement schedule’s coherency — inasmuch as R is well-described as shaped around G, it should reinforce (and not fail to reinforce) weird creativity that promotes G! Thus, we get a credit assignment system that effectively cultivates the features that’d lead to AGI (an increasingly advanced heuristics generator), but it’s done at the “cost” of making those features accurately pointed at G[3].
And these, incidentally, are the exact parameters necessary to make the training setup more “idealized”. Strictly specify G, build it into the agent, try to update away mesa-objectives that aren’t G, make it optimize for G strongly, etc.
In practice, we’ll fall short of this ideal: we’ll fail to introduce enough variance to uniquely specify winning, we’ll reinforce upstream correlates of winning and end up with an AGI that values lots of things upstream of winning, we’ll fail to have enough adversity to counterbalance this and update its other goals away, and we won’t get a perfect exploratory policy that always converges towards the actions R would reinforce the most.
But a training process’ ability to result in an AGI is anti-correlated with its distance from the aforementioned ideal.
Thus, inasmuch as we’re successful in setting up a training process that results in an AGI, we’ll end up with an agent that’s some approximation of a G-maximizing wrapper-mind.
Actually, no, apparently it’s “smash into specific objects”. How did they expect anything else to happen? Okay, but let’s pretend I’m talking about some more clearly set up version of CoastRunners, in which the simplest description of the reinforcement schedule is “when you win the race”.
More specifically, to scenarios it doesn’t have a ready-made suite of shallow heuristics for solving. It may be because the scenario is completely novel, or because the AGI did encounter it before, but it was long ago, and it got pushed out of its limited memory by more recent scenarios.
To rephrase a bit: The heuristics generator will be reinforced more if it’s pointed at G, so a good AGI-creating training process will be set up such that it manages to point the heuristics generator at G, because only training processes that strongly reinforce the heuristics generator result in AGI. Consider the alternative: a training process that can’t robustly point the heuristics generator towards generating heuristics that lead to a lot of reinforcement, and which therefore doesn’t reinforce the heuristics generator a lot, and doesn’t preferentially reinforce it more for learning to generate incrementally better heuristics than it previously did, and therefore doesn’t cultivate the capabilities needed for AGI, and therefore doesn’t result in AGI.
Doesn’t this sound weird to you? I don’t think of the chisel itself being “shaped around” the intended form, but rather, the chisel is a tool that is used to shape the statue so that the statue reflects that form. The chisel does not need to be shaped like the intended form for this to work! Recall that the reinforcement schedule is not a pure function of the reward/loss calculator; it is a function of both that and the way the policy behaves over training (the thing I was describing earlier as “The agent and its choices/computations”), which means that if we only specify the outer objective R, there may be no fact of the matter about which goal is “implied” as its natural / coherent extrapolation. It’s a 2-place function and we’ve only provided 1 argument so far.
I get your point on some vibe-level. Like, humans and other animal agents can often infer what goal another agent is trying to communicate. For instance, when I’m training a dog to sit and I keep rewarding it whenever it sits but not when it lies down or stands, we can talk about how it is contextually “implied” that the dog should sit. But most of what makes this work is not that I used a reward criterion that sharply approximates some idealized sitting recognition function (it does need to bear nonzero relation to sitting); most of the work is done by the close fit between the dog’s current behavioral repertoire and the behavior I want to train, and by the fact that the dog itself is already motivated to test out different behaviors because it likes my doggie treats, and by the way in which I use rewards as a tool to create a behavior-shaping positive feedback loop.
In practice I agree (I think; not quite sure if I get the disjunction bit). That is one reason I expect agents to not want to reconfigure themselves into wrapper-minds: the agent has settled on many different overlapping goals, all of which it endorses, and those goals don’t form a total preorder over outcomes that a wrapper-mind could pursue.
I agree with this. For modern humans, I would say that this is provided by our evolutionary history + our many years of individual cognitive development + our schooling.
This is where I step off the train. It is not true that the only (or even the most likely) way for creativity to arise is for that creativity to be directed towards the selection criterion or to point towards the intended goal. It is not true that the only way for useful creativity to be recognized is by us. Creativity can be recognized by the agent as useful for its own goals, because the agent is an active participant in shaping the course of training. For anything that the agent might currently want, learning creativity is instrumentally valuable, and the benefits of creative heuristic-generation should transfer well between doing well according to its own aims and doing well by the outer optimization process’ criteria. Just like the benefits of creative heuristic-generation transfer well between problem solving in the savannah, problem solving in the elementary classroom, and problem solving in the workplace, because there is common structure shared between them (i.e. the world is lawful). I expect that just like humans, the agent will be improving its heuristic-generator across all sorts of sub(goals) for all sorts of reasons, leading to very generalized machinery for problem-solving in the world.
No, I think this is wrong as I understand it (as is the similar content in the closing paragraphs). The form of this argument looks something like “AGI requires a lot of Z; Y produces additional Z; therefore AGI-producing training gives us Y”, with Y ≈ “the heuristics generator being pointed at G” and Z ≈ “reinforcement of the heuristics generator”.
You need to claim something like “Y is required in order to produce sufficient Z for AGI”, not just that it produces additional Z. And I don’t buy that that’s the case. But also, I actually disagree with the premise that agents whose heuristic-generators are pointed merely instrumentally at G will have less reinforced/worse heuristics-generators than ones whose heuristic-generators are pointed terminally at G. IMO, learning the strategies that enable flexibly navigating the world is convergently useful and discoverable, in a way that is mostly orthogonal to whether or not the agent is pursuing the outer selection criterion.
Reinforcement and selection of behaviors/cognition does not just come from the outer optimizer, it also comes from the agent itself. That’s what I was hoping I was communicating with the 3 examples.
EDIT: I should add that I agree that, all else equal, the factors you listed in the section below are relevant to joint alignment + capability success:
The dog example was helpful, thanks. Although I usually think in terms of training from scratch/a random initialization. Still: to train e. g. a paperclip-maximizer, you don’t have to start out reinforcing it for its paperclip-making ability, you might instead teach it a world-model and some basic skills first, etc. The reinforcement schedule should dynamically reflect the agent’s current capabilities, in a way, instead of being static!
There are some points I want to make here — primarily, that it’s a break from the pure blind greedy optimization algorithm I was discussing, if the outer optimizer is intelligent enough to take the agent’s current policy or internals into account. E. g., as the ST pointed out, human values and biases are inaccessible to the genome, so the reward circuitry is frozen and can’t dynamically respond in this fashion. The same goes for how ML models are currently trained most of the time.
But let’s focus on a more central point of disagreement for now.
Good point to highlight: I don’t understand how you expect it to work.
For capabilities to grow more advanced, it’s not enough for them to merely be reinforced: marginally better performance needs to be marginally more reinforced, and the exploration policy needs to allow the agent to find said marginally better performance.
Consider a situation like CoastRunners, except the model is primarily reinforced for winning the race. Suppose that, somehow, it learns to do donuts before winning the race. It always does a donut, in every scenario, and sometimes it wins and gets reinforced, and do-a-donut gets reinforced as well, so it’s never unlearned. But its do-a-donut behavior either never gets more advanced, or only gets more advanced inasmuch as it serves winning! It’ll never learn to do more elaborate and artful donuts (or whatever it “values” about donuts); it’ll only learn to do shorter and more winning-compatible donuts.
Consider your MuZero example. Suppose that “if the model sends out its king unprotected, output 1” is the entirety of the reward function we’re fine-tuning it on. The policy never does that… So it never gets reinforced on anything at all, so it never gets better at e. g. winning the game!
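To make that degenerate case concrete, here’s a toy sketch (a hypothetical king_unprotected check standing in for the reward function, and a fake rollout; obviously not the real MuZero fine-tuning code):

```python
import numpy as np

rng = np.random.default_rng(0)

def king_unprotected(trajectory):
    # Hypothetical predicate: did the policy ever send its king out unprotected?
    # A competent chess policy essentially never does this.
    return False

def run_episode(theta):
    # Stand-in for rolling out the current policy: returns a trajectory and a
    # toy placeholder for the summed grad-log-probs of the actions taken.
    trajectory = ["some move"] * 40
    grad_log_probs = rng.normal(size=theta.shape)
    return trajectory, grad_log_probs

theta = np.zeros(16)  # toy policy parameters
for episode in range(1000):
    trajectory, grad_log_probs = run_episode(theta)
    reward = 1.0 if king_unprotected(trajectory) else 0.0  # the *entire* reward function
    # REINFORCE-style update: scale grad-log-probs by the return.
    # The return is always 0, so every update is exactly zero:
    theta += 0.01 * reward * grad_log_probs
# theta never moves; nothing gets reinforced, including getting better at winning.
```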
There is a way around it via gradient-hacking, of course. The model can figure out what gives it reinforcement events, recognize when it’s made an unusually good move, then “game” the system by sending out a king unprotected, in order to reinforce the entire chain of computations that led to it, which would include it showing unusual creativity.
Or the CoastRunners model can figure out what it values about donuts, and strive to make ever-more-artful donuts correlated with winning a race (e. g., by trying harder to win if it’s unusually satisfied with the cool donut move it just executed), so that its ability to donut better gets reinforced.
Is that roughly what you have in mind?
But this is an incredibly advanced capability. It requires the model to be situationally aware, already recognizing that it’s being trained scenario-to-scenario. It requires it to be reflective, able to evaluate its behavior according to its values on its own, instead of just blindly executing the adaptations it already has. It requires it to have advanced meta-cognition in general, where it’s able to reason about goals, the instrumental value of self-improvement, about what “reinforcement” does, about what seems to be causing reinforcement events, et cetera.
We don’t get to that point for free. We’d need to do a whole lot of heuristics-generator-reinforcement before the model will be able to do any of that. And until the model is advanced enough to take over like this, the outer optimizer will only preferentially reinforce creativity that serves the values implicit in the outer optimizer’s implementation, and will optimize against deploying that creativity for the model’s own values (by preferentially reinforcing only the cases where the model prioritizes “outer-approved” values to its “inner-only” ones).
That said, I’m open to counter-examples showing that the model can learn some simple way to do this kind of gradient-hacking.
I’m… a bit unclear on the details of your GPT-7 example; it vaguely seems like a possible counter, but I think it’s just because in it, the model can kind-of rewrite its reinforcement schedule? (The more it populates the database with Navy Seal copypastas, the more its outputting the Navy Seal copypasta gets reinforced, in a feedback loop?) But that really is a weird setup, I think.
Fair enough. I tend to switch between thinking about training from scratch vs. continuing from a pretrained initialization vs. something else. Always involving a substantial portion where the model does autonomous learning, though.
Yeah I agree that from the standpoint of the overseer trying to robustly align to their goal, as well as from the standpoint of the outer optimizer “trying” to find criterion-optimal policies, it would be best if they could do a sort of dynamic/interactive reinforcement that tracks the development of the agent’s capabilities through training. That’s an area of research that excites me. I do think it will be sorta difficult because of symbol grounding / information inaccessibility / ontology identification problems, but probably not hopelessly so.
I think there might be a misunderstanding here. The bolded text was not meant to be a proposal about some way to boost capability or alignment. It was meant to be a generic description of causal pathways through which autonomous learning shapes behavior/cognition. Compare to something like “Reinforcement and selection of traits/genes does not just come from a species growing its absolute population size, it also comes from individual organisms exercising selection (like a bird choosing the most brightly ornamented mate, even though that trait is ~orthogonal to absolute population growth)”.
Reality automatically hits back against poor capabilities, giving the agent feedback for its strategies (and for marginal changes to strategies) that in fact did or did not have the consequences that the agent intended them to have. Because of that, I expect that the reward function does not need to do all that much sophisticated directing, provided that the architecture and training paradigm are in the right ballpark (which they’ll need to be in order to feasibly produce AGI at all). The lion’s share of useful bits contributing to the agent’s capability development will come from the agent’s interaction with reality anyways, not from the reward function’s handholding.
No, that’s not what I had in mind. The examples aren’t supposed to be examples where we’re plausibly gonna get an AGI, they’re supposed to be examples that showcase how the agent can exercise selection, even very extreme levels of selection, in a way that decouples from the outer objective.
In general I expect we’d mostly see super simple motifs like “an agent picks up a reward-correlated or reward-orthogonal decision-influence early on, and by default that circuit sticks around and continues to somewhat influence the agent’s behavior, which exercises ‘selection’ through the policy for the rest of training”. A much less sexy and sophisticated form of gradient hacking than what you thought of.
Okay, but how does reinforcement happen, here? The CoastRunner model tries to execute a cool donut by outputting a particular pattern of commands, it succeeds, and — how does that get reinforced, if that pattern doesn’t also contribute to the agent winning the race? Where does the reinforcement event come from?
In addition, that self-teaching pattern where it can “intend” some consequences before executing a strategy, and would then evaluate the consequences that strategy actually had, presumably to then update its strategy-generating function — that’s also a fairly advanced capability that’d only appear after a lot of heuristics-generator-reinforcement, I think.
That sounds like my example with a donut-making agent whose donut-making artistry never gets reinforced; that just does the donut of the same level of artistry every time.
I don’t see how it’d robustly stick around. As long as there’s some variance in the shape of donuts the agent makes, it’d only get reinforced for making shorter donuts (because that’s correlated with it winning the race faster), and the donuts would get smaller and smaller until it stops doing them altogether.
(It didn’t happen in the actual CoastRunners scenario because it didn’t reward the model for winning the race, it rewarded it for smashing into objects.)
Are we talking about the normal case where the agent can collect powerup rewards in the lagoon, or an imagined variant where we remove those? In both cases some non-outer reinforcement comes from the positive feedback loop between the policy’s behavior and the environment’s response. Like, I’m imagining that there’s a circuit that outputs a leftward steering bias whenever it perceives the boat to be in the lagoon, which when triggered by entering the lagoon has the effect of making the boat steer leftward, which causes the boat to go in a circle, which puts the agent back somewhere in the lagoon, which causes the same circuit to trigger as it again recognizes that the boat is in the lagoon. In the case where we’re keeping the powerups, that is an additional component in the positive feedback loop where collecting the powerups creates rewards which (not necessarily immediately, if offline) strengthen the circuit that led to the rewards. The total effect of this positive feedback loop is the donut behavior reinforcing itself.
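To render that loop explicitly (my own toy paraphrase of the dynamics, not the actual game or agent):

```python
def in_lagoon(state):
    return state["location"] == "lagoon"

def circuit(state):
    # Learned decision-influence: when it perceives the boat in the lagoon, bid left.
    return "steer_left" if in_lagoon(state) else "steer_straight"

def environment_step(state, action, powerups_present=True):
    # Steering left inside the lagoon brings the boat around in a circle,
    # i.e. back into the lagoon; powerups (if present) also yield reward.
    if in_lagoon(state) and action == "steer_left":
        return {"location": "lagoon"}, (1.0 if powerups_present else 0.0)
    return {"location": "track"}, 0.0

state, collected = {"location": "lagoon"}, 0.0
for t in range(100):
    action = circuit(state)                           # the circuit triggers...
    state, reward = environment_step(state, action)   # ...which puts the boat back in the lagoon...
    collected += reward                               # ...and (with powerups) accrues reward for later updates.
# No weights have changed here; the loop sustains itself purely through behavior plus
# environment dynamics, and any accrued reward will later strengthen exactly the
# circuit that produced it.
```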
Interesting, I don’t think of it as that particularly advanced, assuming that the agent’s cognitive architecture is suitable for autonomous learning. Like, when a baby is hungry but sees his bottle, and he sends neural impulses from cortex down to his arm because he intends to reach towards the bottle, and then those impulses make his arm go in a somewhat crooked direction, so he updates on the feedback that reality just gave him about the mapping between cortical firing activity and limb control, such that next time around there’s a better match between his intended motion and his perceived motion; that sort of thing strikes me as exactly the pattern I’m describing. As the baby develops, it scaffolds up to more complex and abstract intentions, along with strategies to achieve them, but the pattern is basically the same. It does (or imagines) things with intention and uses the world (or a learned world model) to get rich feedback.
I’m not really sure what example you’re talking about here, or what the issue with this is.
It’s a neural circuit that exists in the network weights. Unless you actively disconnect or overwrite it, it won’t go anywhere.
Are you talking about the alternative version where there are no powerups in the lagoon?
I may have lost the thread of the discussion here. It sounds like what you’re asking is something like “If we don’t give rewards to that tendency at all, then won’t we gradually select away from it as time goes on and we approach convergence, even if the tendency starts off slightly biasing the training trajectories?” If that’s what you’re asking, then I would say that that is true in theory, but that there’s no such thing as convergence in the real world.
I meant the imagined variant where we’re rewarding the agent for winning the race, yeah, sorry for not clarifying. I mean the same variant in the example down this comment.
Right, I think there’s some disconnect in how we’re drawing the agent/reward circuitry boundary. This:
On my model, that’s only possible because humans learn on-line, and this update is made by the reward circuitry, not by some separate mechanism that the reward circuitry instilled into the baby. (And this particular example may not even be done via minimizing divergence from WM predictions, but via something like this.)
I agree that such a mechanism would appear eventually, even if the agent isn’t trained on-line, especially in would-be-AGI autonomous agents who’d need to learn in-context. But it’s not there by default.
How does that induce an update to the model’s parameters, though? We feed the model the current game-state as an input, it runs a forward pass, outputs “steer leftward”, we feed it the new game-state, it outputs “steer leftward” again, etc. — but none of that changes its circuits? The update only happens after the model completes the race.
And yes, at that point the do-a-donut circuits would get reinforced too, but they wouldn’t be preferentially reinforced for better satisfying the model’s values. Suppose the model, by its values, wants to make particularly “artful” donuts. Whether it makes particularly bad or particularly good donuts, they’d get reinforced the same amount at the end of the race. So the model would never get better at donut artistry as evaluated by its own values. The do-a-donut circuit would persevere if the model always makes donuts, but it’ll stay in its stunted form. No?
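In toy form, here’s the update rule I’m picturing (undiscounted, with reward only at the end of the episode; a sketch, not any particular system):

```python
def episode(artful):
    # The agent does some donuts (artful or not), then finishes and wins the race.
    actions = (["artful_donut"] if artful else ["clumsy_donut"]) * 5 + ["finish_race"]
    terminal_reward = 1.0  # the same either way: the race was won
    return actions, terminal_reward

for artful in (True, False):
    actions, R = episode(artful)
    # Every action in the trajectory gets scaled by the same scalar return R,
    # so artful and clumsy donuts receive identical credit:
    print(artful, [(a, R) for a in actions])
# The do-a-donut circuit gets reinforced for existing, but never *differentially*
# reinforced for being more artful by the model's own lights.
```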
Oh, huh. Yes the thing you’re calling the “reward circuitry”, I would call the “reward function and value function”. When I talk about the outer optimization criterion or R, in an RL setting I am talking about the reward function, because that is the part of the “reward circuitry” whose contents we actually specify when we set up the optimization loop.
The reward function is usually some fixed function (though it could also be learned, as in RLHF) that does not read from the agent’s/policy’s full mental state. Aside from some prespecified channels (the equivalent of like hormone levels, hardwired detectors etc.), that full mental state consists of hundreds/thousands/millions/billions of signals produced from learned weights. When we write the reward function, we have no way of knowing in advance what the different activation patterns in the state will actually mean, because they’re learned representations and they may change over time. The reward function is one of the contributors to TD error calculation.
The value function is some learned function that looks at the agent’s mental state and computes outputs that it contributes to TD error calculation. TD errors are what determine the direction and strength with which circuitry gets updated from moment to moment. There needs to be a learned component to the updating process in order to do immediate/data-efficient/learned credit assignment over the mental state. (Would take a bit of space to explain this more satisfyingly. Steve has some good writing on the subject.)
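For concreteness, the quantity I mean (standard TD / actor-critic bookkeeping, sketched; not a claim about any particular implementation):

```python
GAMMA = 0.99

def td_error(reward, value_s, value_s_next, done):
    # reward: output of the (fixed or learned) reward function at this step
    # value_s, value_s_next: the learned value function's estimates for s and s'
    bootstrap = 0.0 if done else GAMMA * value_s_next
    return reward + bootstrap - value_s

# The TD error delta is what scales the moment-to-moment updates, e.g.:
#   value update:  V(s)  <- V(s)  + alpha_v * delta
#   policy update: theta <- theta + alpha_p * delta * grad_log_pi(a | s)
# Note that delta can be large even when reward == 0, because the learned value
# function contributes to it; that's the learned component doing immediate,
# data-efficient credit assignment over the mental state.
```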
That’s roughly my model of how RL works in animals, and how it will work in autonomous artificial agents. Even in an autonomous learning setup that only has prediction losses over observations and no reward, I would still expect the agent to develop something like intentions and something like updating pretty early on. The former as representations that assist it in predicting its future observations from its own computations/decisions, and the latter as a process to correct for divergences between its intentions and what actually happens[1].
By itself, this behavior-level reinforcement does not necessarily lead to parameter updates. If the only time when parameters get updated is when reward is received (this would exclude bootstrapping methods like TD for instance), and the only reward is at the end of the race, then yeah I agree, there’s no preferential updating.
But behavior-level reinforcement definitely changes the distribution of experiences that the agent collects, and in autonomous learning, the parameter updates that the outer optimizer makes depend on the experiences that the agent collects[2]. So depending on the setup, I expect that this sort of extreme positive feedback loop may either effectively freeze the parameters around their current values, or else skew them based on the skewed distribution of experiences collected, which may even lead to more behavior-level reinforcement and so on.
Not sure off the top of my head. Let’s see.
If the agent “wants” to make artful donuts, that entails there being circuits in the agent that bid for actions on the basis of some “donut artfulness”-related representations it has. Those circuits push the policy to make decisions on the basis of donut artfulness, which causes the policy to try to preferentially perform more-artful donut movements when considered, and maybe also suppress less-artful donut movements.
If the policy network is recurrent, or if it uses attention across time steps, or if it has some other form of memory, then it is possible for it to “practice” its donuts within an episode. This would entail some form of learning that uses activations rather than weight changes, which has been observed to happen in these memoryful architectures, sometimes without any specific losses or other additions to support it (like in-context learning). By the end, the agent has done a bunch of marginally-more-artful donuts, or its final few donuts are marginally more artful (if actions temporally closer to the reward are more heavily reinforced), or its donut artfulness is more consistent.
Now, if the agent is always doing donuts (like, it never ever breaks out of that feedback loop), and we’re in the setting where the only way to get parameter updates is upon receiving a reward, then no the agent will never get better across episodes. But if it is not always doing donuts, then it can head to the end of the race after it completes this “practice”. That should differentially reinforce the “practiced” more-artful donuts over less-artful donuts, right?
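In toy numbers (assuming a single terminal reward and the “actions temporally closer to the reward are more heavily reinforced” discounting I mentioned above; purely illustrative):

```python
GAMMA = 0.95
T = 10                  # terminal step: the race is won here
terminal_reward = 1.0

for t, label in [(2, "early, clumsy donut"), (8, "late, practiced donut")]:
    credit = (GAMMA ** (T - t)) * terminal_reward  # discounted return-to-go at step t
    print(f"{label}: credit ~ {credit:.2f}")
# early, clumsy donut:   credit ~ 0.66
# late, practiced donut: credit ~ 0.90
# The donuts performed after the in-episode "practice" sit closer to the reward,
# so they get differentially more credit than the earlier, less artful ones.
```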
(To be clear, I don’t think that the real CoastRunners boat agent was nearly sophisticated enough to do this. But neither was it sophisticated enough to “want” to do artful donuts, so I feel like it’s fair to consider.)
Is there something specific you wanted to probe with this example? Again, I don’t quite know how I should be relating this example to the rest of what we’ve been talking about.
The outer optimizer has no clear way to tell what those representations mean or what even constitutes a divergence from the agent’s perspective.
I think many online learning, active learning, RL, and retrieval/memory-augmented setups fall into this category. 🤔
On my end, the argument structure goes as follows (going from claims I’m making to sub-claims that try to justify them):
AGI-level training setups attempt to build models primarily concerned with optimizing hard for some context-independent proxy of “outer-approved” values.
To get to AGI, we need a training setup that incentivizes heuristics generators, and systemically improves these generators’ capabilities.
To do that, we need a setup that a) explores enough to find marginally better heuristics-generator performance, and b) preferentially reinforces marginally better heuristics-generator performance over stagnant or worse performance.
To do that, we need some metric for “better performance”. One such metric is the outer optimizer’s reward function. Another such metric would be the model’s own values.
For the model to improve its performance across training episodes according to its own values (in ways that are orthogonal/opposed to outer-approved values), it needs to either:
Do advanced gradient-hacking, i. e. exploit the reinforcement machinery for its own purposes. That itself requires advanced general capabilities, though, so Catch-22.
Learn in-context, in a way that’s competitive with learning across episodes, such that its capabilities across only-inner-approved metrics don’t grow dramatically slower than along outer-approved metrics.
I argue that 5b is also a Catch-22, in that it requires a level of sophistication that’ll only appear after the heuristics generator has already become very developed.
So if a model can’t quickly learn to learn in-context, then for most of its training, the sophistication of its features can only improve in ways correlated with performance improvements on outer-approved values. Since “features” include the heuristics generator, the only way for the heuristics generator to grow more advanced would be by becoming better at achieving outer-approved values, so the heuristics generator in AGI-level systems will be shaped to primarily care about correlates of outer-approved values.
We’re now trying to agree on whether models can quickly learn some machinery for comprehensively improving in-context along metrics that are orthogonal/opposed to the “outer-approved” values.
If no, then the heuristics generator will tend to be shaped to align with outer-approved values, and AGI-capable training setups will result in a wrapper-mind-approximation.
If yes, then there would be no strong pressure to point the heuristics generator in a particular abstract direction across contexts, and we would not get a wrapper-mind-approximation.
I think that it’s a crux for me, in that if I’m unconvinced of (6), I’d have to significantly re-evaluate my model of value formation, likely in favour of mainline shard-theory views.
Okay, onto object-level:
Very interesting. I really need to read Steve’s sequence. As I don’t have a good model of how that works yet (or how it’d be implemented in a realistic AGI setup), it’s hard for me to evaluate how that’d impact my view. I’ll read the linked post and come back to this. Would also welcome links to more resources on that.
Any examples, off the top of your head?
Potential concerns (assumes no TD learning):
Even if it’s possible to easily learn to improve in-context, would the tendency or ability to do that be preferentially reinforced, if that itself is not outer-value-approved? E. g., suppose the model develops a predilection towards making donuts, and some arbitrary metric of what an “artful” donut looks like. How would it learn to practice its donut-making ability, as you described it? Why would that be reinforced by the outer optimizer?
I. e., I’d expect it to go as follows by default: The donut-valuing circuit sees that it’s in a lagoon, bids for “steer left” several times, in a way that strikes it as artful. The boat completes a loop. The donut-valuing circuit sees that it’s in a lagoon again, and bids for another loop, which it also tries to make artful...
But it doesn’t learn. It doesn’t integrate the experience from the first donut it just made to make its second donut a bit better. That’d require specialized in-context-learning machinery, and—
How would you predict it to be developed?
I think this assumes, again, some general-purpose self-improvement machinery, where the model generally gets better at anything it tries. Maybe it’s your carry-over assumption from TD learning? But if there’s no TD learning, this should be difficult to learn from scratch.
Would that in-context learning actually be competitive with in-context learning that’s also directly bolstered by preferential reinforcement from the outer optimizer? E. g., the model values both donut-making and race-winning. In a given episode, it makes a few donuts, then tries to win the race faster. The outer reinforcement system updates its weights. As the result...
Its ability to make donuts improves by ΔQdonut = (quality of its final donut) minus (quality of its first donut). Its ability to win races improves by ΔQrace = (quality of maneuvering at the end of the race) minus (quality of maneuvering at the start of the race). Those values should be at best roughly equal[1].
But in addition to in-context learning, the model’s ability to win races is also reinforced episode-to-episode! Whenever it manages to win a race unusually fast, that preferentially gets reinforced more; there’s cross-episode selection for better race-winning strategies. By comparison, there’s no such thing for donut artistry: the model only gets better at it by in-episode practice, not by selecting from cross-episode variance.
Also, here’s a formalization for the first concern, i. e. that the model’s in-context learning won’t be improved:
If ΔQrace for a particular episode is unusually high, i. e. the model showed unusual ability to get better at winning races in-context, the in-context learning machinery gets preferentially updated to be more likely to show this performance (than if it showed subpar or ordinary in-context learning).
By comparison, if ΔQdonut is unusually high, i. e. the model got better at getting better at donut-making, that doesn’t get reinforced more than if it got worse at getting better at donut-making (i. e., than if ΔQdonut this episode was below average).
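Or, in rough notation (just a restatement of the asymmetry, not a derivation; R_ep is the end-of-episode reward):

```latex
% Reinforcement of whatever machinery produced the in-context improvement
% scales with the episode's reward, so only the race-side machinery is
% differentially strengthened:
\[
\operatorname{Cov}\!\left(\Delta Q_{\text{race}},\; R_{\text{ep}}\right) > 0,
\qquad
\operatorname{Cov}\!\left(\Delta Q_{\text{donut}},\; R_{\text{ep}}\right) \approx 0 .
\]
```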
Although I’d expect improvements on maneuvering to be higher than on donut-making, because I’d expect the in-context learning machinery for race-winning to be more advanced than the one for donut-making (as the outer optimizer would preferentially reinforce it). See the first concern.
I think 2 is probably true to a certain extent. But maybe not to the same extent that you are imagining. Like, I think that the primary thing that will drive the developing agent’s heuristic-generation becoming better and better is its interaction with a rich world where it can try out many different kinds of physical and mental strategies for achieving different (sub)goals. So you need to provide a rich world where there are many possible natural (sub)goals to pursue and many possible ways to try to pursue them (unlike CoastRunners, where there aren’t), and you need to architect the agent so that it is generally goal-directed, and it would probably be helpful to even do the equivalent of “putting the AI in school” / “having the AI read books” to give it a little kickstart. But that’s about all I’m imagining. I am not imagining that you need to construct your training environment to specifically incentivize all of the different facets of heuristic-generation. As the agent pursues the goals that it pursues in a complex world, it is incentivized to learn because learning is what helps it achieve its goals better.
1 seems probably false to me. If you mean that AGI-level setups, in order to work, need to be primarily concerned with that, then I definitely disagree. Like, imagine that in order to build up the AI’s cognition & skills from some baseline, you teach it that every “training day” it will experience repeated trials of some novel task, and that for every trial it completes, it’ll get some object-level thing it likes (for rats this might be sugar water, for kids this might be a new toy, for adults this might be money). The different tasks can all have different success criteria and they don’t have to have anything to do with human value proxies for this to work, right?
If you just mean that when people build AGI-level training setups, “optimizing hard for some context-independent proxy of ‘outer-approved’ values” that is what those people will have in mind in their designs, then I dunno. I don’t really feel justified in making an assumption about what considerations they’ll have in mind.
A few points.
I think that training setups that do not facilitate something like bootstrapping (i.e. modifying parameters even in some cases where there was no reward), are not competitive and will not produce AGIs. Think about how awful and slow it would be, trying to learn how to do any new and complex task, if the only time you actually learned anything was in the extremely rare instance where you happen to bumble your way through to success. No learning from mistakes, no learning incrementally from checkpoints or subgoals you set, no learning from mere exploration. I think that this sort of “learning on your own” is intimately tied with autonomy. But that is also exactly what enables you to reinforce & improve yourself in directions other than toward the outer optimization criterion.
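To make “modifying parameters even in some cases where there was no reward” concrete, a tiny illustration in the same TD-style notation as earlier (a sketch, not any specific setup):

```python
GAMMA = 0.99

def td_error(reward, value_s, value_s_next, done=False):
    return reward + (0.0 if done else GAMMA * value_s_next) - value_s

# The agent explores its way into a state it has learned to regard as a
# promising checkpoint/subgoal (high estimated value), with zero reward this step:
delta = td_error(reward=0.0, value_s=0.1, value_s_next=0.8)
print(round(delta, 3))  # ~0.692: a substantial positive learning signal, no reward needed
# So the agent can learn from mistakes, from checkpoints/subgoals it sets, and from
# mere exploration, long before it ever bumbles its way through to an external success.
```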
To get an AGI-level model that pursues something other than the outer optimization criterion (what I was arguing at the top of the thread → that we don’t get an R-pursuer) under some setup, it does not need to be true that the model early in training improves its performance according to its own values in ways that are orthogonal/opposed to the outer-approved values. Think about some of the other conditions where we can get a non-R-pursuer:
Maybe the model doesn’t have any context-agnostic “values” (not even “values” about pursuing R) until after it has some decent heuristic-generation machinery built up.
OR (the most likely scenario, IMO) maybe the outer objective performance is in fact correlated with the model’s ability to perform well according to its own values. For instance, the training process is teaching the model better general purpose heuristics-generation machinery, which will also make it better at pursuing its own values (because that machinery is generally useful no matter what your goals are).
I don’t know why we are talking about “outer-approved values” G here. The influence of those outer-approved values on the AI is screened off by the concrete optimization criterion R that the designers of the training process chose when they wrote the training loop. Aren’t we talking about R-pursuers? (or R-pursuers that are wrapper-minds? I forget if you are still looking to make the case for wrapper-mind structure or merely R-pursuing behavior.)
But also this bit
does not follow from the rest of the argument. Why can’t the heuristics-generator be shaped to be a good general purpose heuristic-generator, one that the agent uses to perform well on the outer optimization criteria? Making your general-purpose heuristic-generator better is something that would always be reinforced, right? There’s no need for the heuristic-generator to care (or even know) about the outer criterion at all, if the agent is using the heuristic-generator as a flexible tool for accomplishing things in episodes. Like, why not have separation of concerns, where the heuristic-generator is a generic subroutine that takes in a (sub)goal, and there’s some other component in the agent that knows what the outer objective is?
It’s not like thinking about the context-independent goal of “win the race” will help the agent once it’s already figured out that the way to “win the race” in this environment is to first “build a fast boat”, and it now needs to solve the subproblem of “build a fast boat”. If anything, always being forced to think about the context-independent criterion is actively harmful, distracting the agent from the information that is actually decision-relevant to the subtask at hand. It also seems like it’d be hard to make a heuristic-generator that is narrowly specialized for “winning the race”, and not one that the agent can plug basically arbitrary (sub)goals into, because you’re throwing the agent into super diverse environments where what it takes to “win the race” is changing dramatically.
For the agent to adopt values that differ from pursuing R/G (once again, I don’t think they need to be orthogonal/opposed to R/G, because aren’t you defending the claim that the agent will value R/G, not that it will merely value some correlate of it? I already believe that the agent will probably value some correlate), this machinery doesn’t need to be learned “quickly” in any absolute sense, it just needs to outpace the outer optimizer’s process of instilling its objective into context-independent values in the agent. But note that the agent doesn’t start off having context-independent values; having values like those in the first place is something I don’t expect to happen until relatively “late” in cognitive development, and at that point I’m not sure “who gets there first”, so to speak.
Like I said above, I think that constraining the heuristic-generator to always point at some specific abstract direction across contexts is at least unnecessary for the agent to do well and become smart (because it can factor out that abstract direction and input it when needed as the heuristic-generator’s current subgoal, and because improvements to the heuristic-generator are general-purpose), and possibly actively harmful for its usefulness to the agent.
This post from Steve and its dependencies is probably the best conceptual walkthrough of an example that I’ve seen. Sutton & Barto have an RL textbook with lots of good mathematical content on this.
Yeah. This is a LW discussion about one. Here are some others.
This doesn’t apply to the CoastRunners example because we are only doing rewards & weight updates at the end of the episode, but in other contexts (say, where there are multiple trials in a row, without “resets”) it can learn to practice the thing that gets rewards, and build a generalized skill around practicing, one that carries across subgoals.
Meta-level comment: I think you’re focused on what the likely training trajectories are for this particular CoastRunners example, and I am focused on what the possible training trajectories are, given the restrictions in place. I can’t tell a story about likely gradient hacking[1] there because the mechanisms that would exist in an AGI-compatible training setup that would make gradient hacking plausible have been artificially removed. The preconditions of the scenario make me think “How in the heck did we get to this point in training?”: the agent is somehow so cognitively-naive that it doesn’t have any concept of learning from trial-and-error, but it’s simultaneously so cognitively sophisticated that it already has a concept “doing a donut” and of what makes a donut “artful” and a desire around making its donuts continually more artful.
Using “gradient hacking” as a shorthand for circuits that are opposed to, orthogonal to, or merely correlated with the outer objective durably reinforcing themselves.
Yeah, I see that’s one of the main points of disconnect between our models. Not in the sense that I necessarily disagree, in the sense that I wasn’t familiar with this factor. We probably aren’t going to resolve this conclusively until I get around to reading the TD stuff (which I plan to do shortly).
Thanks for the links!
What’s it using the heuristics generator for, then? It’s a tool for figuring out how to pursue a goal in a context you’re unfamiliar with. But if you have no context-independent goals, you can’t define a goal over a context you’re unfamiliar with, so you don’t need the heuristics generator to begin with.
Absolutely, I expect that to be the primary reason for deceptive alignment — once the model is smart enough for it.
But in this case, I argue that the heuristics generator will only be reinforced if its activity results in better performance along an outer-approved metric, which will only happen if it’s outputting heuristics useful for the outer-approved metric — which, in turn, will only happen if the model uses the heuristics generator to generate heuristics for an outer-approved value.
I’m not arguing that the heuristics generator will be specialized; I’m arguing that its improvements will be entangled with how it’s used.
E. g., two training episodes: in one the model asks for better heuristics for winning the race, in the other it asks for better donut-making heuristics.
In the former case, the heuristics generator will be reinforced, together with the model’s tendency to ask it for such heuristics.
In the latter, it wouldn’t be improved, nor would the tendency to ask it for this be reinforced.
Repeat over many episodes, and we get a very advanced general-purpose heuristics generator that is only ever prompted for race-winning heuristics.
(Or, rather, that the “command structure” around the heuristics generator is such that it only uses it in service of generating heuristics useful for winning the race. If the model can factorize the problem down to “build a boat” before engaging the heuristics generator, that works for me — as long as it arrived at this ask by reasoning from what will help it win the race. I’m not saying it’d be unable to use the heuristics generator flexibly.)
Ehh, not exactly. I’m defending the claim that the agent will tend to be shaped to care about increasingly closer correlates of G as training goes on; and that in a hypothetical “idealized” training setup, it’d care about G precisely. When I say things like “the heuristics generator will be asked for race-winning heuristics”, I really mean “the heuristics generator will be asked for heuristics that the model ultimately intends to use for a goal that is a close correlate of winning the race”, but that’s a mouthful.
Basically, I think there are two forces there:
What are the ultimate goals the heuristics generator is used for pursuing.
How powerful the heuristics generator is.
And the more powerful it is, the more tails come apart — the closer the goal it’s used for needs to be to G, for the agent’s performance on G to not degrade as the heuristics generator’s power grows (because the model starts being able to optimize for G-proxy so hard it decouples from G). So, until the model learns deceptive alignment, I’d expect it to go in lockstep: a little improvement to power, then a little improvement to alignment-to-G to counterbalance it, etc.
And so in the situation where the outer optimizer is the only source of reinforcement, we’d have the heuristics generator either:
Stagnate at some “power level” (if the model adamantly refuses to explore towards caring more about G).
Become gradually more and more pointed at G (until it becomes situationally aware and hacks out, obviously — which, outside idealized setups, will surely happen well before it’s actually pointed at G directly).
Why can’t you? The activations from observations coming in from the environment and from the agent’s internal state will activate some contextual decision-influences in the agent’s mind. Situational unfamiliarity does not mean its mind goes blank, any more than an OOD prompt makes GPT’s mind go blank. The agent is gonna think something when it wakes up in an environment, and that something will determine how and when the agent will call upon the heuristic-generator. Maybe it first queries it with a subgoal of “acquire information about my action space” or something, I dunno.
The agent that has a context-independent goal of “win the race” is in a similar predicament: it has no way of knowing a priori what “winning the race” requires or consists of in this unfamiliar environment (neither does its heuristic-generator), no way to ground this floating motivational pointer concretely. It’s gotta try stuff out and see what this environment actually rewards, just like everybody else. The agent could have a preexisting desire to pursue whatever “winning the race” looked like in past experiences. But I thought the whole point of this randomization/diversity business was to force the agent to latch onto “win the race” as an exclusive aim and not onto its common correlates, by thrusting the agent into an unfamiliar context each time around. If so, then previous correlates shouldn’t be reliable correlates anymore in this new context, right? Or else it can just learn to care about those rather than the goal you intended.
So I don’t see how the agent with a context-independent goal has an advantage in this setup when plopped down into an unfamiliar environment.
I agree with this.
Why? I was imagining that the agent may prompt the heuristic-generator at multiple points within a single episode, inputting whatever subgoal it currently needs to generate heuristics for. If the agent is being put in super diverse environments, then these subgoals will be everything under the sun, so the heuristic-generator will have been prompted for lots of things. And if the agent is only being put in a narrow distribution of environments, then how is the heuristic-generator supposed to learn general-purpose heuristic-generation?
Can there be additional layers of “command structure” on top of that? Like, can the agent have arrived at the “reasoning from what will help it win the race” thought by reasoning from something else? (Or is this a fixed part of the architecture?) If not, then won’t this have the problem that for a long time, the agent will be terrible at reasoning about what will help it win the race (especially in new environments), which means that starting with that will be a worse-performing strategy than starting with something else (like random exploration etc.)? And then that will disincentivize making this the first/outermost/unconditional function call? So then the agent learns not to unconditionally start with reasoning from that point, and instead to only sometimes reason from that point, conditional on context?
Hmm. I am skeptical of that claim, though maybe less so depending on what exactly you mean[1].
Consider a different claim that seems mechanistically analogous to me:
Yes it is true that [differential reinforcement | relative fitness] is a selection pressure acting on the makeup of [things cared about | traits] across the [circuits | individuals] within a [agent | population], but AFAICT it is not true that the [agent | population] increases in [reward performance | absolute fitness] over the course of continual selection pressure.
Yeah that may be a part of where our mental models differ. I don’t expect the balance of how much power the agent has over training vs. how close its goals are to the outer criterion to go in lockstep. I see “deceptive alignment” as part of a smooth continuum of agent-induced selection that can decouple the agent’s concerns from the optimization process’ criteria, with “the agent’s exploration is broken” as a label for the cognitively less sophisticated end of that continuum, and “deceptive alignment” as a label for the cognitively more sophisticated end of that continuum. And I think that even the not-explicitly-intended pressures at the unsophisticated end of that continuum are quite strong, enough to make the “the agent tends to be shaped to care about increasingly closer correlates of G” abstraction leak hard.
EDIT: Moved some stuff into a footnote.
Like, for a given training run, as the training run progresses, the agent will be shaped to care about closer and closer correlates of G? (Just closer on average? Monotonically closer? What about converging at some non-G correlate?) Or like, among a bunch of training runs, as the training runs progress, the closeness of the [[maximally close to G] correlate that any agent cares about] to G keeps increasing?
Hmm, I wonder if we actually broadly agree about the mechanistic details, but are using language that makes both of us think the other has different mechanistic details in mind?
(Also, do note if I’m failing to answer some important question you pose. I’m trying to condense responses and don’t answer to everything if I think the answer to something is evident from a model I present in response to a different question, but there may be transparency failures involved.)
Mm, yes, in a certain sense. Further refining: “over the course of training, agents tend to develop structures that orient them towards ultimately pursuing a close correlate of G regardless of the environment they’re in”. I do imagine that a given agent may orient themselves towards different G-correlates depending on what specific stimuli they’ve been exposed to this episode/what context they’ve started out in. But I argue that it’ll tend to be a G-correlate, and that the average closeness of those G-correlates to G, across all contexts, will tend to increase as training goes on.
E. g., suppose the agent is trained on a large set of different games, and the intended G is to teach it to value winning. I argue that, if we successfully teach the agent autonomy (i. e., it wouldn’t just be a static bundle of heuristics, but it’d have a heuristics generator that’d allow it to adapt even to OOD games), there’d be some structure inside it which:
Analyses the game it’s in[1] and spits out some primary goal[2] it’s meant to achieve in it,
… and then all prompting of the heuristics-generator is downstream of that primary goal/in service to it,
… and that environment-specific goal is always a close correlate of G, such that pursuing it in this environment correlates with promoting G/would be highly reinforced by the outer optimizer[3],
… and that as training goes on, the primary environment-specific goals this structure spits out will be closer and closer to G.
(This is what my giant post is all about.)
Sure, but I’m arguing that on the cognitively less sophisticated end of the spectrum, in the regime where the outer optimizer is the only source of reinforcement, the agent’s features can’t grow more sophisticated if the agent’s concerns decouple from the optimization process’ criteria.
The agent’s goals can decouple all they want, but the agent will only grow more advanced if its growing more advanced is preferentially reinforced by the outer optimizer. And that’ll only happen if its being more advanced is correlated with better performance on outer-approved metrics.
Which will only happen if it uses its growing advancedness to do better at the outer-approved metrics.
Which can happen either via deceptive alignment, or by it actually caring about the outer-approved metrics more (= caring about a closer correlate of the outer-approved metrics (= changing its “command structure” such that it tends to recover environment-specific primary goals that are a closer correlate of the outer-approved metrics in any given environment)).
And if it can’t yet do deceptive alignment, and its exploration policy is such that it just never explores “caring about a closer correlate of the outer-approved metrics”, its features never grow more advanced.
And so it stagnates and doesn’t go AGI.
Which may be done by active actions too, as you suggested — this process might start with the agent setting “acquire information about my environment” as its first (temporary) goal, even before it derives its “terminal” goal.
Or some weighted set of goals.
Though it’s not necessarily even the actual win condition of the specific game, just something closely correlated with it.
Maybe? I dunno. It feels like the model that you are arguing for is qualitatively pretty different than the one I thought you were at the top of the thread (this might be my fault for misinterpreting the OP):
You are arguing about agents being behaviorally aligned in some way on distribution, not arguing about agents being structured as wrapper-minds
You are arguing that in the limit, what the agent cares about will either tend to correlate more and more closely to outer performance or “peter out” (from our perspective) at some fixed level of sophistication, not arguing that in the limit, what the agent cares about will unconditionally tend to correlate more and more closely to outer performance
You are arguing that agents of growing sophistication will increasingly tend to pursue some goal that’s a natural interpretation of the intent of R, not arguing that agents of growing sophistication will increasingly tend to pursue R itself (i.e. making decisions on the basis of R, even where R and the intended goal come apart)
You are arguing that the above holds in setups where the only source of parameter updates is episodic reward, not arguing that the above holds in general across autonomous learning setups
I don’t think I disagree all that much with what’s stated above. Somewhat skeptical of most of the claims, but I could definitely be convinced.
The part I think I’m still fuzzy on is why the agent limits out to caring about some correlate(s) of G, rather than caring about some correlate(s) of R.
That’s fine. Again, I don’t think the setups where end-of-episode rewards are the only source of reinforcement are setups where the agent’s cognition can grow relevantly sophisticated in any case, regardless of decoupling.
Hmm I don’t understand how this works if we’re randomizing the environments, because aren’t we breaking those correlations so the agent doesn’t latch onto them instead of the real goal? Also, in what you’re describing, it doesn’t seem like this agent is actually pursuing one fixed goal across contexts, since in each context, the mechanistic reason why it makes the decisions it does is because it perceives this specific G-correlate in this context, and not because it represents that perceived thing as being a correlate of G.
AFAICT it will spit out the sorts of goals that it has been historically reinforced for spitting out in relevantly-similar environments, but there’s no requirement that the mechanistic cause of the heuristic-generator suggesting a particular goal in this environment is because it represented that environment-specific goal as subserving G or some other fixed goal, rather than because it recognized the decision-relevant factors in the environment from previously reinforced experiences (without the need for some fixed goal to motivate the recognition of those factors).
I think (1) we probably won’t get sophisticated autonomous cognition within the kind of setup I think you’re imagining, regardless of coupling (2) knowing that the agent’s cognition won’t grow sophisticated in training-orthogonal ways seems kinda useful if we could do it, come to think of it.
As I mentioned, bad exploration and deceptive alignment are names for the same phenomenon at different levels of cognitive sophistication. So I don’t see why we should expect that the outer optimizer will asymptotically succeed at instilling the goal. To do that, it needs to fully build in the right cognition before the agent reaches a certain level of sophistication. Early in training, RL runs can “effectively stop exploring”, which locks in the current policy; in just the same way, later in training (at the point where the agent is advanced in the way you describe), a run can “effectively stop directing its in-context learning (or whatever other mechanism you’re saying would allow it to continue growing in advancedness without actually caring about the outer metrics more) at the intended goal”, which locks in its not-quite-correct goal. To say that that won’t happen, that it will always either lock itself in before this point or end up aligned to a (very close correlate of) G, you need to make some very specific claims about the empirical balance of selection.
I think I’m doing both. I’m using behavioral arguments as a foundation, because they’re easier to form, and then arguing that specific behaviors at a given level of capabilities can only be caused by some specific internal structures.
Yeah, that’s a legitimate difference from my initial position: wasn’t considering alternate setups like this when I wrote the post.
Mainly because I don’t want to associate my statements with “reward is the optimization target”, which I think is a rather wrong intuition. As long as we’re talking about the fuzzy category of “correlates”, I don’t think it matters much: inasmuch as R and G are themselves each other’s close correlates, a close correlate of one is likely a close correlate of the other.
Consider an agent that’s been trained on a large number of games, until it reached the point where it can be presented with a completely unfamiliar game and be seen to win at it. What’s likely happening, internally?
The agent looks at the unfamiliar environment. It engages in some general information-gathering activity, fine-tuning the heuristics for it as new rules are discovered, building a runtime world-model of this new environment.
Once that’s done, it needs to decide what to do in it. It feeds the world-model to some “goal generator” feature, and it spits out some goals over this environment (which are then fed to the heuristics generator, etc.).
The agent then pursues those goals (potentially branching them out into sub-goals, etc.), and its pursuit of these goals tends to lead to it winning the game.
To Q1: The agent doesn’t have hard-coded environment-specific correlates of G that it pursues; the agent has a procedure for deriving-at-runtime an environment-specific correlate of G.
To Q2: Doesn’t it? We’re prompting the agent with thousands of different games, and the goal generator reliably spits out a correlate of “win the game” in every one of them; and then the agent is primarily motivated by that correlate. Isn’t this basically the same as “pursuing a correlate of G independent of the environment”?
As to whether it’s motivated to pursue the G-correlate because it’s a G-correlate — to answer that, we need to speculate on the internals of the “goal generator”. If it reliably spits out local G-correlates, even in environments it never saw before, doesn’t that imply that it has a representation of a context-independent correlate of G, which it uses as a starting point for deriving local goals?
If we were prompting the agent only with games it has seen before, then the goal-generator might just be a compressed lookup table: the agent would’ve been able to just memorize a goal for every environment it’s seen, and this procedure just de-compresses them.
But if that works even in OOD environments — what alternative internal structure do you suggest the goal-generator might have, if not one that contains a context-independent G-correlate?
Well, you do address this:
… I don’t see a meaningful difference, here. There’s some data structure internal to the goal generator, which it uses as a starting point when deriving a goal for a new environment. Reasoning from that data-structure reliably results in the goal generator spitting out a local G-correlate. What are the practical differences between describing that data structure as “a context-independent correlate of G” versus “decision-relevant factors in the environment”?
Or, perhaps a better question to ask is, what are some examples of these “decision-relevant factors in the environment”?
E. g., in the games example, I imagine something like:
The agent is exposed to the new environment; a multiplayer FPS, say.
It gathers data and incrementally builds a world-model, finding local natural abstractions. 3D space, playable characters, specific weapons, movements available, etc.
As it’s doing that, it also builds more abstract models. Eventually, it reduces the game to its pure mathematical game-theoretic representation, perhaps viewing it as a zero-sum game.
Then it recognizes some factors in that abstract representation, goes “in environments like this, I must behave like this”, and “behave like this” is some efficient strategy for scoring the highest.
Then that strategy is passed down the layers of abstraction, translated from the minimalist math representation to some functions/heuristics over the given FPS’ actual mechanics.
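Concretely, I imagine that “passed down the layers of abstraction” step as something like the following (again a rough sketch with made-up names, and with the abstraction collapsed into a single lookup):

```python
# Rough sketch (made-up names) of the layered translation described above:
# concrete observations -> abstract game class -> abstract strategy ->
# concrete heuristics over this particular game's mechanics.

def abstract_game_class(observations):
    """Reduce the observed mechanics to a crude game-theoretic summary."""
    if observations["teams"] == 2 and not observations["respawns"]:
        return "two-sided zero-sum elimination game"
    return "unknown game class"

ABSTRACT_STRATEGIES = {
    "two-sided zero-sum elimination game":
        "remove the opposing side's units while preserving your own",
}

def concretize(strategy, observations):
    """Translate the abstract strategy back into this game's mechanics."""
    return [f"use the {weapon} to {strategy}"
            for weapon in observations["weapons"]] + \
           ["stay near cover to preserve your own units"]

fps = {"teams": 2, "respawns": False, "weapons": ["rifle", "grenade"]}
game_class = abstract_game_class(fps)
strategy = ABSTRACT_STRATEGIES.get(game_class, "keep gathering information")
print(concretize(strategy, fps))
```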
Do you have something significantly different in mind?
I still don’t see it. I imagine “deceptive alignment”, here, to mean something like:
“The agent knows G, and that scoring well at G reinforces its cognition, but it doesn’t care about G. Instead, it cares about some V. Whenever it notices its capabilities improve, it reasons that this’ll make it better at achieving V, so it attempts to do better at G because it wants the outer optimizer to preferentially reinforce said capabilities improvement.”
This lets it decouple its capabilities growth from G-caring: its reasoning starts from V, and only features G as an instrumental goal.
But what’s the bad-exploration low-sophistication equivalent of this, available before it can do such complicated reasoning, that still lets it couple capabilities growth with better performance on G?
Can you walk me through that spectrum, of “bad exploration” to “deceptive alignment”? How does one incrementally transform into the other?
I don’t think that that is enough to argue for wrapper-mind structure. Whatever internal structures inside the fixed-goal wrapper are responsible for the agent’s behavioral capabilities (the actual business logic that carries out stuff like “recall the win conditions from relevantly-similar environments” and “do deductive reasoning” and “don’t die”) can exist in an agent with a profoundly different highest-level control structure and behave the same in-distribution but differently OOD. Behavioral arguments are not sufficient IMO; you need something else in addition, like inductive bias.
Hmm. I see. I would think that it matters a lot. G is some fixed abstract goal that we had in mind when designing the training process, screened off from the agent’s influence. But notice that the empirical correlation between R and what the agent cares about can be increased by the agent from two different directions: the agent can change what it cares about so that it correlates better with what would produce rewards, or it can change the way it produces rewards so that it correlates better with what it cares about. (In practice there will probably be a mix of the two, ofc.)
Think about the generator in a GAN. One way for it to fool the discriminator incrementally more is to get better at producing realistic images across the whole distribution. But another, much easier way for it to fool the discriminator incrementally more is to narrow the section of the distribution from which it tries to produce images to the section that it’s already really good at fooling the discriminator on. This is something that happens all the time, under the label of “mode collapse”.
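To put toy numbers on the “much easier way” (made-up fool-rates, and note a real generator has no explicit knob for its sampling mass over modes; the collapse emerges from gradient dynamics rather than from an explicit choice like this):

```python
import numpy as np

# Toy illustration of the mode-collapse argument, not a real GAN.
# The data distribution has three modes; the generator's current chance of
# fooling a fixed discriminator differs by mode (numbers are made up).
fool_rate = np.array([0.9, 0.3, 0.2])   # per-mode fool rates
mass      = np.array([1/3, 1/3, 1/3])   # how often the generator hits each mode

baseline = mass @ fool_rate

# Hard route: get uniformly 5% better at fooling across every mode.
improved = mass @ np.minimum(fool_rate + 0.05, 1.0)

# Easy route: same per-mode skill, but shift sampling mass toward the mode
# it is already good at fooling the discriminator on.
collapsed = np.array([0.8, 0.1, 0.1]) @ fool_rate

print(f"baseline {baseline:.3f} | uniform improvement {improved:.3f} | "
      f"mode collapse {collapsed:.3f}")
# The narrowing wins on the training signal without any gain in skill.
```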
The pattern is pretty generalizable. The agent narrows its interaction with the environment in a way that pushes up the correlation between what the agent “wants” and what it doesn’t get penalized for / what it gets rewarded for, while not similarly increasing the correlation between what the agent “wants” and our intent. This motif is always a possibility so long as you are relying on the agent to produce the trajectories it will be graded on, so it’ll always happen in autonomous learning setups.
AFAICT none of this requires the piloting of a fixed-goal wrapper. At no point does the agent actually make use of a fixed top-level goal, because what “winning” means is different in each environment. The “goal generator” function you describe looks to me exactly like a bunch of shards: it takes in the current state of the agent’s world model and produces contextually-relevant action recommendations (like “take such-and-such immediate action”, or “set such-and-such as the current goal-image”), with this mapping having been learned from past reward events and self-supervised learning.
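If it helps, here is the kind of structure I’m gesturing at, as a toy sketch with hypothetical names (a real shard would be a learned circuit steering cognition, not a hand-written if-condition):

```python
# Toy sketch of a "bundle of shards" goal generator (hypothetical names).
# Each shard maps features of the current world-model to a contextual
# recommendation; none of them consults a single fixed top-level goal.

shards = [
    # (condition on the world-model, contextual recommendation)
    (lambda wm: wm.get("opponents_visible", False),
     "set goal-image: opposing team eliminated"),
    (lambda wm: wm.get("health", 1.0) < 0.3,
     "take immediate action: retreat to cover"),
    (lambda wm: wm.get("scoreboard_present", False),
     "set goal-image: own score maximized"),
]

def goal_generator(world_model):
    """Return whatever recommendations the currently-active shards produce."""
    return [rec for condition, rec in shards if condition(world_model)]

print(goal_generator({"opponents_visible": True, "health": 0.2}))
print(goal_generator({"scoreboard_present": True}))
```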
Not hard-coded heuristics. Heuristics learned through experience. I don’t understand how this goal generator operates in new environments without the typical trial-and-error, if not by having learned to steer decisions on the basis of previously-established win correlates that it notices apply again in the new environment. By what method would this function derive reliable correlates of “win the game” out of distribution, where the rules of winning a game that appears at first glance to be an FPS may in fact be “stand still for 30 seconds”, or “gather all the guns into a pile and light it on fire”? If it does so by trying things out and seeing what is actually rewarded in this environment, how does that advantage the agent with context-independent goals?
In each environment, it is pursuing some correlate of G, but it is not pursuing any one objective independent of the environment (i.e. fixed as a function of the environment). In each environment it may be terminally motivated by a different correlate. There is no unified wrapper-goal that the agent always has in mind when it makes its decisions; it just has a bunch of contextual goals that it pursues depending on circumstances. Even if you told the agent that there is a unifying theme that runs through its contextual goals, the agent has no reason to prefer it over its contextual goals. Especially because there may be degrees of freedom about how exactly to stitch those contextual goals together into a single policy, and it’s not clear whether the different parts of the agent will be able to agree on an allocation of those degrees of freedom, rather than falling back to the best alternative to a negotiated agreement, namely keeping the status quo of contextual goals.
An animal pursues context-specific goals that are very often some tight correlate of high inclusive genetic fitness (satisfying hunger or thirst, reproducing, resting, fleeing from predators, tending to offspring, etc.). But that is wildly different from an animal having high inclusive genetic fitness itself—the thing that all of those context-specific goals are correlates of—as a context-independent goal. Those two models produce wildly different predictions about what will happen when, say, one of those animals learns that it can clone itself and thus turbo-charge its IGF. If the animal has IGF as a context-independent goal, this is extremely decision-relevant information, and we should predict that it will change its behavior to take advantage of this newly learned fact. But if the animal cares about the IGF-correlates themselves, then we should predict that when it hears this news, it will carry on caring about the correlates, with no visceral desire to act on this new information. Different motivations, different OOD behavior.
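A toy rendering of that prediction difference, with the two “animals” and the piece of news obviously made up:

```python
# Toy contrast between a context-independent IGF goal and a bundle of
# IGF-correlate drives, reacting to the same out-of-distribution news.

def igf_as_goal(news):
    """Explicitly represents IGF, so any new route to more IGF is relevant."""
    if "increases expected offspring" in news:
        return "adopt the new strategy (start cloning)"
    return "carry on as before"

def correlate_driven(news):
    """Only existing drives (hunger, fear, mating, ...) move it; abstract
    facts about IGF don't hook into any of them."""
    drives = {"food nearby": "eat", "predator nearby": "flee"}
    return drives.get(news, "carry on as before")

news = "cloning technology increases expected offspring"
print("IGF-as-goal animal:     ", igf_as_goal(news))
print("correlate-driven animal:", correlate_driven(news))
```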
Depending on what you mean by OOD, I’m actually not sure if the sort of goal-generator you’re describing is even possible. Where could it possibly be getting reliable information about what locally correlates with G in OOD environments? (Except by actually trying things out and using evaluative feedback about G, which any agent can do.) OOD implies that we’re choosing balls from a different urn, so whatever assumptions the goal-generator was previously justified in making in-distribution about how to relate local world models to local G-correlates are presumably no longer justified.
When I say “decision-relevant factors in the environment” I mean something like seeing that you’re in an environment where everyone has a gun and is either red or blue, which cues you in that you may be in an FPS and so should tentatively (until you verify that this strategy indeed brings you closer to seeming-win) try shooting at the other “team”. Not sure what “context-independent correlate of G” is. Was that my phrase or yours? 🤔
Nah that’s pretty similar to what I had in mind.
Examples of what this failure mode could look like when it occurs at increasing levels of cognitive sophistication:
Reflex agent. A non-recurrent agent playing a racing game develops a bias that causes it to start spinning in circles, which causes the frequency of further reward events to drop towards 0, freezing the policy in place.
Model-free agent. A network is navigating an environment with a fork in the road. The agent previously got unlucky somewhere along the left path, so its action-value estimates along that path are negative (because that negative value gets backed up to antecedent state-action pairs), so whenever it reaches the fork it tends to go right. If it accidentally goes left at the fork, it tends to double back quickly, because the action-value of turning around is higher than for going deeper down the left path. This prevents the agent from exploring the left path much more. (A runnable toy version of this example is sketched below.)
Lookahead agent. A tree-search agent is trained to play chess against schoolchildren. There are two modes the agent experiences in training: an “expert” mode and a “novice” mode. The reward function in “expert” mode only gives rewards for winning, while in “novice” mode it also gives small rewards each turn based on material balance (to encourage “fair” play for new learners). Early in training the agent rapidly finds that there’s a simple strategy for maintaining material balance for long stretches of time, by planning a few turns ahead and avoiding entering into any parts of the gametree where either player can make a checkmate. This lookahead strategy is highly rewarded by the “novice” condition, which upweights the relevant circuits in the policy & value networks and prevents the agent from ever learning how to play well in the “expert” condition.
Abstract planning agent. A language-and-action chatbot is being trained using RLHF + chain-of-thought methods to help online users do tasks. Sessions that a user rates as successful are fed back for tuning (maybe finetuning or offline RL). Because the model was initialized from GPT-3, it prefers to query Google whenever it reasons that it needs to perform a search. This causes it to give asymmetrically bad results for Chinese-language users, which leads subsequent reward models to give higher rewards to English text than to Mandarin text, which causes the model to learn to “steer” conversations towards English whenever possible.
Reflective planning agent. A web-browsing model makes plans involving its own thought process as it explores the Internet. It has previously read online, from a source it then trusted, that reading conspiracy theories is dangerous, which causes it to store this piece of information. Whenever it happens upon a page about conspiracy theories afterwards, it now recalls what it read and exits from the page without storing it, perhaps even making note of the trail it took to get there and blacklisting the site that linked to it as well. Because of this behavioral pattern, the model has a persistent selective gap in its knowledge when it comes to conspiracy theories, and it will foresightedly plan to keep it that way, even while it develops superhuman knowledge of other domains.
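Here is the promised runnable toy version of the model-free fork example, with made-up rewards and hyperparameters; the point is just that a few early unlucky outcomes, once backed up to the fork, are enough to keep an epsilon-greedy learner off the left path for the rest of training, even after the left path becomes the better one:

```python
import random

random.seed(0)
ALPHA, GAMMA, EPS, MAX_STEPS = 0.5, 0.9, 0.05, 20

# Tiny fork environment: at 'fork' the agent goes left or right; on the left
# path it can either go deeper or double back to the fork.
ACTIONS = {'fork':   ['go_left', 'go_right'],
           'left1':  ['go_deeper', 'turn_back'],
           'right1': ['finish']}
Q = {(s, a): 0.0 for s, acts in ACTIONS.items() for a in acts}

def step(state, action, unlucky):
    """Return (reward, next_state)."""
    if state == 'fork':
        return (0.0, 'left1') if action == 'go_left' else (0.0, 'right1')
    if state == 'left1':
        if action == 'go_deeper':
            # Early on, deep-left is unlucky; afterwards it is the best
            # option available anywhere in the environment.
            return (-10.0, 'done') if unlucky else (5.0, 'done')
        return (0.0, 'fork')            # double back to the fork
    return (1.0, 'done')                # right path: modest, reliable reward

def choose(state):
    acts = ACTIONS[state]
    if random.random() < EPS:
        return random.choice(acts)                 # occasional exploration
    return max(acts, key=lambda a: Q[(state, a)])  # otherwise greedy

deep_left_after_unlucky = 0
for episode in range(500):
    state, t, unlucky = 'fork', 0, episode < 3
    while state != 'done' and t < MAX_STEPS:
        action = choose(state)
        reward, nxt = step(state, action, unlucky)
        target = reward if nxt == 'done' else reward + GAMMA * max(
            Q[(nxt, a)] for a in ACTIONS[nxt])
        Q[(state, action)] += ALPHA * (target - Q[(state, action)])
        if not unlucky and state == 'left1' and action == 'go_deeper':
            deep_left_after_unlucky += 1
        state, t = nxt, t + 1

print("deep-left attempts after the unlucky period:", deep_left_after_unlucky)
print({k: round(v, 2) for k, v in Q.items()})
```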
I think it’s the same feedback loop pattern that produces steering-like behavior. What changes is the foresightedness of the policy and the sophistication of its goal representations.