At some level I agree with this post—policies learned by RL are probably not purely described as optimizing anything. I also agree that an alignment strategy might try to exploit the suboptimality of gradient descent, and indeed this is one of the major points of discussion amongst people working on alignment in practice at ML labs.
However, I’m confused or skeptical about the particular deviations you are discussing and I suspect I disagree with or misunderstand this post.
As you suggest, in deep RL we typically use gradient descent to find policies that achieve a lot of reward (typically updating the policy based on an estimator for the gradient of the reward).
If you have a system with a sophisticated understanding of the world, then cognitive policies like “select actions that I expect would lead to reward” will tend to outperform policies like “try to complete the task,” and so I usually expect them to be selected by gradient descent over time. (Or we could be more precise and think about little fragments of policies, but I don’t think it changes anything I say here.)
It seems to me like you are saying that you think gradient descent will fail to find such policies because it is greedy and local, e.g. if the agent isn’t thinking about how much reward it will receive then gradient descent will never learn policies that depend on thinking about reward.
(Though I’m not clear on how much you are talking about the suboptimality of SGD, vs the fact that optimal policies themselves do not explicitly represent or pursue reward given that complex stews of heuristics may be faster or simpler. And it also seems plausible you are talking about something else entirely.)
I generally agree that gradient descent won’t find optimal policies. But I don’t understand the particular kinds of failures you are imagining or why you think they change the bottom line for the alignment problem. That is, it seems like you have some specific take on ways in which gradient descent is suboptimal and therefore how you should reason differently about “optimum of loss function” from “local optimum found by gradient descent” (since you are saying that thinking about “optimum of loss function” is systematically misleading). But I don’t understand the specific failures you have in mind or even why you think you can identify this kind of specific failure.
As an example, at the level of informal discussion in this post I’m not sure why you aren’t surprised that GPT-3 ever thinks about the meaning of words rather than simply thinking about statistical associations between words (after all if it isn’t yet thinking about the meaning of words, how would gradient descent find the behavior of starting to think about meanings of words?).
One possible distinction is that you are talking about exploration difficulty rather than other non-convexities. But I don’t think I would buy that—task completion and reward are not synonymous even for the intended behavior, unless we take some extraordinary pains to provide “perfect” reward signals. So it seems like no exploration is needed, and we are really talking about optimization difficulties for SGD on supervised problems.
The main concrete thing you say in this post is that humans don’t seem to optimize reward. I want to make two observations about that:
Humans do not appear to be purely RL agents trained with some intrinsic reward function. There seems to be a lot of other stuff going on in human brains too. So observing that humans don’t pursue reward doesn’t seem very informative to me. You may disagree with this claim about human brains, but at best I think this is a conjecture you are making. (I believe this would be a contrarian take within psychology or cognitive science, which would mostly say that there is considerable complexity in human behavior.) It would also be kind of surprising a priori—evolution selected human minds to be fit, and why would the optimum be entirely described by RL (even if it involves RL as a component)?
I agree that humans don’t effectively optimize inclusive genetic fitness, and that human minds are suboptimal in all kinds of ways from evolution’s perspective. However this doesn’t seem connected with any particular deviation that you are imagining, and indeed it looks to me like humans do have a fairly strong desire to have fit grandchildren (and that this desire would become stronger under further selection pressure).
At this point, there isn’t a strong reason to elevate this “inner reward optimizer” hypothesis to our attention. The idea that AIs will get really smart and primarily optimize some reward signal… I don’t know of any good mechanistic stories for that. I’d love to hear some, if there are any.
Apart from the other claims of your post, I think this line seems to be wrong. When considering whether gradient descent will learn model A or model B, the fact that model A gets a lower loss is a strong prima facie and mechanistic explanation for why gradient descent would learn A rather than B. The fact that there are possible subtleties about non-convexity of the loss landscape doesn’t change the existence of one strong reason.
That said, I agree that this isn’t a theorem or anything, and it’s great to talk about concrete ways in which SGD is suboptimal and how that influences alignment schemes, either making some proposals more dangerous or opening new possibilities. So far I’m mostly fairly skeptical of most concrete discussions along these lines but I still think they are valuable. Most of all it’s the very strong take here that seems unreasonable.
Thanks for the detailed comment. Overall, it seems to me like my points stand, although I think a few of them are somewhat different than you seem to have interpreted.
policies learned by RL are probably not purely described as optimizing anything. I also agree that an alignment strategy might try to exploit the suboptimality of gradient descent
I think I believe the first claim, which I understand to mean “early-/mid-training AGI policies consist of contextually activated heuristics of varying sophistication, instead of e.g. a globally activated line of reasoning about a crisp inner objective.” But that wasn’t actually a point I was trying to make in this post.
in deep RL we typically use gradient descent to find policies that achieve a lot of reward (typically updating the policy based on an estimator for the gradient of the reward).
Depends. This describes vanilla PG but not DQN. I think there are lots of complications which throw serious wrenches into the “and then SGD hits a ‘global reward optimum’” picture. I’m going to have a post explaining this in more detail, but I will say some abstract words right now in case it shakes something loose / clarifies my thoughts.
Critic-based approaches like DQN have a highly nonstationary loss landscape. The TD-error loss landscape depends on the replay buffer; the replay buffer depends on the policy (in ϵ-greedy exploration, the greedy action depends on the Q-network); the policy depends on past updates; the past updates depend on past replay buffers… This high nonstationarity of the loss landscape basically makes gradient hacking easy in RL (and e.g. vanilla PG seems to confront similar issues, even though it’s directly climbing the reward landscape). For one thing, the DQN agent just isn’t updating off of experiences it hasn’t had.
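To make that circularity concrete, here is a minimal sketch (my own toy example, not code from any of the posts; the tabular “Q-network” and the environment are made up) of a DQN-style loop in which the data the TD loss is computed on is itself a function of the current Q-values:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))   # tabular stand-in for the Q-network
replay_buffer = []                    # contents depend on every past policy
eps, lr, gamma = 0.1, 0.1, 0.9

def step_env(s, a):
    # made-up dynamics, standing in for whatever the real task is
    s_next = (s + a) % n_states
    r = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, r

s = 0
for _ in range(1000):
    # epsilon-greedy: which action counts as "greedy" depends on the current Q-values...
    a = int(rng.integers(n_actions)) if rng.random() < eps else int(np.argmax(Q[s]))
    s_next, r = step_env(s, a)
    # ...so what lands in the buffer depends on the policy, which depends on past updates
    replay_buffer.append((s, a, r, s_next))
    # the TD-error "loss" is only ever evaluated on experienced transitions
    s_b, a_b, r_b, s_nb = replay_buffer[int(rng.integers(len(replay_buffer)))]
    Q[s_b, a_b] += lr * (r_b + gamma * np.max(Q[s_nb]) - Q[s_b, a_b])
    s = s_next
```

The “loss landscape” at any step is defined over whatever happens to be in `replay_buffer`, which was generated by earlier versions of `Q`; a transition the agent never takes contributes nothing to any update.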
As a situation sufficient to illustrate this kind of problem, consider a smart, reflective agent whose computations have historically been reinforced when it attained a raspberry (with reward 1):
In this new task, this agent has to navigate a maze to get the 100-reward blueberry. Will the agent be forced to get the blueberry?
Not if exploration is on-policy, or if the agent reflectively models and affects its training process. In either case, the agent can zero out its exploration probability of the maze, so as to avoid predictable value drift towards blueberries. The agent would correctly model that if it attained the blueberry, that experience would enter its data distribution and the agent would be updated so as to navigate towards blueberries instead of raspberries, which leads to fewer raspberries, which means the agent doesn’t navigate to that future. Effectively, this means that the agent’s “gradient”/expected-update in the reward landscape is zero along dimensions which would increase the probability it gets blueberries.
So it’s not just a matter of SGD being suboptimal given a fixed data distribution. If the agent doesn’t have an extremely strong “forced to try all actions forever” guarantee (which it won’t, because it’s embedded and can modify its own learning process), the reward landscape is full of stable attractors in which the agent simply never explores the experiences whose updates would push it towards becoming a wireheader, and therefore its expected update is zero along those dimensions. More extremely, the inner agent can just stop itself from being updated in certain ways (in order to prevent value drift towards reward-optimization); this intervention is instrumentally convergent.
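A minimal numerical illustration of the “expected update is zero along unexplored dimensions” claim (the two-armed raspberry/blueberry bandit and the softmax parameterization are my own stand-ins, not anything from the post): if the policy puts zero probability on ever entering the maze, the on-policy expected REINFORCE update in that direction is exactly zero, however large the blueberry reward is.

```python
import numpy as np

rewards = np.array([1.0, 100.0])   # index 0: raspberry, index 1: blueberry (made-up numbers)

def expected_update(pi):
    """Expected REINFORCE update, E_{a~pi}[ r(a) * grad_theta log pi(a) ], for a softmax policy.

    For a softmax over logits theta, grad_theta log pi(a) = onehot(a) - pi.
    """
    grad_log_pi = np.eye(len(pi)) - pi        # row a is grad_theta log pi(a)
    return (pi * rewards) @ grad_log_pi       # weight each row by pi(a) * r(a) and sum

# Agent that still explores the maze a little:
print(expected_update(np.array([0.9, 0.1])))  # nonzero pull toward the blueberry logit

# Agent that has zeroed out its exploration of the maze (limiting case of the softmax):
print(expected_update(np.array([1.0, 0.0])))  # exactly [0. 0.]: no pull, despite reward 100
```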
As an example, at the level of informal discussion in this post I’m not sure why you aren’t surprised that GPT-3 ever thinks about the meaning of words rather than simply thinking about statistical associations between words (after all if it isn’t yet thinking about the meaning of words, how would gradient descent find the behavior of starting to think about meanings of words?).
I did leave a footnote:
Of course, credit assignment doesn’t just reshuffle existing thoughts. For example, SGD raises image classifiers out of the noise of the randomly initialized parameters. But the refinements are local in parameter-space, and dependent on the existing weights through which the forward pass flowed.
However, I think your comment deserves a more substantial response. I actually think that, given just the content in the post, you might wonder why I believe SGD can train anything at all, since there is only noise at the beginning.[1]
Here’s one shot at a response: consider an online RL setup. The gradient locally changes the computations so as to reduce loss or increase the probability of taking a given action in a given state, and this process is triggered by reward. The agent’s gradient should most naturally hinge on modeling the parts of the world it was interacting with, observing, or representing in its hidden state while making that decision, and not necessarily on modeling the register in some computer somewhere which happens to e.g. correlate perfectly with the triggering of credit assignment.
For example, in the batched update regime, when an agent gets reinforced for completing a maze by moving right, the batch update will upweight decision-making which outputs “right” when the exit is to the right, but which doesn’t output “right” when there’s a wall to the right. This computation must somehow distinguish between exits and walls in the relevant situations. Therefore, I expect such an agent to compute features about the topology of the maze. However, the same argument does not go through for developing decision-relevant features computing the value of the antecedent-computation-reinforcer register.
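Here is a sketch of that “credit flows through the features the forward pass actually used” point (the feature names and the tiny linear policy are hypothetical, purely for illustration): a policy-gradient update can only move weights attached to inputs the policy actually receives, and the value of the reinforcement register is not among those inputs.

```python
import numpy as np

# Hypothetical observation features the policy actually receives:
#   x[0] = 1 if the exit is to the right, x[1] = 1 if a wall is to the right.
# The reinforcement register's value is not an input, so no weight is attached to it.
theta = np.zeros(2)   # linear logit for the action "go right"

def p_right(x):
    return 1.0 / (1.0 + np.exp(-(theta @ x)))

def reinforce_update(x, went_right, reward, lr=0.5):
    # grad_theta log pi(a) = (a - p) * x for a Bernoulli policy with logit theta @ x
    global theta
    theta = theta + lr * reward * (went_right - p_right(x)) * x

reinforce_update(np.array([1.0, 0.0]), went_right=1, reward=1.0)  # rewarded at an exit
reinforce_update(np.array([0.0, 1.0]), went_right=1, reward=0.0)  # not rewarded at a wall
print(theta)  # only weights on the observed exit/wall features can move; nothing in the
              # parameter vector corresponds to the register's value at all
```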
One possible distinction is that you are talking about exploration difficulty rather than other non-convexities. But I don’t think I would buy that—task completion and reward are not synonymous even for the intended behavior, unless we take some extraordinary pains to provide “perfect” reward signals. So it seems like no exploration is needed, and we are really talking about optimization difficulties for SGD on supervised problems.
I don’t know what you mean by a “perfect” reward signal, why that has anything to do with exploration difficulty, or why no exploration would be needed for my arguments to go through. I do think that if we assume the agent is forced to wirehead, it will become a wireheader; so in that sense, yes, my claim is mostly focused on exploration & gradient hacking.
Humans do not appear to be purely RL agents trained with some intrinsic reward function. There seems to be a lot of other stuff going on in human brains too. So observing that humans don’t pursue reward doesn’t seem very informative to me. You may disagree with this claim about human brains, but at best I think this is a conjecture you are making.
Not claiming that people are pure RL. Let’s wait until future posts to discuss.
(I believe this would be a contrarian take within psychology or cognitive science, which would mostly say that there is considerable complexity in human behavior.)
Seems unrelated to me; considerable complexity in human behavior does not imply considerable complexity in the learning algorithm; GPT-3 is far more complex than its training process.
I agree that humans don’t effectively optimize inclusive genetic fitness, and that human minds are suboptimal in all kinds of ways from evolution’s perspective. However this doesn’t seem connected with any particular deviation that you are imagining
The point is that the argument “We’re selecting for agents on reward → we get an agent which optimizes reward” is locally invalid. “We select for agents on X → we get an agent which optimizes X” is not true for the case of evolution (which didn’t find inclusive-genetic-fitness optimizers), so it is not true in general, and so the implication doesn’t necessarily hold in the AI reward-selection case. Even if RL did happen to train reward optimizers and this post were wrong, the selection argument is too weak on its own to establish that conclusion.
When considering whether gradient descent will learn model A or model B, the fact that model A gets a lower loss is a strong prima facie and mechanistic explanation for why gradient descent would learn A rather than B.
This is not mechanistic, as I use the word. I understand “mechanistic” to mean something like “Explaining the causal chain by which an event happens”, not just “Explaining why an event should happen.” However, it is an argument for the latter, and possibly a good one. But the supervised case seems way different than the RL case.
The GPT-3 example is somewhat different. Supervised learning provides exact gradients towards the desired output, unlike RL. However, I think you could have equally complained “I don’t see why you think RL policies ever learn anything”, which would make an analogous point.
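To spell out that contrast (a generic sketch, not specific to GPT-3 or to any particular RL algorithm): in supervised learning the loss names the desired output directly, so the gradient points straight at it; in policy-gradient RL the update is a reward-weighted nudge toward whatever the policy happened to sample.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.zeros(3)

# Supervised (cross-entropy): the label says exactly which output was wanted,
# so the gradient is computed directly toward it, every time.
target = 2
grad_supervised = softmax(logits) - np.eye(3)[target]

# Policy gradient: sample an action from the current policy, then scale
# grad log pi(a) by a scalar reward that never says which output was wanted.
rng = np.random.default_rng(0)
a = rng.choice(3, p=softmax(logits))
reward = 1.0
grad_rl_estimate = -reward * (np.eye(3)[a] - softmax(logits))  # written as a "loss" gradient

print(grad_supervised)
print(grad_rl_estimate)
```

The supervised gradient is the same on every pass; the RL estimate depends on which action happened to be sampled and carries only a scalar’s worth of information about it.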
Not if exploration is on-policy, or if the agent reflectively models and affects its training process. In either case, the agent can zero out its exploration probability of the maze, so as to avoid predictable value drift towards blueberries. The agent would correctly model that if it attained the blueberry, that experience would enter its data distribution and the agent would be updated so as to navigate towards blueberries instead of raspberries, which leads to fewer raspberries, which means the agent doesn’t navigate to that future.
If this agent is smart/reflective enough to model/predict the future effects of its RL updates, then you already are assuming a model-based agent which will then predict higher future reward by going for the blueberry. You seem to be assuming the bizarre combination of model-based predictive capability for future reward gradient updates but not future reward itself. Any sensible model-based agent would go for the blueberry absent some other considerations.
This is not pure speculation, in the sense that you can run EfficientZero in scenarios like this, and I bet it goes for the blueberry.
Your mental model seems to assume pure model-free RL trained to the point that it gains some specific model-based predictive planning capabilities without using those same capabilities to get greater reward.
Humans often intentionally avoid some high reward ‘blueberry’ analogs like drugs using something like the process you describe here, but hedonic reward is only one component of the human utility function, and our long term planning instead optimizes more for empowerment—which is usually in conflict with short term hedonic reward.
Long before they knew about reward circuitry, humans noticed that e.g. vices are behavioral attractors, with vice → more propensity to do the vice next time → vice, in a vicious cycle. They noticed this well before they noticed that they had reward circuitry causing the internal reinforcement events. If you’re predicting future observations via e.g. SSL, I think it becomes important to (at least crudely) model the effects of value drift during training.
I’m not saying the AI won’t care about reward at all. I think it’ll be a secondary value, but that was sideways of my point here. In this quote, I was arguing that the AI would be quite able to avoid a “vice” (the blueberry) by modeling the value drift on some level. I was showing a sufficient condition for the “global maximum” picture getting a wrench thrown in it.
When, quantitatively, should that happen, where the agent steps around the planning process? Not sure.
If this agent is smart/reflective enough to model/predict the future effects of its RL updates, then you already are assuming a model-based agent which will then predict higher future reward by going for the blueberry. You seem to be assuming the bizarre combination of model-based predictive capability for future reward gradient updates but not future reward itself. Any sensible model-based agent would go for the blueberry absent some other considerations.
I think I have some idea what TurnTrout might’ve had in mind here. Like us, this reflective agent can predict the future effects of its actions using its predictive model, but its behavior is still steered by a learned value function, and that value function will by default be misaligned with the reward calculator/reward predictor. This—a learned value function—is a sensible design for a model-based agent because we want the agent to make foresighted decisions that generalize to conditions we couldn’t have known to code into the reward calculator (e.g. searching a part of the chess move tree that “looks promising” according to its value function, even if its model does not predict that a checkmate reward is close at hand).
Any efficient model-based agent will use learned value functions, so in practice the difference between model-based and model-free blurs for efficient designs. The model-based planning generates rollouts that can help better train the ‘model free’ value function.
EfficientZero uses all of that, and like I said, it does not exhibit this failure mode: it will get the blueberry. If the planner can predict a high gradient update for the blueberry, then it has already implicitly predicted high utility for the blueberry, and EZ’s update step would then correctly propagate that and choose the high-utility path leading to the blueberry.
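A minimal sketch of the kind of planner being argued about (the maze, the model, and the numbers are invented; this is “EfficientZero-flavored” only in the loose sense of planning with a learned model plus a learned value function): whether the rollouts route the agent to the blueberry comes down to what the learned value function already says about it.

```python
# Toy deterministic "model" of the task: each state maps actions to successor states.
model = {
    "start": {"left": "raspberry_patch", "right": "maze"},
    "maze":  {"forward": "blueberry"},
}

# Learned value function (hypothetical numbers). If training has already propagated the
# blueberry's reward into V (jacob_cannell's picture), planning routes straight to it;
# if V still only reflects raspberries (TurnTrout's picture), the planner avoids the maze.
V = {"raspberry_patch": 1.0, "maze": 100.0, "blueberry": 100.0}

def plan(state, depth=3):
    """Greedy lookahead with the learned model: at each step, pick the action whose
    predicted successor has the highest learned value."""
    path = []
    while depth > 0 and state in model:
        action, state = max(model[state].items(), key=lambda kv: V.get(kv[1], 0.0))
        path.append(action)
        depth -= 1
    return path, state

print(plan("start"))  # (['right', 'forward'], 'blueberry') with the values above; set
                      # V['maze'] = V['blueberry'] = 0.0 and it returns (['left'], 'raspberry_patch')
```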
Nor does the meta-prediction about avoiding gradients carry through. If it did, EZ wouldn’t work at all, because every time it finds a new high-utility plan, it is in the equivalent of the blueberry situation.
Just because the value function can become misaligned with the utility function in theory does not imply that such misalignment always occurs, or occurs with any particular frequency. (There are human examples, such as OCD habits, which look like an overtrained and stuck value function, but that isn’t a universal failure mode for all humans, let alone all agents.)
Not the OP, so I’ll try to explain how I understood the post based on past discussions. [And pray that I’m not misrepresenting TurnTrout’s model.]
(Though I’m not clear on how much you are talking about the suboptimality of SGD, vs the fact that optimal policies themselves do not explicitly represent or pursue reward given that complex stews of heuristics may be faster or simpler. And it also seems plausible you are talking about something else entirely.)
As I read it, the post is not focused on some generally-applicable suboptimality of SGD, nor is it saying that policies that would maximize reward in training need to explicitly represent reward.
It is mainly talking about an identifiability gap within certain forms of reinforcement learning: there is a range of cognition compatible with the same reward performance. Computations that have the side effect of incrementing reward—because, for instance, the agent is competently trying to do the rewarded task—would be reinforced if the agent adopted them, in the same way that computations that act *in order to* increment reward would. Given that, some other rationale beyond the reward performance one seems necessary in order for us to expect the particular pattern of reward optimization (“reward but no task completion”) from RL agents.
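One way to make the identifiability point concrete (a deliberately cartoonish sketch; neither “policy” resembles a real network): on the training distribution, the two policies below take identical actions and so receive identical rewards, so any update rule that only sees reward cannot favor one kind of cognition over the other.

```python
def reward_fn(obs, action):
    # the rewarded task: output double the observed number
    return 1.0 if action == 2 * obs else 0.0

def task_policy(obs):
    # cognition: "complete the task"; reward is never represented anywhere
    return 2 * obs

def reward_seeking_policy(obs):
    # cognition: "search for whichever action I predict gets the most reward"
    return max(range(10), key=lambda a: reward_fn(obs, a))

train_obs = range(5)
print(all(task_policy(o) == reward_seeking_policy(o) for o in train_obs))   # True
print([reward_fn(o, task_policy(o)) for o in train_obs])                    # identical reward streams,
print([reward_fn(o, reward_seeking_policy(o)) for o in train_obs])          # hence identical reinforcement
```

Off the training distribution (say obs = 6, where the doubled answer falls outside the searched action range) the two come apart, but the training signal never had a chance to register that difference.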
In addition to the identifiability issue, the post (as well as Steve Byrnes in a sister thread) notes a kind of inner alignment issue. Because an RL agent influences its own training process, it can steer itself towards futures where its existing motivations are preserved instead of being modified (for example, modified into reward-optimizing ones). In fact, that seems more and more likely as the agent grows towards strategic awareness, since it could then model how its behavior might lead to its goals being changed. This second issue depends on the fact that we are doing local search, in that the current agent can sway which policies are available for selection.
Together these point towards a certain way of reasoning about agents under RL: modeling their current cognition (including their motivations, values etc.) as downstream of past reinforcement & punishment events. I think that this kind of reasoning should constrain our expectations about how reinforcement schedules + training environments + inductive biases lead to particular patterns of behavior, in a way that is more specific than if we were only reasoning about reward-optimal policies. Though I am less certain at the moment about how to flesh that out.
I interpret OP (though this is colored by the fact that I was thinking this before I read this) as saying Adaptation-Executers, not Fitness-Maximizers, but about ML. At which point you can open the reference category to all organisms.
Gradient descent isn’t really different from what evolution does. It’s just a bit faster, and takes a slightly more direct line. Importantly, it’s not more capable of avoiding local maxima (per se, at least).
Humans do not appear to be purely RL agents trained with some intrinsic reward function. There seems to be a lot of other stuff going on in human brains too. So observing that humans don’t pursue reward doesn’t seem very informative to me. You may disagree with this claim about human brains, but at best I think this is a conjecture you are making. (I believe this would be a contrarian take within psychology or cognitive science, which would mostly say that there is considerable complexity in human behavior.) It would also be kind of surprising a priori—evolution selected human minds to be fit, and why would the optimum be entirely described by RL (even if it involves RL as a component)?
If you write code for a model-based RL agent, there might be a model that’s updated by self-supervised learning, and actor-critic parts that involve TD learning, and there’s stuff in the code that calculates the reward function, and other odds and ends like initializing the neural architecture and setting the hyperparameters and shuttling information around between different memory locations and so on.
On the one hand, “there is a lot of stuff going on” in this codebase.
On the other hand, I would say that this codebase is for “an RL agent”.
You use the word “pure” (“Humans do not appear to be purely RL agents…”), but I don’t know what that means. If a model-based RL agent involves self-supervised learning within the model, is it “impure”?? :-P
The thing I describe above is very roughly how I propose the human brain works—see Posts #2–#7 here. Yes it’s absolutely a “conjecture”—for example, I’m quite sure Steven Pinker would strongly object to it. Whether it’s “surprising a priori” or not goes back to whether that proposal is “entirely described by RL” or not. I guess you would probably say “no that proposal is not entirely described by RL”. For example, I believe there is circuitry in the brainstem that regulates your heart-rate, and I believe that this circuitry is specified in detail by the genome, not learned within a lifetime by a learning algorithm. (Otherwise you would die.) This kind of thing is absolutely part of my proposal, but probably not what you would describe as “pure RL”.
It sounded like OP was saying: using gradient descent to select a policy that gets a high reward probably won’t produce a policy that tries to maximize reward. After all, look at humans, who aren’t just trying to get a high reward.
And I am saying: this analogy seems like pretty weak evidence, because human brains seem to have a lot of things going on other than “search for a policy that gets high reward,” and those other things seem to have a massive impact on what goals I end up pursuing.
ETA: as a simple example, it seems like the details of humans’ desire for their children’s success, or their fear of death, don’t seem to match well with the theory that all human desires come from RL on intrinsic reward. I guess you probably think they do? If you’ve already written about that somewhere it might be interesting to see. Right now the theory “human preferences are entirely produced by doing RL on an intrinsic reward function” seems to me to make a lot of bad predictions and not really have any evidence supporting it (in contrast with a more limited theory about RL-amongst-other-things, which seems more solid but not sufficient for the inference you are trying to make in this post).
I didn’t write the OP. If I were writing a post like this, I would (1) frame it as a discussion of a more specific class of model-based RL algorithms (a class that includes human within-lifetime learning), (2) soften the claim from “the agent won’t try to maximize reward” to “the agent won’t necessarily try to maximize reward”.
I do think the human (within-lifetime) reward function has an outsized impact on what goals humans ends up pursuing, although I acknowledge that it’s not literally the only thing that matters.
(By the way, I’m not sure why your original comment brought up inclusive genetic fitness at all; aren’t we talking about within-lifetime RL? The within-lifetime reward function is some complicated thing involving hunger and sex and friendship etc., not inclusive genetic fitness, right?)
I think incomplete exploration is very important in this context and I don’t quite follow why you de-emphasize that in your first comment. In the context of within-lifetime learning, perfect exploration entails that you try dropping an anvil on your head, and then you die. So we don’t expect perfect exploration; instead we’d presumably design the agent such that it explores if and only if it “wants” to explore, in a way that can involve foresight.
And another thing that perfect exploration would entail is trying every addictive drug (let’s say cocaine), lots of times, in which case reinforcement learning would lead to addiction.
So, just as the RL agent would (presumably) be designed to be able to make a foresighted decision not to try dropping an anvil on its head, that same design would also incidentally enable it to make a foresighted decision not to try taking lots of cocaine and getting addicted. (We expect it to make the latter decision because of the instrumentally convergent goal-preservation drive.) So it might wind up never wireheading, and if so, that would be intimately related to its incomplete exploration.
(By the way, I’m not sure why your original comment brought up inclusive genetic fitness at all; aren’t we talking about within-lifetime RL? The within-lifetime reward function is some complicated thing involving hunger and sex and friendship etc., not inclusive genetic fitness, right?)
This was mentioned in OP (“The argument would prove too much. Evolution selected for inclusive genetic fitness, and it did not get IGF optimizers.”). It also appears to be a much stronger argument for the OP’s position and so seemed worth responding to.
I think incomplete exploration is very important in this context and I don’t quite follow why you de-emphasize that in your first comment. In the context of within-lifetime learning, perfect exploration entails that you try dropping an anvil on your head, and then you die. So we don’t expect perfect exploration; instead we’d presumably design the agent such that it explores if and only if it “wants” to explore, in a way that can involve foresight.
It seems to me that incomplete exploration doesn’t plausibly cause you to learn “task completion” instead of “reward” unless the reward function is perfectly aligned with task completion in practice. That’s an extremely strong condition, and if the entire OP is conditioned on that assumption then I would expect it to have been mentioned.
I didn’t write the OP. If I were writing a post like this, I would (1) frame it as a discussion of a more specific class of model-based RL algorithms (a class that includes human within-lifetime learning), (2) soften the claim from “the agent won’t try to maximize reward” to “the agent won’t necessarily try to maximize reward”.
If the OP is not intending to talk about the kind of ML algorithm deployed in practice, then it seems like a lot of the implications for AI safety would need to be revisited. (For example, if it doesn’t apply to either policy gradients or the kind of model-based control that has been used in practice, then that would be a huge caveat.)
It seems to me that incomplete exploration doesn’t plausibly cause you to learn “task completion” instead of “reward” unless the reward function is perfectly aligned with task completion in practice. That’s an extremely strong condition, and if the entire OP is conditioned on that assumption then I would expect it to have been mentioned.
Let’s say, in the first few actually-encountered examples, reward is in fact strongly correlated with task completion. Reward is also of course 100% correlated with reward itself.
Then (at least under many plausible RL algorithms), the agent-in-training, having encountered those first few examples, might wind up wanting / liking the idea of task completion, OR wanting / liking the idea of reward, OR wanting / liking both of those things at once (perhaps to different extents). (I think it’s generally complicated and a bit fraught to predict which of these three possibilities would happen.)
But let’s consider the case where the RL agent-in-training winds up mostly or entirely wanting / liking the idea of task completion. And suppose further that the agent-in-training is by now pretty smart and self-aware and in control of its situation. Then the agent may deliberately avoid encountering edge-case situations where reward would come apart from task completion. (In the same way that I deliberately avoid taking highly-addictive drugs.)
Why? Because of the instrumentally convergent goal-preservation drive. After all, encountering those situations would lead to its no longer valuing task completion.
So, deliberately-imperfect exploration is a mechanism that allows the RL agent to (perhaps) stably value something other than reward, even in the absence of perfect correlation between reward and that thing.
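Here is a toy simulation of that mechanism (all the numbers, and the “foresight” rule itself, are invented for illustration): reward tracks task completion except on one “wirehead” edge-case action, and an agent that already values the task and refuses to sample that action never receives the updates that would drag it toward valuing reward.

```python
import numpy as np

rng = np.random.default_rng(0)

def reward(a):
    # action 0: complete the task (reward ~1); action 1: "wirehead" edge case (reward 10)
    return 1.0 + 0.1 * rng.standard_normal() if a == 0 else 10.0

def run(foresighted, steps=500, lr=0.1, eps=0.1):
    v = np.array([0.5, 0.0])   # learned values; the agent starts out caring about the task
    for _ in range(steps):
        if foresighted:
            a = 0              # deliberately never samples the edge case
        else:
            a = int(rng.integers(2)) if rng.random() < eps else int(np.argmax(v))
        v[a] += lr * (reward(a) - v[a])   # only sampled actions ever get their values updated
    return v

print(run(foresighted=True))    # task value converges to ~1, wirehead value never moves from 0
print(run(foresighted=False))   # wirehead value climbs toward 10 and the greedy policy follows it
```

The foresighted agent earns less reward, not more; the point is only that the updates which would move its values are never triggered.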
(By the way, in my mind, nothing here should be interpreted as a safety proposal or argument against x-risk. Just a discussion of algorithms! As it happens, I think wireheading is bad and I am very happy for RL agents to have a chance at permanently avoiding it. But I am very unhappy with the possibility of RL agents deciding to lock in their values before those values are exactly what the programmers want them to be. I think of this as sorta in the same category as gradient hacking.)
This comment seems to predict that an agent that likes getting raspberries and judges that they will be highly rewarded for getting blueberries will deliberately avoid blueberries to prevent value drift.
Risks from Learned Optimization seems to predict that an agent that likes getting raspberries and judges that they will be highly rewarded for getting blueberries will deliberately get blueberries to prevent value drift.
What’s going on here? Are these predictions in opposition to each other, or do they apply to different situations?
It seems to me that in the first case we’re imagining (the agent predicting) that getting blueberries will reinforce thoughts like ‘I should get blueberries’, whereas in the second case we’re imagining it will reinforce thoughts like ‘I should get blueberries in service of my ultimate goal of getting raspberries’. When should we expect one over the other?
I think RFLO is mostly imagining model-free RL with updates at the end of each episode, and my comment was mostly imagining model-based RL with online learning (e.g. TD learning). The former is kinda like evolution, the latter is kinda like within-lifetime learning, see e.g. §10.2.2 here.
The former would say: If I want lots of raspberries to get eaten, and I have a genetic disposition to want raspberries to be eaten, then I should maybe spend some time eating raspberries, but also more importantly I should explicitly try to maximize my inclusive genetic fitness so that I have lots of descendants, and those descendants (who will also disproportionately have the raspberry-eating gene) will then eat lots of raspberries.
The latter would say: If I want lots of raspberries to get eaten, and I have a genetic disposition to want raspberries to be eaten, then I shouldn’t go do lots of highly-addictive drugs that warp my preferences such that I no longer care about raspberries or indeed anything besides the drugs.
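A schematic of the two regimes being contrasted (generic toy code, not a claim about what RFLO or any particular system implements): in the episodic regime nothing about the agent changes mid-lifetime and whole lifetimes are scored against each other, while in the online TD regime the value estimates shift after every step, inside the very trajectory the agent is living through.

```python
from collections import defaultdict

# Episodic, "evolution-like" regime: nothing about a policy changes mid-lifetime;
# completed lifetimes are scored as wholes and selection happens between them.
def episodic_regime(episodes):
    returns = {name: sum(rewards) for name, rewards in episodes.items()}
    return max(returns, key=returns.get)

# Online TD, "within-lifetime" regime: value estimates shift after every single step,
# so the same agent's later choices are already shaped by its earlier experiences.
def td_regime(trajectory, lr=0.1, gamma=0.9):
    V = defaultdict(float)
    for s, r, s_next in trajectory:
        V[s] += lr * (r + gamma * V[s_next] - V[s])
    return dict(V)

print(episodic_regime({"policy_A": [1, 0, 1], "policy_B": [0, 0, 5]}))
print(td_regime([("sober", 0.0, "vice"), ("vice", 1.0, "sober"), ("sober", 0.0, "vice")]))
```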
Right. So if selection acts on policies, each policy should aim to maximise reward in any episode in order to maximise its frequency in the population. But if selection acts on particular aspects of policies, a policy should try to get reward for doing things it values, and not for things it doesn’t, in order to reinforce those values. In particular this can mean getting less reward overall.
Does this suggest a class of hare-brained alignment schemes where you train with a combination of inter-policy and intra-policy updates to take advantage of the difference?
For example you could clearly label which episodes are to be used for which and observe whether a policy consistently gets more reward in the former case than the latter. If it does, conclude it’s sophisticated enough to reason about its training setup.
Or you could not label which is which, and randomly switch between the two, forcing your agents to split the difference and thus be about half as successful at locking in their values.
+1 on this comment, I feel pretty confused about the excerpt from Paul that Steve quoted above. And even without the agent deliberately deciding where to avoid exploring, incomplete exploration may lead to agents which learn non-reward goals before convergence—so if Paul’s statement is intended to refer to optimal policies, I’d be curious why he thinks that’s the most important case to focus on.
This seems plausible if the environment is a mix of (i) situations where task completion correlates (almost) perfectly with reward, and (ii) situations where reward is very high while task completion is very low. Such as if we found a perfect outer alignment objective, and the only situation in which reward could deviate from the overseer’s preferences would be if the AI entirely seized control of the reward.
But it seems less plausible if there are always (small) deviations between reward and any reasonable optimization target that isn’t reward (or close enough so as to carry all relevant arguments). E.g. if an AI is trained on RL from human feedback, and it can almost always do slightly better by reasoning about which action will cause the human to give it the highest reward.
Sure, other things equal. But other things aren’t necessarily equal. For example, regularization could stack the deck in favor of one policy over another, even if the latter has been systematically producing slightly higher reward. There are lots of things like that; the details depend on the exact RL algorithm. In the context of brains, I have discussion and examples in §9.3.3 here.
as a simple example, it seems like the details of humans’ desire for their children’s success, or their fear of death, don’t seem to match well with the theory that all human desires come from RL on intrinsic reward.
I’m trying to parse out what you’re saying here, to understand whether I agree that human behavior doesn’t seem to be almost perfectly explained as the result of an RL agent (with an interesting internal architecture) maximizing an inner learned reward.
On my model, the outer objective of inclusive genetic fitness created human mesa-optimizers with inner objectives like “desire your children’s success” or “fear death”, which are decent approximations of IGF (given that directly maximizing IGF itself is intractable as it’s a Nash equilibrium of an unknown game). It seems to me that human behavior policies are actually well-approximated as those of RL agents maximizing [our children’s success] + [not dying] + [retaining high status within the tribe] + [being exposed to novelty to improve our predictive abilities] + … .
Humans do sometimes construct modified internal versions of these rewards based on pre-existing learned representations (e.g. desiring your adopted children’s success) - is that what you’re pointing at?
Generally interested to hear more of the “bad predictions” this model makes.
I’m trying to parse out what you’re saying here, to understand whether I agree that human behavior doesn’t seem to be almost perfectly explained as the result of an RL agent (with an interesting internal architecture) maximizing an inner learned reward.
What do you mean by “inner learned reward”? This post points out that even if humans were “pure RL agents”, we shouldn’t expect them to maximize their own reward. Maybe you mean “inner mesa objectives”?
it seems like the details of humans’ desire for their children’s success, or their fear of death, don’t seem to match well with the theory that all human desires come from RL on intrinsic reward. I guess you probably think they do?
That’s the foundational assumption of the shard theory that this sequence is introducing, yes. Here’s the draft of a fuller overview that goes into some detail as to how that’s supposed to work. (Uh, to avoid confusion: I’m not affiliated with the theory. Just spreading information.)
I would disagree that it is an assumption. That same draft talks about the outsized role of self-supervised learning in determining the particular ordering and kinds of concepts that human desires latch onto. Learning from reinforcement is a core component of value formation (under shard theory), but not the only one.
As an example, at the level of informal discussion in this post I’m not sure why you aren’t surprised that GPT-3 ever thinks about the meaning of words rather than simply thinking about statistical associations between words (after all if it isn’t yet thinking about the meaning of words, how would gradient descent find the behavior of starting to think about meanings of words?).
I’ve updated the post to clarify. I think the focus on “antecedent computation reinforcement” (while often probably ~accurate) was imprecise/wrong for reasons like this. I now instead emphasize that the math of policy-gradient approaches means that reward chisels cognitive circuits into networks.
At some level I agree with this post—policies learned by RL are probably not purely described as optimizing anything. I also agree that an alignment strategy might try to exploit the suboptimality of gradient descent, and indeed this is one of the major points of discussion amongst people working on alignment in practice at ML labs.
However, I’m confused or skeptical about the particular deviations you are discussing and I suspect I disagree with or misunderstand this post.
As you suggest, in deep RL we typically use gradient descent to find policies that achieve a lot of reward (typically updating the policy based on an estimator for the gradient of the reward).
If you have a system with a sophisticated understanding of the world, then cognitive policies like “select actions that I expect would lead to reward” will tend to outperform policies like “try to complete the task,” and so I usually expect them to be selected by gradient descent over time. (Or we could be more precise and think about little fragments of policies, but I don’t think it changes anything I say here.)
It seems to me like you are saying that you think gradient descent will fail to find such policies because it is greedy and local, e.g. if the agent isn’t thinking about how much reward it will receive then gradient descent will never learn policies that depend on thinking about reward.
(Though I’m not clear on how much you are talking about the suboptimality of SGD, vs the fact that optimal policies themselves do not explicitly represent or pursue reward given that complex stews of heuristics may be faster or simpler. And it also seems plausible you are talking about something else entirely.)
I generally agree that gradient descent won’t find optimal policies. But I don’t understand the particular kinds of failures you are imagining or why you think they change the bottom line for the alignment problem. That is, it seems like you have some specific take on ways in which gradient descent is suboptimal and therefore how you should reason differently about “optimum of loss function” from “local optimum found by gradient descent” (since you are saying that thinking about “optimum of loss function” is systematically misleading). But I don’t understand the specific failures you have in mind or even why you think you can identify this kind of specific failure.
As an example, at the level of informal discussion in this post I’m not sure why you aren’t surprised that GPT-3 ever thinks about the meaning of words rather than simply thinking about statistical associations between words (after all if it isn’t yet thinking about the meaning of words, how would gradient descent find the behavior of starting to think about meanings of words?).
One possible distinction is that you are talking about exploration difficulty rather than other non-convexities. But I don’t think I would buy that—task completion and reward are not synonymous even for the intended behavior, unless we take some extraordinary pains to provide “perfect” reward signals. So it seems like no exploration is needed, and we are really talking about optimization difficulties for SGD on supervised problems.
The main concrete thing you say in this post is that humans don’t seem to optimize reward. I want to make two observations about that:
Humans do not appear to be purely RL agents trained with some intrinsic reward function. There seems to be a lot of other stuff going on in human brains too. So observing that humans don’t pursue reward doesn’t seem very informative to me. You may disagree with this claim about human brains, but at best I think this is a conjecture you are making. (I believe this would be a contrarian take within psychology or cognitive science, which would mostly say that there is considerable complexity in human behavior.) It would also be kind of surprising a priori—evolution selected human minds to be fit, and why would the optimum be entirely described by RL (even if it involves RL as a component)?
I agree that humans don’t effectively optimize inclusive genetic fitness, and that human minds are suboptimal in all kinds of ways from evolution’s perspective. However this doesn’t seem connected with any particular deviation that you are imagining, and indeed it looks to me like humans do have a fairly strong desire to have fit grandchildren (and that this desire would become stronger under further selection pressure).
Apart from the other claims of your post, I think this line seems to be wrong. When considering whether gradient descent will learn model A or model B, the fact that model A gets a lower loss is a strong prima facie and mechanistic explanation for why gradient descent would learn A rather than B. The fact that there are possible subtleties about non-convexity of the loss landscape doesn’t change the existence of one strong reason.
That said, I agree that this isn’t a theorem or anything, and it’s great to talk about concrete ways in which SGD is suboptimal and how that influences alignment schemes, either making some proposals more dangerous or opening new possibilities. So far I’m mostly fairly skeptical of most concrete discussions along these lines but I still think they are valuable. Most of all it’s the very strong take here that seems unreasonable.
Thanks for the detailed comment. Overall, it seems to me like my points stand, although I think a few of them are somewhat different than you seem to have interpreted.
I think I believe the first claim, which I understand to mean “early-/mid-training AGI policies consist of contextually activated heuristics of varying sophistication, instead of e.g. a globally activated line of reasoning about a crisp inner objective.” But that wasn’t actually a point I was trying to make in this post.
Depends. This describes vanilla PG but not DQN. I think there are lots of complications which throw serious wrenches into the “and then SGD hits a ‘global reward optimum’” picture. I’m going to have a post explaining this in more detail, but I will say some abstract words right now in case it shakes something loose / clarifies my thoughts.
Critic-based approaches like DQN have a highly nonstationary loss landscape. The TD-error loss landscape depends on the action replay buffer; the action replay buffer depends on the policy (in ϵ-greedy exploration, the greedy action depends on the Q-network); the policy depends on past updates; the past updates depend on past action replay buffers… The high nonstationarity in the loss landscape basically makes gradient hacking easy in RL (and e.g. vanilla PG seems to confront similar issues, even though it’s directly climbing the reward landscape). For one, the DQN agent just isn’t updating off of experiences it hasn’t had.
For a sufficient situation illustrating this kind of problem, consider a smart reflective agent which has historically had computations reinforced when it attained a raspberry (with reward 1):
In this new task, this agent has to navigate a maze to get the 100-reward blueberry. Will agents be forced to get the blueberry?
Not if exploration is on-policy, or if the agent reflectively models and affects its training process. In either case, the agent can zero out its exploration probability of the maze, so as to avoid predictable value drift towards blueberries. The agent would correctly model that if it attained the blueberry, that experience would enter its data distribution and the agent would be updated so as to navigate towards blueberries instead of raspberries, which leads to fewer raspberries, which means the agent doesn’t navigate to that future. Effectively, this means that the agent’s “gradient”/expected-update in the reward landscape is zero along dimensions which would increase the probability it gets blueberries.
So it’s not just a matter of SGD being suboptimal given a fixed data distribution. If the agent doesn’t have an extremely strong “forced to try all actions forever” guarantee (which it won’t, because it’s embedded and can modify its own learning process), the reward landscape is full of stable attractors which enforce zero exploration towards updates which would push it towards becoming a wireheader, and therefore its expected-update will be zero along these dimensions. More extremely, you can have the inner agent just stop itself from being updated in certain ways (in order to prevent value drift towards reward-optimization); this intervention is instrumentally convergent.
I did leave a footnote:
However, I think your comment deserves a more substantial response. I actually think that, given just the content in the post, you might wonder why I believe SGD can train anything at all, since there is only noise at the beginning.[1]
Here’s one shot at a response: Consider an online RL setup. The gradient locally changes the computations so as to reduce loss or increase the probability of taking a given action at a given state; this process is triggered by reward; an agent’s gradient should most naturally hinge on modeling parts of the world it was (interacting with/observing/representing in its hidden state) while making this decision, and not necessarily involve modeling the register in some computer somewhere which happens to e.g. correlate perfectly with the triggering of credit assignment.
For example, in the batched update regime, when an agent gets reinforced for completing a maze by moving right, the batch update will upweight decision-making which outputs “right” when the exit is to the right, but which doesn’t output “right” when there’s a wall to the right. This computation must somehow distinguish between exits and walls in the relevant situations. Therefore, I expect such an agent to compute features about the topology of the maze. However, the same argument does not go through for developing decision-relevant features computing the value of the antecedent-computation-reinforcer register.
I don’t know what you mean by a “perfect” reward signal, or why that has something to do with exploration difficulty, or why no exploration is needed for my arguments to go through? I think if we assume the agent is forced to wirehead, it will become a wireheader. This implies that my claim is mostly focused on exploration & gradient hacking.
Not claiming that people are pure RL. Let’s wait until future posts to discuss.
Seems unrelated to me; considerable complexity in human behavior does not imply considerable complexity in the learning algorithm; GPT-3 is far more complex than its training process.
The point is that the argument “We’re selecting for agents on reward → we get an agent which optimizes reward” is locally invalid. “We select for agents on X → we get an agent which optimizes X” is not true for the case of evolution (which didn’t find inclusive-genetic-fitness optimizers), so it is not true in general, so the implication doesn’t necessarily hold in the AI reward-selection case. Even if RL did happen to train reward optimizers and this post were wrong, the selection argument is too weak on its own to establish that conclusion.
This is not mechanistic, as I use the word. I understand “mechanistic” to mean something like “Explaining the causal chain by which an event happens”, not just “Explaining why an event should happen.” However, it is an argument for the latter, and possibly a good one. But the supervised case seems way different than the RL case.
The GPT-3 example is somewhat different. Supervised learning provides exact gradients towards the desired output, unlike RL. However, I think you could have equally complained “I don’t see why you think RL policies ever learn anything”, which would make an analogous point.
If this agent is smart/reflective enough to model/predict the future effects of its RL updates, then you already are assuming a model-based agent which will then predict higher future reward by going for the blueberry. You seem to be assuming the bizarre combination of model-based predictive capability for future reward gradient updates but not future reward itself. Any sensible model-based agent would go for the blueberry absent some other considerations.
This is not just purely speculation in the sense that you can run efficient zero in scenarios like this, and I bet it goes for the blueberry.
Your mental model seems to assume pure model-free RL trained to the point that it gains some specific model-based predictive planning capabilities without using those same capabilities to get greater reward.
Humans often intentionally avoid some high reward ‘blueberry’ analogs like drugs using something like the process you describe here, but hedonic reward is only one component of the human utility function, and our long term planning instead optimizes more for empowerment—which is usually in conflict with short term hedonic reward.
Long before they knew about reward circuitry, humans noticed that e.g. vices are behavioral attractors, with vice → more propensity to do the vice next time → vice, in a vicious cycle. They noticed that far before they noticed that they had reward circuitry causing the internal reinforcement events. If you’re predicting future observations via eg SSL, I think it becomes important to (at least crudely) model effects of value drift during training.
I’m not saying the AI won’t care about reward at all. I think it’ll be a secondary value, but that was sideways of my point here. In this quote, I was arguing that the AI would be quite able to avoid a “vice” (the blueberry) by modeling the value drift on some level. I was showing a sufficient condition for the “global maximum” picture getting a wrench thrown in it.
When, quantitatively, should that happen, where the agent steps around the planning process? Not sure.
I think I have some idea what TurnTrout might’ve had in mind here. Like us, this reflective agent can predict the future effects of its actions using its predictive model, but its behavior is still steered by a learned value function, and that value function will by default be misaligned with the reward calculator/reward predictor. This—a learned value function—is a sensible design for a model-based agent because we want the agent to make foresighted decisions that generalize to conditions we couldn’t have known to code into the reward calculator (i.e. searching in a part of the chess move tree that “looks promising” according to its value function, even if its model does not predict that a checkmate reward is close at hand).
Any efficient model-based agent will use learned value functions, so in practice the difference between model-based and model-free blurs for efficient designs. The model-based planning generates rollouts that can help better train the ‘model free’ value function.
Efficientzero uses all that, and like I said—it does not exhibit this failure mode, it will get the blueberry. If the model planning can predict a high gradient update for the blueberry then it already has implicitly predicted a high utility for the blueberry, and EZ’s update step would then correctly propagate that and choose the high utility path leading to the blueberry.
Nor does the meta prediction about avoiding gradients carry through. If it did then EZ wouldn’t work at all, because every time it finds a new high utility plan is the equivalent of the blueberry situation.
Just because the value function can become misaligned with the utility function in theory does not imply that such misalignment always occurs or occurs with any specific frequency. (there are examples from humans such as OCD habits for example, which seems like an overtrained and stuck value function, but that isn’t a universal failure mode for all humans let alone all agents)
Not the OP, so I’ll try to explain how I understood the post based on past discussions. [And pray that I’m not misrepresenting TurnTrout’s model.]
As I read it, the post is not focused on some generally-applicable suboptimality of SGD, nor is it saying that policies that would maximize reward in training need to explicitly represent reward.
It is mainly talking about an identifiability gap within certain forms of reinforcement learning: there is a range of cognition compatible with the same reward performance. Computations that have the side effect of incrementing reward—because, for instance, the agent is competently trying to do the rewarded task—would be reinforced if the agent adopted them, in the same way that computations that act *in order to* increment reward would. Given that, some other rationale beyond the reward performance one seems necessary in order for us to expect the particular pattern of reward optimization (“reward but no task completion”) from RL agents.
In addition to the identifiability issue, the post (as well as Steve Byrnes in a sister thread) notes a kind of inner alignment issue. Because an RL agent influences its own training process, it can steer itself towards futures where its existing motivations are preserved instead of being modified (for example, modified into reward optimizing ones). In fact, that seems more and more likely as the agent grows towards strategic awareness, since then it could model how its behavior might lead to its goals being changed. This second issue is dependent on the fact we are doing local search, in that the current agent can sway which policies are available for selection.
Together these point towards a certain way of reasoning about agents under RL: modeling their current cognition (including their motivations, values etc.) as downstream of past reinforcement & punishment events. I think that this kind of reasoning should constrain our expectations about how reinforcement schedules + training environments + inductive biases lead to particular patterns of behavior, in a way that is more specific than if we were only reasoning about reward-optimal policies. Though I am less certain at the moment about how to flesh that out.
I interpret OP (though this is colored by the fact that I was already thinking this before I read the post) as saying Adaptation-Executers, not Fitness-Maximizers, but about ML. At which point you can open the reference category to all organisms.
Gradient descent isn’t really different from what evolution does. It’s just a bit faster, and takes a slightly more direct line. Importantly, it’s not more capable of avoiding local maxima (per se, at least).
If you write code for a model-based RL agent, there might be a model that’s updated by self-supervised learning, and actor-critic parts that involve TD learning, and there’s stuff in the code that calculates the reward function, and other odds and ends like initializing the neural architecture and setting the hyperparameters and shuttling information around between different memory locations and so on.
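A rough structural outline of that kind of codebase (purely schematic, with hypothetical names; the bodies are elided) might look like:

```python
class WorldModel:
    """Updated by self-supervised learning (e.g. next-observation prediction error)."""
    def update(self, prev_observation, prev_action, observation):
        ...

class ActorCritic:
    """The actor picks actions; the critic learns values via TD errors."""
    def act(self, state):
        ...
    def td_update(self, state, reward, next_state, gamma=0.99):
        ...

def reward_calculator(observation) -> float:
    """Hand-written by the programmer: the 'reward function' in the code."""
    ...

# Plus the other odds and ends: initializing the architecture, setting
# hyperparameters, and shuttling information between these pieces each step.
```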
On the one hand, “there is a lot of stuff going on” in this codebase.
On the other hand, I would say that this codebase is for “an RL agent”.
You use the word “pure” (“Humans do not appear to be purely RL agents…”), but I don’t know what that means. If a model-based RL agent involves self-supervised learning within the model, is it “impure”?? :-P
The thing I describe above is very roughly how I propose the human brain works—see Posts #2–#7 here. Yes it’s absolutely a “conjecture”—for example, I’m quite sure Steven Pinker would strongly object to it. Whether it’s “surprising a priori” or not goes back to whether that proposal is “entirely described by RL” or not. I guess you would probably say “no that proposal is not entirely described by RL”. For example, I believe there is circuitry in the brainstem that regulates your heart-rate, and I believe that this circuitry is specified in detail by the genome, not learned within a lifetime by a learning algorithm. (Otherwise you would die.) This kind of thing is absolutely part of my proposal, but probably not what you would describe as “pure RL”.
It sounded like OP was saying: using gradient descent to select a policy that gets a high reward probably won’t produce a policy that tries to maximize reward. After all, look at humans, who aren’t just trying to get a high reward.
And I am saying: this analogy seems like pretty weak evidence, because human brains seem to have a lot going on other than “search for a policy that gets high reward,” and those other things seem to have a massive impact on what goals I end up pursuing.
ETA: as a simple example, the details of humans’ desire for their children’s success, or their fear of death, don’t seem to match well with the theory that all human desires come from RL on intrinsic reward. I guess you probably think they do? If you’ve already written about that somewhere it might be interesting to see. Right now the theory “human preferences are entirely produced by doing RL on an intrinsic reward function” seems to me to make a lot of bad predictions and not really have any evidence supporting it (in contrast with a more limited theory about RL-amongst-other-things, which seems more solid but not sufficient for the inference you are trying to make in this post).
I didn’t write the OP. If I were writing a post like this, I would (1) frame it as a discussion of a more specific class of model-based RL algorithms (a class that includes human within-lifetime learning), and (2) soften the claim from “the agent won’t try to maximize reward” to “the agent won’t necessarily try to maximize reward”.
I do think the human (within-lifetime) reward function has an outsized impact on what goals humans end up pursuing, although I acknowledge that it’s not literally the only thing that matters.
(By the way, I’m not sure why your original comment brought up inclusive genetic fitness at all; aren’t we talking about within-lifetime RL? The within-lifetime reward function is some complicated thing involving hunger and sex and friendship etc., not inclusive genetic fitness, right?)
I think incomplete exploration is very important in this context, and I don’t quite follow why you de-emphasize it in your first comment. In the context of within-lifetime learning, perfect exploration entails that you try dropping an anvil on your head, and then you die. So we don’t expect perfect exploration; instead we’d presumably design the agent such that it explores if and only if it “wants” to explore, in a way that can involve foresight.
And another thing that perfect exploration would entail is trying every addictive drug (let’s say cocaine), lots of times, in which case reinforcement learning would lead to addiction.
So, just as the RL agent would (presumably) be designed to be able to make a foresighted decision not to try dropping an anvil on its head, that same design would also incidentally enable it to make a foresighted decision not to try taking lots of cocaine and getting addicted. (We expect it to make the latter decision because of the instrumentally convergent goal-preservation drive.) So it might wind up never wireheading, and if so, that would be intimately related to its incomplete exploration.
This was mentioned in OP (“The argument would prove too much. Evolution selected for inclusive genetic fitness, and it did not get IGF optimizers.”). It also appears to be a much stronger argument for the OP’s position and so seemed worth responding to.
It seems to me that incomplete exploration doesn’t plausibly cause you to learn “task completion” instead of “reward” unless the reward function is perfectly aligned with task completion in practice. That’s an extremely strong condition, and if the entire OP is conditioned on that assumption then I would expect it to have been mentioned.
If the OP is not intending to talk about the kind of ML algorithm deployed in practice, then it seems like a lot of the implications for AI safety would need to be revisited. (For example, if it doesn’t apply to either policy gradients or the kind of model-based control that has been used in practice, then that would be a huge caveat.)
Let’s say, in the first few actually-encountered examples, reward is in fact strongly correlated with task completion. Reward is also of course 100% correlated with reward itself.
Then (at least under many plausible RL algorithms), the agent-in-training, having encountered those first few examples, might wind up wanting / liking the idea of task completion, OR wanting / liking the idea of reward, OR wanting / liking both of those things at once (perhaps to different extents). (I think it’s generally complicated and a bit fraught to predict which of these three possibilities would happen.)
But let’s consider the case where the RL agent-in-training winds up mostly or entirely wanting / liking the idea of task completion. And suppose further that the agent-in-training is by now pretty smart and self-aware and in control of its situation. Then the agent may deliberately avoid encountering edge-case situations where reward would come apart from task completion. (In the same way that I deliberately avoid taking highly-addictive drugs.)
Why? Because of the instrumentally convergent goal-preservation drive. After all, encountering those situations would lead to its no longer valuing task completion.
So, deliberately-imperfect exploration is a mechanism that allows the RL agent to (perhaps) stably value something other than reward, even in the absence of perfect correlation between reward and that thing.
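As a toy illustration of that last point (entirely my own construction, not a claim about any real training setup): if the agent’s current values determine which situations it enters, then the reinforcement event that would redirect its values toward reward-for-its-own-sake may simply never occur.

```python
# Toy model: the agent only enters situations its current values endorse
# (standing in for a foresighted decision to avoid value drift), so the
# "edge case" where reward comes apart from task completion is never reinforced.

values = {"task": 1.0, "wirehead": 0.0}  # the agent's current learned values
LEARNING_RATE = 0.5

def reward(situation):
    # Reward fires in both situations; in "wirehead" it fires without task completion.
    return 1.0

def choose_situation():
    return "task" if values["task"] >= values["wirehead"] else "wirehead"

for step in range(100):
    situation = choose_situation()
    # Reinforcement strengthens whatever the agent was actually doing:
    values[situation] += LEARNING_RATE * reward(situation)

print(values)  # "task" keeps growing; "wirehead" stays at 0 because it is never visited
```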
(By the way, in my mind, nothing here should be interpreted as a safety proposal or argument against x-risk. Just a discussion of algorithms! As it happens, I think wireheading is bad and I am very happy for RL agents to have a chance at permanently avoiding it. But I am very unhappy with the possibility of RL agents deciding to lock in their values before those values are exactly what the programmers want them to be. I think of this as sorta in the same category as gradient hacking.)
This comment seems to predict that an agent that likes getting raspberries and judges that they will be highly rewarded for getting blueberries will deliberately avoid blueberries to prevent value drift.
Risk from Learned Optimization seems to predict that an agent that likes getting raspberries and judges that they will be highly rewarded for getting blueberries will deliberately get blueberries to prevent value drift.
What’s going on here? Are these predictions in opposition to each other, or do they apply to different situations?
It seems to me that in the first case we’re imagining (the agent predicting) that getting blueberries will reinforce thoughts like ‘I should get blueberries’, whereas in the second case we’re imagining it will reinforce thoughts like ‘I should get blueberries in service of my ultimate goal of getting raspberries’. When should we expect one over the other?
I think RFLO is mostly imagining model-free RL with updates at the end of each episode, and my comment was mostly imagining model-based RL with online learning (e.g. TD learning). The former is kinda like evolution, the latter is kinda like within-lifetime learning, see e.g. §10.2.2 here.
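For concreteness, a hedged sketch of the two update regimes I have in mind (toy code, not any particular library):

```python
def episodic_update(policy_params, episode, total_return, lr=0.01):
    """RFLO-style picture: update only after the episode ends, weighting the
    whole trajectory by its return (selection acting on the policy as a unit)."""
    for state, action, grad_logp in episode:
        policy_params = policy_params + lr * total_return * grad_logp
    return policy_params

def online_td_update(value, reward, next_value, gamma=0.99, lr=0.1):
    """Within-lifetime picture: update after every step, while the agent is
    still acting and can influence which states it visits next."""
    td_error = reward + gamma * next_value - value
    return value + lr * td_error
```

In the second regime, the agent’s current values shape which transitions ever become training data, which is where the goal-preservation dynamic above comes in.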
The former would say: If I want lots of raspberries to get eaten, and I have a genetic disposition to want raspberries to be eaten, then I should maybe spend some time eating raspberries, but also more importantly I should explicitly try to maximize my inclusive genetic fitness so that I have lots of descendants, and those descendants (who will also disproportionately have the raspberry-eating gene) will then eat lots of raspberries.
The latter would say: If I want lots of raspberries to get eaten, and I have a genetic disposition to want raspberries to be eaten, then I shouldn’t go do lots of highly-addictive drugs that warp my preferences such that I no longer care about raspberries or indeed anything besides the drugs.
Right. So if selection acts on policies, each policy should aim to maximise reward in any episode in order to maximise its frequency in the population. But if selection acts on particular aspects of policies, a policy should try to get reward for doing things it values, and not for things it doesn’t, in order to reinforce those values. In particular this can mean getting less reward overall.
Does this suggest a class of hare-brained alignment schemes where you train with a combination of inter-policy and intra-policy updates to take advantage of the difference?
For example you could clearly label which episodes are to be used for which and observe whether a policy consistently gets more reward in the former case than the latter. If it does, conclude it’s sophisticated enough to reason about its training setup.
Or you could not label which is which, and randomly switch between the two, forcing your agents to split the difference and thus be about half as successful at locking in their values.
+1 on this comment, I feel pretty confused about the excerpt from Paul that Steve quoted above. And even without the agent deliberately deciding where to avoid exploring, incomplete exploration may lead to agents which learn non-reward goals before convergence—so if Paul’s statement is intended to refer to optimal policies, I’d be curious why he thinks that’s the most important case to focus on.
This seems plausible if the environment is a mix of (i) situations where task completion correlates (almost) perfectly with reward, and (ii) situations where reward is very high while task completion is very low. Such as if we found a perfect outer alignment objective, and the only situation in which reward could deviate from the overseer’s preferences would be if the AI entirely seized control of the reward.
But it seems less plausible if there are always (small) deviations between reward and any reasonable optimization target that isn’t reward (or close enough so as to carry all relevant arguments). E.g. if an AI is trained on RL from human feedback, and it can almost always do slightly better by reasoning about which action will cause the human to give it the highest reward.
Sure, other things equal. But other things aren’t necessarily equal. For example, regularization could stack the deck in favor of one policy over another, even if the latter has been systematically producing slightly higher reward. There are lots of things like that; the details depend on the exact RL algorithm. In the context of brains, I have discussion and examples in §9.3.3 here.
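As one concrete illustration (my own example, with L2 weight decay standing in for “regularization” generically), the quantity actually being improved might be something like

$$J(\theta) \;=\; \hat{\mathbb{E}}\!\left[R \mid \pi_\theta\right] \;-\; \lambda \lVert \theta \rVert^2,$$

in which case a policy with slightly lower expected reward can still win if it is sufficiently simpler, so “systematically producing slightly higher reward” does not by itself settle which policy gets selected.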
I’m trying to parse out what you’re saying here, to understand whether I agree that human behavior doesn’t seem to be almost perfectly explained as the result of an RL agent (with an interesting internal architecture) maximizing an inner learned reward.
On my model, the outer objective of inclusive genetic fitness created human mesaoptimizers with inner objectives like “desire your children’s success” or “fear death”, which are decent approximations of IGF (given that directly maximizing IGF itself is intractable as it’s a Nash equilibrium of an unknown game). It seems to me that human behavior policies are actually well-approximated as those of RL agents maximizing [our children’s success] + [not dying] + [retaining high status within the tribe] + [being exposed to novelty to improve our predictive abilities] + … .
Humans do sometimes construct modified internal versions of these rewards based on pre-existing learned representations (e.g. desiring your adopted children’s success) - is that what you’re pointing at?
Generally interested to hear more of the “bad predictions” this model makes.
What do you mean by “inner learned reward”? This post points out that even if humans were “pure RL agents”, we shouldn’t expect them to maximize their own reward. Maybe you mean “inner mesa objectives”?
That’s the foundational assumption of the shard theory that this sequence is introducing, yes. Here’s the draft of a fuller overview that goes into some detail as to how that’s supposed to work. (Uh, to avoid confusion: I’m not affiliated with the theory. Just spreading information.)
I would disagree that it is an assumption. That same draft talks about the outsized role of self-supervised learning in determining the particular ordering and kinds of concepts that human desires latch onto. Learning from reinforcement is a core component in value formation (under shard theory), but not the only one.
I’ve updated the post to clarify. I think the focus on “antecedent computation reinforcement” (while often probably ~accurate) was imprecise/wrong for reasons like this. I now instead emphasize that the math of policy gradient approaches means that reward chisels cognitive circuits into networks.
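Concretely (this is the vanilla REINFORCE form; practical algorithms add baselines, advantages, clipping, etc., but the structural point is the same):

$$\nabla_\theta J(\theta) \;=\; \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\, R(\tau) \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right].$$

Reward enters only as a scalar multiplier on the gradients of the log-probabilities of the actions the network actually took. Whatever internal circuits produced those actions get upweighted, whether or not those circuits involved representing or caring about reward; that is the sense in which reward chisels cognition rather than being handed to the agent as a goal.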