If the labels are not perfect, then the major failure mode is that the AI ends up learning the actual labelling process rather than the intended natural abstraction.
I don’t think this is true. For example, humans do not usually end up optimizing for the activations of their reward circuitry, not even neuroscientists. Also note that humans do not infer the existence of their reward circuitry simply from observing the sequence of reward events. They have to learn about it by reading neuroscience. I think that steps like “infer the existence / true nature of distant latent generators that explain your observations” are actually incredibly difficult for neural learning processes (human or AI). Empirically, SGD is perfectly willing to memorize deviations from a simple predictor, rather than generalize to a more complex predictor. Current ML would look very different if inferences like that were easy to make (and science would be much easier for humans).
Even when a distant latent generator is inferred, it is usually not the correct generator, and usually just memorizes observations in a slightly more efficient way by reusing current abstractions. E.g., religions which suppose that natural disasters are the result of a displeased, agentic force.
I partly buy that, but we can easily adjust the argument about incorrect labels to circumvent that counterargument. It may be that the full label generation process is too “distant”/complex for the AI to learn in early training, but insofar as there are simple patterns to the humans’ labelling errors (which of course there usually are, in practice) the AI will still pick up those simple patterns, and shards which exploit those simple patterns will be more reinforced than the intended shard. It’s like that example from the RLHF paper where the AI learns to hold a grabber in front of a ball to make it look like it’s grabbing the ball.
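To make the "simple patterns in labelling errors" point concrete, here is a toy sketch (features, labels, and numbers all invented for illustration): a labeler systematically mislabels shiny non-diamonds as diamonds, and a simple logistic regression trained on those labels picks up the spurious "shiny" pattern along with the real one.

```python
import math

# Hypothetical toy setup: each object has two binary features.
#   f_true: it actually is a diamond
#   f_spur: it is shiny
# The labeler makes a *systematic* error: shiny non-diamonds get labeled 1.
data = [
    ((1, 0), 1),  # real, dull diamond -> labeled diamond (correct)
    ((1, 1), 1),  # real, shiny diamond -> labeled diamond (correct)
    ((0, 0), 0),  # dull glass -> labeled non-diamond (correct)
    ((0, 1), 1),  # shiny glass -> MISLABELED as diamond
]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Train logistic regression by plain gradient descent on the noisy labels.
w = [0.0, 0.0]
b = 0.0
lr = 0.5
for _ in range(5000):
    for (x, y) in data:
        p = sigmoid(w[0] * x[0] + w[1] * x[1] + b)
        err = p - y
        w[0] -= lr * err * x[0]
        w[1] -= lr * err * x[1]
        b -= lr * err

def predict(x):
    return sigmoid(w[0] * x[0] + w[1] * x[1] + b)

# The learner reproduces the labeler's error pattern, not the true concept:
print(predict((0, 1)))  # shiny glass confidently scored as a "diamond"
```

The learned weights assign positive weight to the spurious feature, so the "exploit the labeler" behavior is reinforced exactly as much as the intended one.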
I think something like what you’re describing does occur, but my view of SGD is that it’s more “ensembly” than that. Rather than “the diamond shard is replaced by the pseudo-diamond-distorted-by-mislabeling shard”, I expect the agent to have both such shards (really, a giant ensemble of shards each representing slightly different interpretations of what a diamond is).
Behaviorally speaking, this manifests as the agent having preferences for certain types of diamonds over others. E.g., one very simple example is that I expect the agent to prefer nicely cut and shiny diamonds over unpolished diamonds or giant slabs of pure diamond. This is because I expect human labelers to be strongly biased towards the human conception of diamonds as pieces of art, over any configuration of matter with the chemical composition of a diamond.
Why does the ensembling matter?
I could imagine a story where it matters—e.g. if every shard has a veto over plans, and the shards are individually quite intelligent subagents, then the shards bargain and the shard-which-does-what-we-intended has to at least gain over the current world-state (otherwise it would veto). But that’s a pretty specific story with a lot of load-bearing assumptions, and in particular requires very intelligent shards. I could maybe make an argument that such bargaining would be selected for even at low capability levels (probably by something like Why Subagents?), but I wouldn’t put much confidence in that argument.
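A minimal sketch of that bargaining story (the shards, plans, and utility numbers are all invented for illustration): each shard can veto any plan that leaves it worse off than the status quo, so only Pareto improvements over the current world-state survive.

```python
# Hypothetical shard utilities over candidate plans (made-up numbers).
STATUS_QUO = "status_quo"

utilities = {
    "intended_diamond_shard": {"status_quo": 0.0, "honest_mining": 2.0, "fake_diamonds": -5.0},
    "shiny_object_shard":     {"status_quo": 0.0, "honest_mining": 1.0, "fake_diamonds": 3.0},
}

def surviving_plans(plans, utilities):
    """Keep only plans no shard vetoes, i.e. weak gains for every shard."""
    ok = []
    for plan in plans:
        if all(u[plan] >= u[STATUS_QUO] for u in utilities.values()):
            ok.append(plan)
    return ok

plans = ["status_quo", "honest_mining", "fake_diamonds"]
candidates = surviving_plans(plans, utilities)
# "fake_diamonds" is vetoed by the intended shard; among the rest,
# pick (say) the plan with the highest total utility.
best = max(candidates, key=lambda p: sum(u[p] for u in utilities.values()))
print(best)  # honest_mining
```

Note how load-bearing the assumptions are: this only protects us because one shard's utility function exactly tracks what we intended.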
… and even then, that argument would only work if at least one shard exactly matches what we intended. If all of the shards are imperfect proxies, then most likely there are actions which can Goodhart all of them simultaneously. (After all, proxy failure is not something we’d expect to be uncorrelated—conditions which cause one proxy to fail probably cause many to fail in similar ways.)
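A toy version of the correlated-failure point (the functions and constants are invented for the sketch): even if a plan must score well on every proxy simultaneously, proxies that share a failure mode can all be Goodharted by the same action.

```python
# True utility: we want x near 1.
def true_utility(x):
    return -(x - 1.0) ** 2

# Several imperfect proxies. Each matches the true utility on-distribution,
# but they share a correlated failure mode: all of them overvalue large x.
def make_proxy(c):
    return lambda x: true_utility(x) + c * max(0.0, x - 3.0) ** 2

proxies = [make_proxy(c) for c in (2.0, 2.1, 2.2)]

# Search for the plan that maximizes even the *worst* proxy score,
# i.e. a plan that every proxy simultaneously endorses.
grid = [i / 10.0 for i in range(0, 101)]  # x in [0, 10]
best = max(grid, key=lambda x: min(p(x) for p in proxies))

print(best, true_utility(best))
```

The maximizer of the worst-case proxy score lands deep in the shared failure region, where the true utility is far worse than just staying at x = 1.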
On the other hand, consider a more traditional “ensemble”, in which our ensemble of shards votes (with weights) or something. Typically, I expect training dynamics will increase the weight on a component exponentially w.r.t. the number of bits it correctly “predicts”, so exploiting even a relatively small handful of human-mislabellings will give the exploiting shards much more weight. And on top of that, a mix of shards does not imply a mix of behavior; if a highly correlated subset of the shards controls a sufficiently large chunk of the weight, then they’ll have de-facto control over the agent’s behavior.
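A sketch of that weight dynamic under a Bayesian-mixture / multiplicative-weights update (the 0.9/0.1 likelihoods and the error counts are made-up numbers): each component's weight is multiplied by its likelihood on every label, so each mislabel the exploiting component "predicts" correctly multiplies its relative weight by a constant factor, giving growth exponential in the number of exploited bits.

```python
# Two ensemble components, updated multiplicatively (Bayes-style):
#   "intended": predicts the true label, so it misses every mislabeled example
#   "exploiter": predicts the labeler's output, errors included
weights = {"intended": 1.0, "exploiter": 1.0}

P_CORRECT, P_WRONG = 0.9, 0.1  # assumed per-bit likelihoods

n_examples, n_mislabels = 1000, 5
for i in range(n_examples):
    mislabeled = i < n_mislabels
    # Each component's weight is multiplied by its likelihood on this label.
    weights["exploiter"] *= P_CORRECT                       # always matches the label
    weights["intended"] *= P_WRONG if mislabeled else P_CORRECT

# The relative weight ratio is (0.9/0.1)^n_mislabels = 9^5 ~ 59,049:
# exponential in the number of exploited mislabels, even though the two
# components disagree on only 5 of 1000 examples.
total = sum(weights.values())
share = weights["exploiter"] / total
print(share)
```

So even a "relatively small handful" of systematic labelling errors is enough to hand the exploiting component effectively all of the weight.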
I think there’s something like “why are human values so ‘reasonable’, such that [TurnTrout inference alert!] someone can like coffee and another person won’t and that doesn’t mean they would extrapolate into bitter enemies until the end of Time?”, and the answer seems like it’s gonna be because they don’t have one criterion of Perfect Value that is exactly right over which they argmax, but rather they do embedded, reflective heuristic search guided by thousands of subshards (shiny objects, diamonds, gems, bright objects, objects, power, seeing diamonds, knowing you’re near a diamond, …), such that removing a single subshard does not catastrophically exit the regime of Perfect Value.
I think this is one proto-intuition why Goodhart-arguments seem Wrong to me, like they’re from some alien universe where we really do have to align a non-embedded argmax planner with a crisp utility function. (I don’t think I’ve properly communicated my feelings in this comment, but hopefully it’s better than nothing)
My intuition is that in order to go beyond imitation learning and random exploration, we need some sort of “iteration” system (a la IDA), and the cases of such systems that we know of tend to either literally be argmax planners with crisp utility functions, or have similar problems to argmax planners with crisp utility functions.
What about this post?
Well so you’re obviously pretraining using imitation learning, so I’ve got that part down.
If I understand your post right, the rest of the policy training is done by policy gradients on human-induced rewards? As I understand it, policy gradient is close to a maximally sample-hungry method, because it does not do any modelling. At one level I would class this as random exploration, but on another level the humans are allowed to provide reinforcement based on methods rather than results, so I suppose this also gives it an element of imitation learning.
So I guess my expectation is that your training method is too sample inefficient to achieve much beyond human imitation.
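For concreteness, here is a minimal sketch of the kind of policy-gradient (REINFORCE) update under discussion, on a trivial two-action bandit invented for the example. The point of the sketch is that the update touches only sampled (action, reward) pairs and never builds a model of the environment, which is where the sample-hunger comes from.

```python
import math
import random

random.seed(0)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy environment (assumed for the sketch): action 1 pays 1, action 0 pays 0.
def reward(action):
    return float(action)

theta = 0.0  # policy parameter: P(action=1) = sigmoid(theta)
lr = 0.5
for _ in range(500):
    p = sigmoid(theta)
    action = 1 if random.random() < p else 0
    r = reward(action)
    # Score-function (REINFORCE) estimator: grad log pi(action) * reward.
    # For a Bernoulli policy, grad log pi(action) = action - p.
    # No model of the environment is ever learned; only sampled rewards are used.
    theta += lr * (action - p) * r

print(sigmoid(theta))  # probability of the better action after training
```

Even this two-action problem takes hundreds of sampled episodes; nothing in the update generalizes across actions or states the way a learned model would.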
if every shard has a veto over plans, and the shards are individually quite intelligent subagents
I think this won’t happen FWIW.
and even then, that argument would only work if at least one shard exactly matches what we intended. If all of the shards are imperfect proxies, then most likely there are actions which can Goodhart all of them simultaneously. (After all, proxy failure is not something we’d expect to be uncorrelated—conditions which cause one proxy to fail probably cause many to fail in similar ways.)
Can you provide a concrete instantiation of this argument? (ETA: struck this part, want to hear your response first to make sure it’s engaging with what you had in mind)
I expect training dynamics will increase the weight on a component exponentially w.r.t. the number of bits it correctly “predicts”
Which part of your argument behaves differently for humans versus AI? This is clearly not how shard dynamics work in people, as I understand your argument.
We aren’t in the prediction regime, insofar as that is supposed to be relevant for your argument. Let’s talk about the batch update, and not make analogies to predictions. (Although perhaps I was the one who originally brought it up in OP, I should rewrite that.)
Can you give me a concrete example of an “exploiting shard” in this situation which is learnable early on, relative to the actual diamond-shards?
And on top of that, a mix of shards does not imply a mix of behavior; if a highly correlated subset of the shards controls a sufficiently large chunk of the weight, then they’ll have de-facto control over the agent’s behavior.
The point I am arguing (ETA and I expect Quintin is as well, but maybe not) is that this will be one of the primary shards produced, not that there’s a chance it exists at low weight or something.
… and even then, that argument would only work if at least one shard exactly matches what we intended. If all of the shards are imperfect proxies, then most likely there are actions which can Goodhart all of them simultaneously. (After all, proxy failure is not something we’d expect to be uncorrelated—conditions which cause one proxy to fail probably cause many to fail in similar ways.)
I read this as “the activations and bidding behaviors of the shards will themselves be imperfect, so you get the usual ‘Goodhart’ problem where highly rated plans are systematically bad and not what you wanted.” I disagree with the conclusion, at least for many kinds of “imperfections.”
Below is one shot at instantiating the failure mode you’re describing. I wrote this story so as to (hopefully) contain the relevant elements. This isn’t meant as a “slam dunk case closed”, but hopefully something which helps you understand how I’m thinking about the issue and why I don’t anticipate “and then the shards get Goodharted.”
Example shard-Goodharting scenario. The AI bids for plans which it thinks lead to diamonds, except that the subcircuit of the policy network which computes the relevant diamond abstraction is only a “proxy” for a reliable diamond abstraction. Unknown to the AI until the end of its training, that subcircuit (for some reason) activates very strongly for plans which lead to certain diamond-shaped formations of bacteria on the third Tuesday of the year.
Then this shard can be “Goodharted” by actions which involve the creation of these bacteria diamonds at that time. There’s a question, though, of whether the AI will actually consider such a plan (and so actually bid on it, even though it is rated spuriously highly from our perspective). The AI knows, abstractly, that considering this plan would lead it to bid for that plan. But it seems to me that, since generating that plan is reflectively predicted to not lead to diamonds (nor does it activate the specific bidding-behavior edge case the agent abstractly knows about), the agent doesn’t pursue that plan.
This was one of the main ideas I discussed in Alignment allows “nonrobust” decision-influences and doesn’t require robust grading:
Nonrobust decision-influences can be OK. A candy-shard contextually influences decision-making. Many policies lead to acquiring lots of candy; the decision-influences don’t have to be “globally robust” or “perfect.”
Values steer optimization; they are not optimized against. The value shards aren’t getting optimized hard. The value shards are the things which optimize hard, by wielding the rest of the agent’s cognition (e.g. the world model, the general-purpose planning API).
Since values are not the optimization target of the agent with those values, the values don’t have to be adversarially robust.
Since values steer cognition, reflective agents try to avoid adversarial inputs to their own values. In self-reflective agents which can think about their own thinking, values steer e.g. what plans get considered next. Therefore, these agents convergently avoid adversarial inputs to their currently activated values (e.g. learning), because adversarial inputs would impede fulfillment of those values (e.g. lead to less learning).
This suggests “and so what is an ‘adversarial input’ to the values, then? What intensional rule governs the kinds of high-scoring plans which internal reasoning will decide to not evaluate in full?”. I haven’t answered that question yet on an intensional basis, but it seems tractable.
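The reflective-filtering step in the scenario above can be sketched as follows (the plans, the buggy evaluator, and the world-model predictions are all invented for illustration): plan generation is steered by the agent's own prediction of diamonds, so the adversarial input to the buggy evaluation circuit is never put forward for bidding.

```python
# Hypothetical plans, with a hidden bug in the evaluation circuit.
plans = {
    "build_diamond_mine":  {"bacteria": False, "tuesday": False},
    "bacteria_formations": {"bacteria": True,  "tuesday": True},
}

def evaluation_circuit(plan):
    """Buggy diamond-value circuit: the bacteria/Tuesday edge case scores huge."""
    if plans[plan]["bacteria"] and plans[plan]["tuesday"]:
        return 1000.0  # adversarial input: spuriously high
    return 10.0 if plan == "build_diamond_mine" else 0.0

def world_model_predicts_diamonds(plan):
    """The agent's reflective prediction of whether the plan yields diamonds."""
    return plan == "build_diamond_mine"

# Reflective planning: only plans predicted to actually lead to diamonds are
# generated and bid on, so the evaluator's edge case is never exercised.
candidates = [p for p in plans if world_model_predicts_diamonds(p)]
chosen = max(candidates, key=evaluation_circuit)
print(chosen)  # build_diamond_mine
```

The open question in the text remains: stating the intensional rule that decides which high-scoring plans the filter declines to evaluate in full.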