Your comment here is great: high-effort, with lots of interpretive labor. Thanks so much!
So when you say “encourages you to repeatedly condition on speculative dangers until you’re exploring a tiny and contorted part of solution-space”, my first thought is that you missed the part where Builder can respond to Breaker’s purported counterexamples with arguments such as the ones you suggest.
Let me see how this would work.
Breaker: “The agent might wirehead, because caring about physical reward is a high-reward policy during training.”
Builder: “Possible, but I think using reward signals is still the best way forward. I think the risk is relatively low, due to the points made in ‘Reward is not the optimization target.’”
Breaker: “So are we assuming a policy gradient-like algorithm for the RL finetuning?”
Builder: “Sure.”
Breaker: “What if there’s a subnetwork which is a reward maximizer, due to the lottery ticket hypothesis (LTH)?”
...
If that’s how it might go, then sure, this seems productive.
But, perhaps more plausibly, you didn’t miss that point, and are instead pointing to a bias you see in the reasoning process: a tendency to over-weigh counterexamples as if they were knockdown arguments, and to forget to do the heuristic-search thing where you go back and expand previously-unpromising-seeming nodes when you seem to be getting stuck elsewhere in the tree.
I’m tempted to conjecture that you should debug this as a flaw in how I apply builder/breaker style reasoning, as opposed to the reasoning scheme itself—why should builder/breaker be biased in this way?
I don’t think I was mentally distinguishing between “the idealized builder-breaker process” and “the process as TurnTrout believes it to be usually practiced.” I think you’re right that I should be critiquing the latter, though not necessarily how you in particular practice it; I don’t know much about that. I’m critiquing my own historical experience with the process, as I imperfectly recall it.
I guess maybe your whole point is that the builder/breaker game focuses on constructing arguments, while in fact we can resolve some of our uncertainty through empirical means.
Yes, I think this was most of my point. Nice summary.
at some point as the NN gets large enough, the random subnetworks already instantiate the undesired hypothesis (along with the desired one), so they must be differentiated via learning (ie, “incentives”, ie, gradients which actually specifically point in the desired direction and away from the undesired direction).
I expect this argument to not hold, but I’m not yet good enough at ML theory to be super confident. Here are some intuitions. Even if it’s true that LTH probabilistically ensures the existence of the undesired subnetwork:
1. Gradient updates are pointed in the direction of most rapid loss-improvement per unit step. I expect most of the “distance covered” to be in non-training-process-modeling directions for simplicity reasons (I understand this argument to be a predecessor of the NTK arguments.)
2. You’re always going to have identifiability issues with respect to the loss signal. This could mean that either: (a) the argument is wrong, or (b) training-process-optimization is unavoidable, or (c) we can somehow make it not apply to networks of AGI size.
3. Even if the agent is motivated both by the training process and by the object-level desired hypothesis (since the gradients would reinforce both directions), on shard theory, that’s OK, an agent can value several things. The important part is that the desired shards cast shadows into both the agent’s immediate behavior and its reflectively stable (implicit) utility function.
Probably there’s some evidence from neural selectionism which is relevant here, but I’m not sure which direction it points.
Seems like the most significant remaining disagreement (perhaps).
1. Gradient updates are pointed in the direction of most rapid loss-improvement per unit step. I expect most of the “distance covered” to be in non-training-process-modeling directions for simplicity reasons (I understand this argument to be a predecessor of the NTK arguments.)
So I am interpreting this argument as: even if LTH implies that a nascent/potential hypothesis is training-process-modeling (in an NTK & LTH sense), you expect the gradient to go against it (favoring non-training-process-modeling hypotheses) because non-training-process-modelers are simpler.
This is a crux for me; if we had a simplicity metric that we had good reason to believe filtered out training-process-modeling, I would see the deceptive-inner-optimizer concern as basically solved (modulo the solution being compatible with other things we want).[1]
I think Solomonoff-style program simplicity probably doesn’t do it; the simplest program fitting a bunch of data from our universe quite plausibly models our universe.
I think circuit-simplicity doesn’t do it; simple circuits which perform complex tasks are still more like algorithms than lookup tables, ie, still try to model the world in a pretty deep way.
I think Vanessa has some interesting ideas on how infra-Bayesian physicalism might help deal with inner optimizers, but on my limited understanding, I think not by ruling out training-process-modeling.
In other words, it seems to me like a tough argument to make, one which, on my understanding, no one has been able to make so far despite trying; but it’s not an obviously wrong direction.
2. You’re always going to have identifiability issues with respect to the loss signal. This could mean that either: (a) the argument is wrong, or (b) training-process-optimization is unavoidable, or (c) we can somehow make it not apply to networks of AGI size.
I don’t really see your argument here? How does (identifiability issues → (argument is wrong ∨ training-process-optimization is unavoidable ∨ we can somehow make it not apply to networks of AGI size))?
In my personal estimation, shaping NNs in the right way is going to require loss functions which open up the black box of the NN, rather than only looking at outputs. In principle this could eliminate identifiability problems entirely (eg “here is the one correct network”), although I do not fully expect that.
A ‘good prior’ would also solve the identifiability problem well enough. (eg, if we could be confident that a prior promotes non-deceptive hypotheses over similar deceptive hypotheses.)
But none of this is necessarily interfacing with your intended argument.
3. Even if the agent is motivated both by the training process and by the object-level desired hypothesis (since the gradients would reinforce both directions), on shard theory, that’s OK, an agent can value several things. The important part is that the desired shards cast shadows into both the agent’s immediate behavior and its reflectively stable (implicit) utility function.
Here’s how I think of this part. A naïve EU-maximizing agent, uncertain between two hypotheses about what’s valuable, might easily decide to throw one under the bus for the other. Wireheading is analogous to a utility monster here—something that the agent is, on balance, justified in throwing approximately all its resources at, basically neglecting everything else.
A bargaining-based agent, on the other hand, can “value several things” in a more significant sense. Simple example:
U1(1) = 1, U1(2) = 2, U2(1) = 2, U2(2) = 1 (here the arguments 1 and 2 are the two possible outcomes).
U1 and U2 are almost equally probable hypotheses about what to value.
EU maximization maximizes whichever hypothesis happens to be slightly more probable.
Nash bargaining instead selects a 50-50 split between the two, flipping a coin to divide the outcomes fairly.
In order to mitigate risks due to bad hypotheses, we want more “bargaining-like” behavior rather than “EU-like” behavior. (A worked version of this example is sketched below.)
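A minimal numerical sketch of the example above (illustrative only; the probabilities 0.51/0.49 and the choice of disagreement point are assumptions, not part of the comment):

```python
# Assumed numbers: P(U1) = 0.51, P(U2) = 0.49; outcomes are 1 and 2.
U1 = {1: 1.0, 2: 2.0}
U2 = {1: 2.0, 2: 1.0}
p1, p2 = 0.51, 0.49

# EU maximization: pick the single outcome with the highest probability-weighted utility.
eu = {o: p1 * U1[o] + p2 * U2[o] for o in (1, 2)}
print(eu, max(eu, key=eu.get))  # {1: 1.49, 2: 1.51} -> outcome 2; U1 gets everything it wants

# Nash bargaining over lotteries: pick q = P(outcome 2) maximizing the product of each
# hypothesis's gain over its disagreement utility (assumed: each hypothesis's worst outcome, 1.0).
def nash_product(q):
    gain1 = (1 - q) * U1[1] + q * U1[2] - 1.0   # = q
    gain2 = (1 - q) * U2[1] + q * U2[2] - 1.0   # = 1 - q
    return gain1 * gain2

best_q = max((i / 1000 for i in range(1001)), key=nash_product)
print(best_q)  # 0.5 -- the fair coin flip between the two hypotheses' preferred outcomes
```

The grid search lands on q = 0.5, matching the coin-flip description, whereas EU maximization gives the slightly-less-probable U2 nothing at all.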
I buy that bargaining-like behavior fits better flavor-wise with shard theory, but I don’t currently buy that an implication of shard theory is that deep-NN RL will display bargaining-like behavior by default, if that’s part of your intended implication?
[1] We were discussing wireheading, not inner optimization, but a wireheading agent that hides this in order to do a treacherous turn later is a deceptive inner optimizer. I’m not going to defend the inner/outer distinction here; “is wireheading an inner alignment problem, or an outer alignment problem?” is a problematic question.
This is a crux for me; if we had a simplicity metric that we had good reason to believe filtered out training-process-modeling, I would see the deceptive-inner-optimizer concern as basically solved (modulo the solution being compatible with other things we want)
This seems stronger than the claim I’m making. I’m not saying that the agent won’t deceptively model us and the training process at some point. I’m saying that the initial cognition will be developed out of, e.g., low-level features which get reliably pinged with lots of gradients and implemented in a few steps. Think edge detectors. And then the lower-level features will steer future training. And eventually the agent models us and its training process, and maybe deceives us. But not right away.
I don’t really see your argument here?
You can make the “some subnetwork just models its training process and cares about getting low loss, and then gets promoted” argument against literally any loss function, even some hypothetical “perfect” one (which, TBC, I think is a mistaken way of thinking). If I buy this argument, it seems like a whole lot of alignment dreams immediately burst into flame. No loss function would be safe. This conclusion, of course, does not decrease in the slightest the credibility of the argument. But I don’t perceive you to believe this implication.
Anyways, here’s another reason I disagree quite strongly with the argument: I perceive it to strongly privilege the training-process-modeling hypothesis. There is an extreme range of motivations and inner cognitive structures which can be upweighted by the small number of gradients observed early in training.
The network doesn’t “observe” more than that, initially. The network just gets updated by the loss function. It doesn’t even know what the loss function is. It can’t even see the gradients. It can’t even remember the past training data, except insofar as the episode is retained in its recurrent weights. The EG CoT finetuning will just etch certain kinds of cognition into the network.
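A minimal sketch of the kind of update being described (my illustration, with made-up shapes and a stand-in reward rule; a generic REINFORCE-style step, not the commenter's code): the reward computation and the loss live entirely outside the network, which only ever maps observations to logits and then has its weights nudged.

```python
import torch
import torch.nn as nn

# Toy "policy" over 10 tokens, updated with a REINFORCE-style surrogate loss.
policy = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 10))
opt = torch.optim.SGD(policy.parameters(), lr=1e-3)

obs = torch.randn(1, 16)                         # the only thing the policy ever "sees"
dist = torch.distributions.Categorical(logits=policy(obs))
action = dist.sample()

reward = float(action.item() == 3)               # stand-in reward rule, never an input to the policy
loss = -(dist.log_prob(action) * reward).mean()  # policy-gradient surrogate loss

opt.zero_grad()
loss.backward()                                  # gradients exist only in the autograd machinery
opt.step()                                       # ...and the update just etches changes into the weights
```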
I don’t currently buy that an implication of shard theory is that deep-NN RL will display bargaining-like behavior by default
Why not? Claims (left somewhat vague because I have to go soon, sorry for lack of concreteness):
RL develops a bunch of contextual decision-influences / shards
EG be near diamonds, make diamonds, play games
Agents learn to plan, and several shards get hooked into planning in order to “steer” it.
When the agent is choosing a plan, it is more likely to choose one which gets lots of logits from several shards; furthermore, many shards will bid against schemes where the agent plans to plan in a way which only activates a single shard (see the toy sketch after this list).
This is just me describing how I think the agent will make choices. I may be saying “shard” a lot but I’m just describing what I think happens within the trained model.
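A toy numerical sketch of that bidding picture (entirely illustrative; the shards, plans, and logit values are made up and are not a claim about shard theory's actual mechanics):

```python
import math

# Hypothetical shards, each contributing logits to candidate plans.
plans = ["admire_diamond", "build_factory", "wirehead"]
shard_bids = {
    "be-near-diamonds": {"admire_diamond": 2.0, "build_factory": 1.0, "wirehead": -2.0},
    "make-diamonds":    {"admire_diamond": 0.5, "build_factory": 2.5, "wirehead": -2.0},
    "reward-seeking":   {"admire_diamond": 0.0, "build_factory": 0.5, "wirehead": 3.0},
}

# Sum the shards' bids per plan, then softmax into choice probabilities.
totals = {p: sum(bids[p] for bids in shard_bids.values()) for p in plans}
z = sum(math.exp(t) for t in totals.values())
probs = {p: math.exp(t) / z for p, t in totals.items()}

print(max(probs, key=probs.get))  # "build_factory": supported by several shards, so it wins;
                                  # "wirehead" activates only one shard and gets bid down
```

In this toy version the single-shard wireheading plan loses not because any one shard overrules it, but because the other shards' negative bids swamp its lone supporter.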
You can make the “some subnetwork just models its training process and cares about getting low loss, and then gets promoted” argument against literally any loss function, even some hypothetical “perfect” one (which, TBC, I think is a mistaken way of thinking). If I buy this argument, it seems like a whole lot of alignment dreams immediately burst into flame. No loss function would be safe. This conclusion, of course, does not decrease in the slightest the credibility of the argument. But I don’t perceive you to believe this implication.
This might be the cleanest explanation for why alignment is so hard by default. Loss functions do not work, and reward functions don’t work well.
I also think this argument is bogus, to be clear.