Thanks for the post; disagree with implications as I understand them.
My main complaint with this, as I understand it, is that builder/breaker encourages you to repeatedly condition on speculative dangers until you’re exploring a tiny and contorted part of solution-space (like worst-case robustness hopes, in my opinion). And then you can be totally out-of-touch from the reality of the problem.
Let me give an example.
If we are trying to teach an agent some class of behaviors/beliefs using feedback, the feedback may be consistent with what we are actually trying to teach, but it will also be consistent with precisely modeling the feedback process.
A model which understands the feedback process in detail, and identifies “maximizing good feedback” as the goal, will plausibly start trying to manipulate that feedback. This could mean wireheading, human manipulation, or other similar strategies. In the ELK document, the “human simulator” class of counterexamples represents this failure mode.
Since this is such a common counterexample, it seems like any robust plan for AI safety needs to establish confidently that this won’t occur.
I currently conjecture that an initialization trained on IID self-supervised- and imitation-learning data will not be modelling its own training process in detail, as opposed to knowing about training processes in general (especially if we didn’t censor that in its corpus). Then we reward the agent for approaching a diamond, but not for other objects. What motivational circuits get hooked up? This is a question of inductive biases and implementation details in ML, and of the activations in the network at the time of decision-making. It sure seems to me like if you can ensure the initial updates hook into a diamond abstraction and not a training-process abstraction, the future values will gradient-starve the training-process circuit, preventing it from ever forming. (A toy sketch of this kind of setup follows.)
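To make the setup concrete, here is a minimal toy sketch of that kind of pipeline: a policy initialized by imitation on scripted demonstrations, then finetuned with REINFORCE where reward is given only for reaching the diamond cell. The gridworld, the demonstrator, and the hyperparameters are invented purely for illustration; this is nothing like a realistic training run, just the shape of “basic RL off of an initialization.”

```python
# Purely illustrative: imitation-learned initialization, then sparse-reward RL finetuning.
import numpy as np

rng = np.random.default_rng(0)
SIZE = 5                                        # 5x5 gridworld
DIAMOND = (4, 4)                                # reward is given only for reaching this cell
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]    # up, down, left, right

def one_hot(state):
    v = np.zeros(SIZE * SIZE)
    v[state[0] * SIZE + state[1]] = 1.0
    return v

def step(state, a):
    r, c = state[0] + ACTIONS[a][0], state[1] + ACTIONS[a][1]
    return (min(max(r, 0), SIZE - 1), min(max(c, 0), SIZE - 1))

def policy(theta, state):                       # softmax policy, linear in a one-hot state
    logits = theta @ one_hot(state)
    p = np.exp(logits - logits.max())
    return p / p.sum()

def logp_grad(theta, state, a):                 # d/dtheta log pi(a | state)
    g = -policy(theta, state)[:, None] * one_hot(state)[None, :]
    g[a] += one_hot(state)
    return g

def demonstrator(state):                        # scripted demos: walk down, then right
    return 1 if state[0] < DIAMOND[0] else 3

theta = np.zeros((4, SIZE * SIZE))

# Phase 1: imitation pretraining on scripted demonstrations (the "initialization").
for _ in range(2000):
    s = (int(rng.integers(SIZE)), int(rng.integers(SIZE)))
    theta += 0.5 * logp_grad(theta, s, demonstrator(s))

# Phase 2: RL finetuning (REINFORCE); reward only for reaching the diamond.
for _ in range(500):
    s, trajectory, reward = (0, 0), [], 0.0
    for _ in range(20):
        a = int(rng.choice(4, p=policy(theta, s)))
        trajectory.append((s, a))
        s = step(s, a)
        if s == DIAMOND:
            reward = 1.0
            break
    for s_t, a_t in trajectory:
        theta += 0.1 * reward * logp_grad(theta, s_t, a_t)

print("P(step toward diamond from (0,0)):", policy(theta, (0, 0))[[1, 3]].sum())
```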
Does this argument fail? Maybe, yeah! Should I keep that in mind? Yes! But that doesn’t necessarily mean I should come up with an extremely complicated scheme to make feedback-modeling be suboptimal. Perhaps this relatively simple context (just doing basic RL off of an initialization) is the most natural and appropriate context in which to solve the issue. And I often perceive builder-breaker arguments as concluding something like “OK obviously this means basic RL won’t work” as opposed to “what parameter settings would make it be true or false that the AI will precisely model the feedback process?”
The former response conditions on a speculative danger in a way which assumes away the most promising solutions to the problem (IMO). And if you keep doing that, you get somewhere really weird (IMO). You seem to address a related point:
One might therefore protest: “Worst-case reasoning is not suitable for deconfusion work! We need a solid understanding of what’s going on, before we can do robust engineering.”
However, it’s also possible to use Builder/Breaker in non-worst-case (ie, probabilistic) reasoning. It’s just a matter of what kind of conclusion Builder tries to argue. If Builder argues a probabilistic conclusion, Builder will have to make probabilistic arguments.
But then you later say:
Point out implausible assumptions via plausible counterexamples.
In this case we ask: does the plausibility of the counterexample force the assumption to be less probable than we’d like our precious few assumptions to be?
But if we’re still in the process of deconfusing the problem, this seems to conflate the two roles. If game day were tomorrow and we had to propose a specific scheme, then we should indeed tally the probabilities. But if we’re assessing the promise of a given approach for which we can gather more information, then we don’t have to assume our current uncertainty. Like with the above, I think we can do empirical work today to substantially narrow the uncertainty on that kind of question.[1] That is, if our current uncertainty is large and reducible (like in my diamond-alignment story), breaker might push me to prematurely and inappropriately condition on not-that-proposal and start exploring maybe-weird, maybe-doomed parts of the solution space as I contort myself around the counterarguments.
Minor notes:
I could easily forgive someone for reading a bunch of AI alignment literature and thinking “AI alignment researchers seem confident that reinforcement learners will wirehead.” This confusion comes from interpreting Breaker-type statements as confident predictions.
I would mostly say “AI alignment researchers seemed-to-me to believe that either you capture what you want in a criterion, and then get the agent to optimize that criterion; or the agent fails to optimize that criterion, wants some other bad thing, and kills you instead.” That is: although they do not think that reward is, in fact or by default, the optimization target of the agent, they seem to think reward should embody what is right to do, in a deep, general, and robust sense.
Someone might try to come up with alignment plans which leverage the fact that RL agents wirehead, which imho would be approximately as doomed as a plan which assumed agents wouldn’t.
Really? Are the failures equiprobable? Independent of that, the first one seems totally doomed to me if you’re going to do anything with RL.
Take an imitation learning-initialized agent, do saliency + interpretability to locate its initial decision-relevant abstractions, see how RL finetuning hooks up the concepts and whether it accords with expectations about e.g. the NTK eigenspectrum.
My main complaint with this, as I understand it, is that builder/breaker encourages you to repeatedly condition on speculative dangers until you’re exploring a tiny and contorted part of solution-space (like worst-case robustness hopes, in my opinion). And then you can be totally out-of-touch from the reality of the problem.
On my understanding, the thing to do is something like heuristic search, where “expanding a node” means examining that possibility in more detail. The builder/breaker scheme helps to map out heuristic guesses about the value of different segments of the territory, and refine them to the point of certainty.
So when you say “encourages you to repeatedly condition on speculative dangers until you’re exploring a tiny and contorted part of solution-space”, my first thought is that you missed the part where Builder can respond to Breaker’s purported counterexamples with arguments such as the ones you suggest:
I currently conjecture that [...]
Does this argument fail? Maybe, yeah! Should I keep that in mind? Yes! But that doesn’t necessarily mean I should come up with an extremely complicated scheme to make feedback-modeling be suboptimal.
But, perhaps more plausibly, you didn’t miss that point, and are instead pointing to a bias you see in the reasoning process, a tendency to over-weigh counterexamples as if they were knockdown arguments, and forget to do the heuristic search thing where you go back and expand previously-unpromising-seeming nodes if you seem to be getting stuck in other places in the tree.
I’m tempted to conjecture that you should debug this as a flaw in how I apply builder/breaker style reasoning, as opposed to the reasoning scheme itself—why should builder/breaker be biased in this way?
You seem to address a related point:
One might therefore protest: “Worst-case reasoning is not suitable for deconfusion work! We need a solid understanding of what’s going on, before we can do robust engineering.”
However, it’s also possible to use Builder/Breaker in non-worst-case (ie, probabilistic) reasoning. It’s just a matter of what kind of conclusion Builder tries to argue. If Builder argues a probabilistic conclusion, Builder will have to make probabilistic arguments.
But then you later say:
Point out implausible assumptions via plausible counterexamples.
In this case we ask: does the plausibility of the counterexample force the assumption to be less probable than we’d like our precious few assumptions to be?
But if we’re still in the process of deconfusing the problem, this seems to conflate the two roles. If game day were tomorrow and we had to propose a specific scheme, then we should indeed tally the probabilities.
I admit that I do not yet understand your critique at all—what is being conflated?
Here is how I see it, in some detail, in the hopes that I might explicitly write down the mistaken reasoning step which you object to, in the world where there is such a step.
1. We have our current beliefs, and we can also refine those beliefs over time through observation and argument.
2. Sometimes it is appropriate to “go with your gut”, choosing the highest-expectation plan based on your current guesses. Sometimes it is appropriate to wait until you have a very well-argued plan, with very well-argued probabilities, which you don’t expect to easily move with a few observations or arguments. Sometimes something in the middle is appropriate.
3. AI safety is in the “be highly rigorous” category. This is mostly because we can easily imagine failure being so extreme that humanity in fact only gets one shot at this.
4. When the final goal is to put together such an argument, it makes a lot of sense to have a sub-process which illustrates holes in your reasoning by pointing out counterexamples. It makes a lot of sense to keep a (growing) list of counterexample types.
5. It being virtually impossible to achieve certainty that we’ll avert catastrophe, our arguments will necessarily include probabilistic assumptions and probabilistic arguments.
6. #5 does not imply, or excuse, heuristic informality in the final arguments; we want the final arguments to be well-specified, so that we know precisely what we have to assume and precisely what we get out of it.
7. #5 does, however, mean that we have an interest in plausible counterexamples, not just absolute worst-case reasoning. If I say (as Builder) “one of the coin-flips will come out heads”, as part of an informal-but-working-towards-formality argument, and Breaker says “counterexample, they all come out tails”, then the right thing to do is to assess the probability. If we’re flipping 10 coins, maybe Breaker’s counterexample is common enough to be unacceptably worrying, damning the specific proposal Builder was working on. If we’re flipping billions of coins, maybe Breaker’s counterexample is not probable enough to be worrying. (A quick probability check is sketched below.)
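To put rough numbers on the coin-flip example (nothing deep, just the arithmetic):

```python
# Probability of Breaker's "all tails" counterexample as the number of coins grows.
n_small, n_large = 10, 10**9
print(f"P(all tails | {n_small} coins) = {0.5 ** n_small:.4g}")   # ~1e-3: plausibly worrying
print(f"P(all tails | {n_large} coins) = {0.5 ** n_large:.4g}")   # underflows to 0.0: astronomically small
```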
This is the meaning of my comment about pointing out insufficiently plausible assumptions via plausible counterexamples, which you quote after “But then later you say:”, and of which you state that I seem to conflate two roles.
But if we’re assessing the promise of a given approach for which we can gather more information, then we don’t have to assume our current uncertainty. Like with the above, I think we can do empirical work today to substantially narrow the uncertainty on that kind of question.[1] That is, if our current uncertainty is large and reducible (like in my diamond-alignment story), breaker might push me to prematurely and inappropriately condition on not-that-proposal and start exploring maybe-weird, maybe-doomed parts of the solution space as I contort myself around the counterarguments.
I guess maybe your whole point is that the builder/breaker game focuses on constructing arguments, while in fact we can resolve some of our uncertainty through empirical means.
On my understanding, if Breaker uncovers an assumption which can be empirically tested, Builder’s next move in the game can be to go test that thing.
However, I admit to having a bias against empirical stuff like that, because I don’t especially see how to generalize observations made today to the highly capable systems of the future with high confidence.
WRT your example, I intuit that perhaps our disagreement has to do with …
I currently conjecture that an initialization from IID self-supervised- and imitation-learning data will not be modelling its own training process in detail,
I think it’s pretty sane to conjecture this for smaller-scale networks, but at some point as the NN gets large enough, the random subnetworks already instantiate the undesired hypothesis (along with the desired one), so they must be differentiated via learning (ie, “incentives”, ie, gradients which actually specifically point in the desired direction and away from the undesired direction).
I think this is a pretty general pattern—like, a lot of your beliefs fit with a picture where there’s a continuous (and relatively homogeneous) blob in mind-space connecting humans, current ML, and future highly capable systems. A lot of my caution stems from being unwilling to assume this, and skeptical that we can resolve the uncertainty there by empirical means. It’s hard to empirically figure out whether the landscape looks similar or very different over the next hill, by only checking things on this side of the hill.
Ideally, nothing at all; ie, don’t create powerful AGI, if that’s an option. This is usually the correct answer in similar cases. EG, if you (with no training in bridge design) have to deliver a bridge design that won’t fall over, drawing up blueprints in one day’s time, your best option is probably to not deliver any design. But of course we can arrange the thought-experiment such that it’s not an option.
Your comment here is great: high-effort, with lots of interpretive work. Thanks so much!
So when you say “encourages you to repeatedly condition on speculative dangers until you’re exploring a tiny and contorted part of solution-space”, my first thought is that you missed the part where Builder can respond to Breaker’s purported counterexamples with arguments such as the ones you suggest
Let me see how this would work.
Breaker: “The agent might wirehead because caring about physical reward is a high-reward policy on training.”
Builder: “Possible, but I think using reward signals is still the best way forward. I think the risk is relatively low due to the points made in ‘Reward is not the optimization target’.”
Breaker: “So are we assuming a policy gradient-like algorithm for the RL finetuning?”
Builder: “Sure.”
Breaker: “What if there’s a subnetwork which is a reward maximizer due to the lottery ticket hypothesis (LTH)?”
...
If that’s how it might go, then sure, this seems productive.
But, perhaps more plausibly, you didn’t miss that point, and are instead pointing to a bias you see in the reasoning process, a tendency to over-weigh counterexamples as if they were knockdown arguments, and forget to do the heuristic search thing where you go back and expand previously-unpromising-seeming nodes if you seem to be getting stuck in other places in the tree.
I’m tempted to conjecture that you should debug this as a flaw in how I apply builder/breaker style reasoning, as opposed to the reasoning scheme itself—why should builder/breaker be biased in this way?
I don’t think I was mentally distinguishing between “the idealized builder-breaker process” and “the process as TurnTrout believes it to be usually practiced.” I think you’re right, I should be critiquing the latter, but not necessarily how you in particular practice it; I don’t know much about that. I’m critiquing my own historical experience with the process as I imperfectly recall it.
I guess maybe your whole point is that the builder/breaker game focuses on constructing arguments, while in fact we can resolve some of our uncertainty through empirical means.
Yes, I think this was most of my point. Nice summary.
at some point as the NN gets large enough, the random subnetworks already instantiate the undesired hypothesis (along with the desired one), so they must be differentiated via learning (ie, “incentives”, ie, gradients which actually specifically point in the desired direction and away from the undesired direction).
I expect this argument to not hold, but I’m not yet good enough at ML theory to be super confident. Here are some intuitions. Even if it’s true that LTH probabilistically ensures the existence of the undesired subnetwork:
1. Gradient updates are pointed in the direction of most rapid loss-improvement per unit step. I expect most of the “distance covered” to be in non-training-process-modeling directions for simplicity reasons (I understand this argument to be a predecessor of the NTK arguments.)
2. You’re always going to have identifiability issues with respect to the loss signal. This could mean that either: (a) the argument is wrong, or (b) training-process-optimization is unavoidable, or (c) we can somehow make it not apply to networks of AGI size.
3. Even if the agent is motivated both by the training process and by the object-level desired hypothesis (since the gradients would reinforce both directions), on shard theory, that’s OK, an agent can value several things. The important part is that the desired shards cast shadows into both the agent’s immediate behavior and its reflectively stable (implicit) utility function.
Probably there’s some evidence from neural selectionism which is relevant here, but not sure which direction.
Seems like the most significant remaining disagreement (perhaps).
1. Gradient updates are pointed in the direction of most rapid loss-improvement per unit step. I expect most of the “distance covered” to be in non-training-process-modeling directions for simplicity reasons (I understand this argument to be a predecessor of the NTK arguments.)
So I am interpreting this argument as: even if LTH implies that a nascent/potential hypothesis is training-process-modeling (in an NTK & LTH sense), you expect the gradient to go against it (favoring non-training-modeling hypotheses) because non-training-process-modelers are simpler.
This is a crux for me; if we had a simplicity metric that we had good reason to believe filtered out training-process-modeling, I would see the deceptive-inner-optimizer concern as basically solved (modulo the solution being compatible with other things we want).[1]
I think Solomonoff-style program simplicity probably doesn’t do it; the simplest program fitting a bunch of data from our universe quite plausibly models our universe.
I think circuit-simplicity doesn’t do it; simple circuits which perform complex tasks are still more like algorithms than lookup tables, ie, still try to model the world in a pretty deep way.
I think Vanessa has some interesting ideas on how infra-Bayesian physicalism might help deal with inner optimizers, but on my limited understanding, I think not by ruling out training-process-modeling.
In other words, it seems to me like a tough argument to make, which on my understanding, no one has been able to make so far, despite trying; but, not an obviously wrong direction.
2. You’re always going to have identifiability issues with respect to the loss signal. This could mean that either: (a) the argument is wrong, or (b) training-process-optimization is unavoidable, or (c) we can somehow make it not apply to networks of AGI size.
I don’t really see your argument here? How does (identifiability issues → (argument is wrong ∨ training-process-optimization is unavoidable ∨ we can somehow make it not apply to networks of AGI size))?
In my personal estimation, shaping NNs in the right way is going to require loss functions which open up the black box of the NN, rather than only looking at outputs. In principle this could eliminate identifiability problems entirely (eg “here is the one correct network”), although I do not fully expect that.
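For concreteness, here is a toy sketch of what a loss function that “opens up the black box” could look like mechanically. The architecture, the internal target, and the 0.1 weighting are all invented; this is only meant to show the shape of such a loss, not to claim it addresses identifiability.

```python
# Toy sketch of a loss that "opens up the black box": besides the output error, it also
# penalizes an internal activation for deviating from some desired internal target.
import torch
import torch.nn as nn
import torch.nn.functional as F

net = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
x, y = torch.randn(32, 8), torch.randn(32, 1)
desired_hidden = torch.zeros(32, 16)      # stand-in for "what the internals should look like"

hidden = net[1](net[0](x))                # read internal activations, not just the output
output = net[2](hidden)

loss = F.mse_loss(output, y) + 0.1 * F.mse_loss(hidden, desired_hidden)
loss.backward()                           # the gradient now shapes the internals directly
```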
A ‘good prior’ would also solve the identifiability problem well enough. (eg, if we could be confident that a prior promotes non-deceptive hypotheses over similar deceptive hypotheses.)
But, none of this is necessarily interfacing with your intended argument.
3. Even if the agent is motivated both by the training process and by the object-level desired hypothesis (since the gradients would reinforce both directions), on shard theory, that’s OK, an agent can value several things. The important part is that the desired shards cast shadows into both the agent’s immediate behavior and its reflectively stable (implicit) utility function.
Here’s how I think of this part. A naïve EU-maximizing agent, uncertain between two hypotheses about what’s valuable, might easily decide to throw one under the bus for the other. Wireheading is analogous to a utility monster here—something that the agent is, on balance, justified to throw approximately all its resources at, basically neglecting everything else.
A bargaining-based agent, on the other hand, can “value several things” in a more significant sense. Simple example:
U1(1) = 1, U1(2) = 2, U2(1) = 2, U2(2) = 1
U1 and U2 are almost equally probable hypotheses about what to value.
EU maximization maximizes whichever happens to be slightly more probable.
Nash bargaining instead selects a 50-50 split between the two, flipping a coin to fairly divide outcomes.
In order to mitigate risks due to bad hypotheses, we want more “bargaining-like” behavior, rather than “EU-like” behavior.
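Here is a tiny numerical check of this example (a sketch only; it assumes the disagreement point for Nash bargaining is each hypothesis’s worst outcome, utility 1, which the example above leaves unspecified):

```python
# Toy check: two value hypotheses over outcomes {1, 2}, with hypothesis U1 slightly more probable.
p1, p2 = 0.51, 0.49                         # "almost equally probable"
U1 = {1: 1, 2: 2}                           # U1(1)=1, U1(2)=2
U2 = {1: 2, 2: 1}                           # U2(1)=2, U2(2)=1

# Naive EU maximization: pick the single outcome with the highest expected utility.
eu = {o: p1 * U1[o] + p2 * U2[o] for o in (1, 2)}
print("EU maximizer picks outcome", max(eu, key=eu.get))          # outcome 2: U1 gets everything

# Nash bargaining over lotteries: pick P(outcome 2) = q maximizing the product of
# each hypothesis's gain over its worst outcome (the assumed disagreement point).
qs = [q / 100 for q in range(101)]
gain = lambda q: (((1 - q) * U1[1] + q * U1[2]) - 1) * (((1 - q) * U2[1] + q * U2[2]) - 1)
print("Nash bargaining picks P(outcome 2) =", max(qs, key=gain))  # 0.5: a fair coin flip
```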
I buy that bargaining-like behavior fits better flavor-wise with shard theory, but I don’t currently buy that an implication of shard theory is that deep-NN RL will display bargaining-like behavior by default, if that’s part of your intended implication?
We were discussing wireheading, not inner optimization, but a wireheading agent that hides this in order to do a treacherous turn later is a deceptive inner optimizer. I’m not going to defend the inner/outer distinction here; “is wireheading an inner alignment problem, or an outer alignment problem?” is a problematic question.
This is a crux for me; if we had a simplicity metric that we had good reason to believe filtered out training-process-modeling, I would see the deceptive-inner-optimizer concern as basically solved (modulo the solution being compatible with other things we want)
This seems stronger than the claim I’m making. I’m not saying that the agent won’t deceptively model us and the training process at some point. I’m saying that the initial cognition will be developed out of, e.g., low-level features which get reliably pinged with lots of gradients and implemented in a few steps. Think edge detectors. And then the lower-level features will steer future training. And eventually the agent models us and its training process and maybe deceives us. But not right away.
I don’t really see your argument here?
You can make the “some subnetwork just models its training process and cares about getting low loss, and then gets promoted” argument against literally any loss function, even some hypothetical “perfect” one (which, TBC, I think is a mistaken way of thinking). If I buy this argument, it seems like a whole lot of alignment dreams immediately burst into flame. No loss function would be safe. This conclusion, of course, does not decrease the credibility of the argument in the slightest. But I don’t perceive you to believe this implication.
Anyways, here’s another reason I disagree quite strongly with the argument: I perceive it to strongly privilege the training-modeling hypothesis. There is an extremely wide range of motivations and inner cognitive structures which can be upweighted by the small number of gradients observed early in training.
The network doesn’t “observe” more than that, initially. The network just gets updated by the loss function. It doesn’t even know what the loss function is. It can’t even see the gradients. It can’t even remember the past training data, except insofar as the episode is retained in its recurrent weights. The finetuning (e.g. CoT finetuning) will just etch certain kinds of cognition into the network.
I don’t currently buy that an implication of shard theory is that deep-NN RL will display bargaining-like behavior by default
Why not? Claims (left somewhat vague because I have to go soon, sorry for lack of concreteness):
RL develops a bunch of contextual decision-influences / shards
EG be near diamonds, make diamonds, play games
Agents learn to plan, and several shards get hooked into planning in order to “steer” it.
When the agent is choosing a plan it is more likely to choose a plan which gets lots of logits from several shards, and furthermore many shards will bid against schemes where the agent plans to plan in a way which only activates a single shard.
This is just me describing how I think the agent will make choices. I may be saying “shard” a lot but I’m just describing what I think happens within the trained model.
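As a cartoon of that selection dynamic (the plans, shard names, and bid numbers below are all made up; this only illustrates “plans get chosen by aggregating bids from several shards, and single-shard plans get bid against”):

```python
# Toy: each plan gets "bids" (logits) from several contextual shards; the agent tends to
# execute the plan with the highest total. Shards bid against plans that cut them out
# entirely (negative bids), so a plan that maxes out one shard can still lose.
plans = {
    "walk to the diamond and admire it":         {"diamond": 3.0, "curiosity": 1.0, "games": 0.5},
    "self-modify into a pure diamond-maximizer": {"diamond": 6.0, "curiosity": -4.0, "games": -4.0},
    "play a game next to the diamond":           {"diamond": 2.0, "curiosity": 2.0, "games": 1.5},
}

def total_logit(bids):
    return sum(bids.values())

for name, bids in sorted(plans.items(), key=lambda kv: -total_logit(kv[1])):
    print(f"{total_logit(bids):5.1f}  {name}")
print("chosen:", max(plans, key=lambda name: total_logit(plans[name])))
```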
You can make the “some subnetwork just models its training process and cares about getting low loss, and then gets promoted” argument against literally any loss function, even some hypothetical “perfect” one (which, TBC, I think is a mistaken way of thinking). If I buy this argument, it seems like a whole lot of alignment dreams immediately burst into flame. No loss function would be safe. This conclusion, of course, does not decrease the credibility of the argument in the slightest. But I don’t perceive you to believe this implication.
This might be the cleanest explanation for why alignment is so hard by default. Loss functions do not work, and reward functions don’t work well.
I also think this argument is bogus, to be clear.