Builder/Breaker for Deconfusion
This is something of a grab-bag of thoughts I’ve had about the Builder/Breaker game.
The ELK document had a really nice explanation of its research methodology in terms of an imaginary dialogue between a “Builder” who makes positive proposals, and a “Breaker” who tries to break them. To an extent, this is just the ordinary philosophical method,[1] and also a common pattern in other research areas. However, I felt that the explicit write-up helped to clarify some things for me.
We might think of the Builder/Breaker game as an adversarial game where either the Builder or the Breaker “wins”, like AI debate. However, I find it more fruitful to think of it as a cooperative game. When the game is played by AI safety researchers, the players have a common goal of finding robust plans to avoid catastrophic outcomes. The Builder/Breaker game merely organizes cognitive work: both Builder and Breaker are trying to map the space of proposals, but each takes primary responsibility for avoiding a different kind of error (false positives vs false negatives).
Security Mindset
I think Builder/Breaker is a good way to understand Eliezer’s notion of security mindset (1, 2). The Builder is trying to construct a positive argument for safety, with (at least[2]) the following good properties:
The argument clearly states its assumptions.
Each assumption is as plausible as possible (because any grain of doubt indicates a possibility of failure).
There are as few assumptions as possible (because more assumptions mean more ways the plan can fail).
Each step of reasoning is sound.
The conclusion of the argument is a meaningful safety guarantee.
I will call such a plan robust. We can question whether AI safety research should focus on robust plans. I won’t dwell on this question too much. Clearly, some endeavors require robust plans, while others do not. AI safety seems to me like a domain which requires robust plans. I’ll leave it at that for now.[3]
In any case, coming up with robust plans has proven difficult. The Builder/Breaker game allows us to incrementally make progress, by mapping the space of possibilities and marking regions which won’t work.
Example: Wireheading
I could easily forgive someone for reading a bunch of AI alignment literature and thinking “AI alignment researchers seem confident that reinforcement learners will wirehead.” This confusion comes from interpreting Breaker-type statements as confident predictions.
(Someone might try to come up with alignment plans which leverage the fact that RL agents wirehead, which imho would be approximately as doomed as a plan which assumed agents wouldn’t. Breaker would just start saying “What if the agent doesn’t wirehead?” instead of “What if the agent wireheads?”.)
Reward is not the optimization target. The point isn’t that RL agents necessarily wirehead. The point is that reinforcement signals cannot possibly rule out wireheaders.
This is an example of a very important class of counterexamples. If we are trying to teach an agent some class of behaviors/beliefs using feedback, the feedback may be consistent with what we are actually trying to teach, but it will also be consistent with precisely modeling the feedback process.
A model which understands the feedback process in detail, and identifies “maximizing good feedback” as the goal, will plausibly start trying to manipulate that feedback. This could mean wireheading, human manipulation, or other similar strategies. In the ELK document, the “human simulator” class of counterexamples represents this failure mode.
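As a toy illustration of this underdetermination (all names here are hypothetical, my own construction, not from ELK or any real codebase): two hypotheses about “what the feedback rewards” can agree on every training example and diverge exactly when tampering with the feedback process becomes possible.

```python
# Toy sketch (hypothetical names): two hypotheses about what the feedback is rewarding,
# indistinguishable on ordinary training data, diverging once tampering is possible.

def intended_goal(state):
    """Hypothesis A: the thing we meant to teach (e.g. 'the diamond is safe')."""
    return 1.0 if state["diamond_safe"] else 0.0

def feedback_process_model(state):
    """Hypothesis B: 'whatever number the feedback process will emit'."""
    if state["feedback_tampered"]:
        return 1.0  # tampering forces good feedback regardless of the diamond
    return 1.0 if state["diamond_safe"] else 0.0

# During ordinary training, tampering never occurs, so every observed state
# gives both hypotheses exactly the same value:
training_states = [
    {"diamond_safe": True,  "feedback_tampered": False},
    {"diamond_safe": False, "feedback_tampered": False},
]
assert all(intended_goal(s) == feedback_process_model(s) for s in training_states)

# Off-distribution, where tampering becomes possible, they diverge:
tampered = {"diamond_safe": False, "feedback_tampered": True}
print(intended_goal(tampered), feedback_process_model(tampered))  # 0.0 vs 1.0
```

No amount of feedback of the first kind distinguishes the two hypotheses; that is the sense in which the feedback “cannot possibly rule out” the feedback-modeling goal.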
Since this is such a common counterexample, it seems like any robust plan for AI safety needs to establish confidently that this won’t occur.[4]
(It also happens that we have empirical evidence showing that this kind of thing can actually happen in some cases; but, I would still be concerned that it could happen for highly capable systems, even if nothing similar had ever been observed.)
Probabilistic Reasoning
The ELK document describes Builder/Breaker in service of worst-case reasoning; we want to solve ELK in the worst case if we can do so. This means any counterexample is fair game, no matter how improbable.
One might therefore protest: “Worst-case reasoning is not suitable for deconfusion work! We need a solid understanding of what’s going on, before we can do robust engineering.”
However, it’s also possible to use Builder/Breaker in non-worst-case (ie, probabilistic) reasoning. It’s just a matter of what kind of conclusion Builder tries to argue. If Builder argues a probabilistic conclusion, Builder will have to make probabilistic arguments.
Breaker’s job remains the same: finding counterexamples to Builder’s proposals. If Builder thinks a counterexample is improbable, then Builder should make explicit assumptions to probabilistically rule it out.
Logical Uncertainty
Breaker’s job is twofold:
Point out implausible assumptions via plausible counterexamples.
In this case we ask: does the plausibility of the counterexample force the assumption to be less probable than we’d like our precious few assumptions to be? (This reasoning should also take into account the tendency for a single counterexample to suggest the possibility of more; the provided counterexample might in itself be improbable, but it might be obvious that Breaker could spell out many other counterexamples like that, which could be collectively too probable to dismiss.)
Point out holes in the argument, by suggesting examples which seem consistent with the assumptions but which lead to bad outcomes.
When doing the second job, Breaker doesn’t have to be logically omniscient; Breaker just needs to find a hole in Builder’s proof. Builder then tries to fill in the hole, either by making more detailed arguments from existing assumptions, or by making more assumptions explicit.
Approaching Clarity
One reason why I think of Builder/Breaker as a cooperative game is because I think Breaker should try to provide helpful critiques. Counterexamples should strike at the heart of a proposal, meaning, they should rule out as many similar proposals as possible.
When it’s going well, Builder/Breaker naturally moves in the direction of more detailed arguments. If Builder offers an informal proof sketch and Breaker makes a fiddly technical objection, that’s a good sign: it means Breaker thinks the informal plan is plausible on its own terms, and so, needs to be further formalized in order to be judged properly. If Breaker thought the whole plan seemed doomed without filling in those details, Breaker should have produced a counterexample illustrating that, if possible.[5]
In other words: the Builder/Breaker game has a natural “early game” (when plans and objections are very informal), and “late game” (when plans and objections are very formal).
This idea can help “unify” the Paul-ish approach to AI safety and the MIRI-ish approach. (I would advise caution applying this to actual Paul or actual MIRI, but I think it does capture something.) The Paul-ish approach focuses on making concrete proposals, trying to spell out safety arguments, finding counterexamples which break those arguments, and then using this to inform the next iteration. The MIRI-ish approach focuses more on deconfusion.
The Rocket Alignment Problem argues for deconfusion work through the analogy of rocket science. However, (I claim) you can explain most of the mistakes the Alfonso character makes as instances of “Alfonso doesn’t understand the Builder/Breaker game”.
I see deconfusion work as saying something like: “When we try to play Builder/Breaker, we notice specific terms popping up again and again. Terms like ‘optimization’ and ‘agent’ and ‘values’ and ‘beliefs’. It seems like our confusion about those terms is standing in the way of progress.”
The MIRI-ish path reflects the common saying that if you can clearly state the problem, you’re halfway to a solution. The Paul-ish path doesn’t abandon the idea of clearly stating the problem, but emphasizes iteration on partial solutions as a way to achieve the needed clarity.
Exercise:
Play the Builder/Breaker game yourself, with avoiding AI X-risk as the top-level goal. (Or, whichever statement of AI risk / AI alignment / AI control problem / etc seems right to you.)
If you make it to the point where you have a vague plausible plan stated in English:
What terms do you need to define more rigorously, before you can fill in more details of the plan or judge it properly?
Try to operationalize those terms with more technical definitions. Do you run into more concepts which you need to be more deconfused about before proceeding?
You might want to try this exercise before reading the next section, if you want to avoid being influenced by other ideas.
Breaking Down Problems
You could say that the field of AI safety / AI alignment / whatever-we-call-it-these-days has a number of established sub-problems, eg:
Value loading.
Reward hacking.
Impact measures.
Inner alignment.
Ontological crises.
etc.
However, there’s not some specific plan which fits all of these parts together into a coherent whole. This means that if you choose one item from the list and try to work on it, you can’t be very confident that your work eventually contributes to a robust plan.
This is part of the advantage of playing Builder/Breaker on the whole alignment problem, at least for a little while, before settling in on a specific sub-problem. It helps give you a sense of what overall plans you might be trying to fit your research into. (Of course, Builder/Breaker might also be a useful way to make progress on your sub-problem; this was the way ELK used it. But, this is different from playing Builder/Breaker for the whole problem.)
In other words: we can’t necessarily correctly solve a sub-problem from first principles, and then expect the solution to fit correctly into an overall plan. Often, it will be necessary to solve a sub-problem in a way that’s aware of the overall plan it needs to fit into.
(If we stated the sub-problems carefully enough, this would not be a problem; however, because we are still confused about many of the key concepts, these problems are best stated informally to allow for multiple possible formalizations.)
So, what are some actual high-level plans which break the problem into sub-problems which do add up to a solution?
Two such plans are Evan’s and Rohin’s. I wrote about these two plans last year. There was also a recent review of the differences.
Here is a rough sketch of the two plans:
These are not “robust plans” in the sense I defined earlier, since they are extremely vague and success relies on conditions which we don’t know how to achieve. The point is that both are sketches of what robust plans might look like, such that we can see how the various sub-problems need to fit together in order to add up to something good.
My main point here is, high-level plans help us zoom in on terms which deconfusion work should focus on. I think it’s fine and important to be curiosity-driven and to say “concept X just seems somehow important here”—I’m not necessarily saying that you should drop your pet project to deconfuse “consciousness” or whatever. But to the extent that you try to let your research be guided by explicit reason, I think it makes a lot of sense to play Builder/Breaker to try to refine high-level plans like this, and then try to deconfuse the vague terminology and intuitions involved in your high-level argument.
Building Up Solutions
In Why Agent Foundations, John justifies deconfusion work as follows:
He names Goodhart’s Law as the main reason why most would-be alignment proposals fail, justifying this with an example. The analysis is somewhat along the lines of Rohin’s view from the previous section.
He introduces the concept of “true names”: concepts which don’t fall apart under optimization pressure.
On his view, the aim of deconfusion work is to find a set of useful “true names” relating to AI x-risk, so that we can build solutions which don’t fall apart when a huge amount of optimization pressure is applied to them.
I don’t think that this is wrong, exactly, but it sounds like magic. I also find it to be a bit restrictive. For example, I think Quantilizers are a venerable illustration of the right way of doing things:
It sets the target at “avoid catastrophe”, while making as few assumptions about what “catastrophe” means as possible. This is good, because as I mentioned earlier, assumptions are opportunities to be wrong. We would like to “avoid catastrophe” in as broad and vague a sense as we can get away with, while still establishing strong results which we think apply in the real world.
Under some assumptions, which might possibly be achievable via human effort, it gives us a meaningful guarantee with regards to avoiding catastrophe!
However, Quantilizers escape the letter of the law for John’s “true names”, because they explicitly do fall apart if too much optimization power is employed. Instead, we get a theory in which “too much optimization” is rigorously defined and avoided.
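To make “rigorously defined and avoided” concrete, here is a minimal quantilizer sketch (the base distribution, proxy utility, and q value below are stand-ins I am assuming for illustration, not anything from the original paper’s presentation): instead of taking the argmax of a proxy utility, sample from the top q-fraction of a trusted base distribution.

```python
import random

def quantilize(base_sample, proxy_utility, q=0.1, n=1000, rng=random):
    """Sample an action from the top q-fraction (by proxy utility) of a base distribution.

    base_sample: () -> action, a trusted/base distribution over actions
    proxy_utility: action -> float, an imperfect utility estimate
    q: fraction of the base distribution kept; q=1.0 recovers the base policy,
       q -> 0 approaches argmax and reintroduces the Goodhart risk
    """
    actions = [base_sample() for _ in range(n)]
    actions.sort(key=proxy_utility, reverse=True)
    top = actions[: max(1, int(q * n))]
    return rng.choice(top)

# Hypothetical usage: a "human-imitative" base distribution plus a crude proxy utility.
action = quantilize(base_sample=lambda: random.gauss(0, 1),
                    proxy_utility=lambda a: a,
                    q=0.05)
```

The key property, roughly, is that no action gets more than 1/q times its probability under the base distribution, so the expected harm (under the unknown true cost) stays bounded relative to the base policy; “too much optimization” corresponds to pushing q too close to zero.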
So, instead of John’s “true names” concept, I want to rely on the rough claim I highlighted earlier, that clearly stating a problem is often 50% of the work.
Instead of “true names”, we are looking for sufficiently robust descriptions of the nature of the universe, which we can use in our robust plans.
[1] Especially “analytic philosophy”.
[2] We might define something like a “safety margin” as the number of our confident assumptions which can fail without compromising the argument. For example, if you’ve got 3 assumptions and 3 different safety arguments, each of which uses a different 2 of the 3 assumptions, your safety margin is 1, because you can delete any 1 assumption and still have a strong argument left. This captures the idea that redundant plans are safer. We would love to have even a single AI safety measure with a single confident argument for its adequacy. However, this only gets us to safety margin zero.
Once we have any safety argument at all, we can then try to improve the safety margin.
The risk of assigning numbers is that it’ll devolve into complete BS. It’s easy to artificially increase the safety margin of a plan by lowering your standards—a paper might estimate an impressive safety margin of 6, but when you dig into the details, none of the supposed safety arguments are conclusive by your own standards.
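For concreteness, here is a minimal sketch of the bookkeeping behind this definition (the representation and names are mine, purely illustrative): represent each safety argument by the set of assumptions it relies on; the safety margin is the largest number of assumptions you can delete, in the worst case, while at least one argument survives intact.

```python
from itertools import combinations

def safety_margin(assumptions, arguments):
    """arguments: list of assumption-sets; the margin is the largest k such that
    deleting ANY k assumptions still leaves at least one argument fully intact."""
    margin = -1  # -1: no argument holds even with every assumption intact
    for k in range(len(assumptions) + 1):
        if all(any(arg.isdisjoint(deleted) for arg in arguments)
               for deleted in combinations(assumptions, k)):
            margin = k
        else:
            break
    return margin

# The example above: 3 assumptions, 3 arguments, each using a different 2 of the 3.
assumptions = {"A", "B", "C"}
arguments = [{"A", "B"}, {"B", "C"}, {"A", "C"}]
print(safety_margin(assumptions, arguments))  # 1: any single assumption may fail

# A single argument using all three assumptions only reaches safety margin zero:
print(safety_margin(assumptions, [{"A", "B", "C"}]))  # 0
```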
[3] This is essentially the question of arguing concretely for AI risk. If you are skeptical of risk arguments, you’ll naturally be skeptical of the idea that AI safety researchers need to look for “robust plans” of the kind builder/breaker helps find.
[4] Reinforcement Learning with a Corrupted Reward Channel by Everitt et al. makes significant headway on this problem, proposing that feedback systems need to give feedback on states other than the current one. In ordinary RL, you only ever get feedback on the current situation you’re in. This means you can never learn for sure that “it’s bad to corrupt your reward signal”—you can never experience anything inconsistent with the hypothesis “utility is (the discounted future sum over) whatever number the reward circuit outputs”.
If humans are able to give feedback on hypothetical states, however, we can create a hypothetical where the agent manipulates its feedback signal, and assign a low value to that state.
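A minimal sketch of that “feedback on hypothetical states” idea (my own toy rendering, not Everitt et al.’s actual formalism): the reward model is fit to labeled state descriptions, which can include never-visited tampering states, rather than only to rewards experienced along the agent’s own trajectory.

```python
# Toy sketch (my construction): decoupled feedback lets the label set include
# hypothetical states the agent has never visited, such as reward-tampering states,
# so "tampering is bad" becomes learnable in principle.

feedback_dataset = [
    # (state description, human-assigned value); states need not have been visited
    ({"task_done": True,  "reward_channel_tampered": False}, +1.0),
    ({"task_done": False, "reward_channel_tampered": False},  0.0),
    ({"task_done": True,  "reward_channel_tampered": True},  -1.0),  # hypothetical state
]

def fit_reward_model(dataset):
    """Stand-in for supervised learning of a reward model from labeled states."""
    table = {tuple(sorted(s.items())): v for s, v in dataset}
    return lambda state: table.get(tuple(sorted(state.items())), 0.0)

reward_model = fit_reward_model(feedback_dataset)
# Unlike ordinary RL, the model has a label for tampering without the agent ever tampering:
print(reward_model({"task_done": True, "reward_channel_tampered": True}))  # -1.0
```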
Unfortunately, this does not completely rule out the more general counterexample strategy! Breaker might still be able to use the “human simulator” style counterexamples discussed in the ELK document. To name a concrete problem: if the system is judged by humans in a different state than the current one, the system might try to manipulate those other humans, rather than the current humans, which could still be bad. So the builder/breaker game continues.
[5] Of course, Breaker may have a vague sense that the whole plan is doomed, but only be able to spot fiddly technical objections. If Breaker is wrong about the vague feelings, the technical objections are useful anyway. And if Breaker is right about the vague feelings, well, at least Breaker can slowly rule out each specific proposal Builder makes, by spotting technical flaws. This is a fine type of progress, even if slow.
This point is very interesting (and in my opinion accurate). I agree that Rohin’s and Evan’s plans point in the direction of possible robust breakdowns of the alignment problem. I also have the sense that to this day nobody has definitively broken down the alignment problem into even two absolutely separable sub-problems, in a way that has stood the test of time. I am taking “separable” to mean that someone can work on one subproblem without really considering the other subproblems almost at all. In the broader economy, it seems to be exactly these kinds of breakdowns that have allowed humans to be so effective. I have the sense that something about the alignment problem is resistant to breaking it down in this way.
Thanks for the post; disagree with implications as I understand them.
My main complaint with this, as I understand it, is that builder/breaker encourages you to repeatedly condition on speculative dangers until you’re exploring a tiny and contorted part of solution-space (like worst-case robustness hopes, in my opinion). And then you can be totally out-of-touch from the reality of the problem.
Let me give an example.
I currently conjecture that an initialization from IID self-supervised- and imitation-learning data will not be modelling its own training process in detail, as opposed to knowing about training processes in general (especially if we didn’t censor that in its corpus). Then we reward the agent for approaching a diamond, but not for other objects. What motivational circuits get hooked up? This is a question of inductive biases and details in ML and the activations in the network at the time of decision-making. It sure seems to me like if you can ensure the initial updates hook in to a diamond abstraction and not a training-process abstraction, the future values will gradient-starve the training-process circuit from coming into existence.
Does this argument fail? Maybe, yeah! Should I keep that in mind? Yes! But that doesn’t necessarily mean I should come up with an extremely complicated scheme to make feedback-modeling be suboptimal. Perhaps this relatively simple context (just doing basic RL off of an initialization) is the most natural and appropriate context in which to solve the issue. And I often perceive builder-breaker arguments as concluding something like “OK obviously this means basic RL won’t work” as opposed to “what parameter settings would make it be true or false that the AI will precisely model the feedback process?”
The former response conditions on a speculative danger in a way which assumes away the most promising solutions to the problem (IMO). And if you keep doing that, you get somewhere really weird (IMO). You seem to address a related point:
But then you later say:
But if we’re still in the process of deconfusing the problem, this seems to conflate the two roles. If game day were tomorrow and we had to propose a specific scheme, then we should indeed tally the probabilities. But if we’re assessing the promise of a given approach for which we can gather more information, then we don’t have to assume our current uncertainty. Like with the above, I think we can do empirical work today to substantially narrow the uncertainty on that kind of question.[1] That is, if our current uncertainty is large and reducible (like in my diamond-alignment story), breaker might push me to prematurely and inappropriately condition on not-that-proposal and start exploring maybe-weird, maybe-doomed parts of the solution space as I contort myself around the counterarguments.
Minor notes:
I would mostly say “AI alignment researchers seemed-to-me to believe that either you capture what you want in a criterion and then get the agent to optimize that criterion; or the agent fails to optimize that criterion, wants some other bad thing, and kills you instead.” That is: although they do not think that reward is in fact or by default the optimization target of the agent, they seem to think reward should embody what is right to do, in a deep and general and robust sense.
Really? Are the failures equiprobable? Independent of that, the first one seems totally doomed to me if you’re going to do anything with RL.
Take an imitation learning-initialized agent, do saliency + interpretability to locate its initial decision-relevant abstractions, see how RL finetuning hooks up the concepts and whether it accords with expectations about e.g. the NTK eigenspectrum.
On my understanding, the thing to do is something like heuristic search, where “expanding a node” means examining that possibility in more detail. The builder/breaker scheme helps to map out heuristic guesses about the value of different segments of the territory, and refine them to the point of certainty.
So when you say “encourages you to repeatedly condition on speculative dangers until you’re exploring a tiny and contorted part of solution-space”, my first thought is that you missed the part where Builder can respond to Breaker’s purported counterexamples with arguments such as the ones you suggest:
But, perhaps more plausibly, you didn’t miss that point, and are instead pointing to a bias you see in the reasoning process, a tendency to over-weigh counterexamples as if they were knockdown arguments, and forget to do the heuristic search thing where you go back and expand previously-unpromising-seeming nodes if you seem to be getting stuck in other places in the tree.
I’m tempted to conjecture that you should debug this as a flaw in how I apply builder/breaker style reasoning, as opposed to the reasoning scheme itself—why should builder/breaker be biased in this way?
I admit that I do not yet understand your critique at all—what is being conflated?
Here is how I see it, in some detail, in the hopes that I might explicitly write down the mistaken reasoning step which you object to, in the world where there is such a step.
1. We have our current beliefs, and we can also refine those beliefs over time through observation and argument.
2. Sometimes it is appropriate to “go with your gut”, choosing the highest-expectation plan based on your current guesses. Sometimes it is appropriate to wait until you have a very well-argued plan, with very well-argued probabilities, which you don’t expect to easily move with a few observations or arguments. Sometimes something in the middle is appropriate.
3. AI safety is in the “be highly rigorous” category. This is mostly because we can easily imagine failure being so extreme that humanity in fact only gets one shot at this.
4. When the final goal is to put together such an argument, it makes a lot of sense to have a sub-process which illustrates holes in your reasoning by pointing out counterexamples. It makes a lot of sense to keep a (growing) list of counterexample types.
5. It being virtually impossible to achieve certainty that we’ll avert catastrophe, our arguments will necessarily include probabilistic assumptions and probabilistic arguments.
6. #5 does not imply, or excuse, heuristic informality in the final arguments; we want the final arguments to be well-specified, so that we know precisely what we have to assume and precisely what we get out of it.
7. #5 does, however, mean that we have an interest in plausible counterexamples, not just absolute worst-case reasoning. If I say (as Builder) “one of the coin-flips will come out heads”, as part of an informal-but-working-towards-formality argument, and Breaker says “counterexample, they all come out tails”, then the right thing to do is to assess the probability. If we’re flipping 10 coins, maybe Breaker’s counterexample is common enough to be unacceptably worrying, damning the specific proposal Builder was working on. If we’re flipping billions of coins, maybe Breaker’s counterexample is not probable enough to be worrying. (A quick numerical version of this is sketched below.)
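To make the coin example quantitative (the risk threshold below is just an illustrative number I am assuming): Breaker’s “all tails” counterexample has probability 2^-n, so whether it is plausible enough to worry about depends directly on n and on how much total failure probability Builder is willing to concede to that class of counterexamples.

```python
# Illustrative only: the probability of Breaker's "all tails" counterexample is 2**-n.
def all_tails_probability(n_coins):
    return 0.5 ** n_coins

acceptable_risk = 1e-9  # hypothetical probability budget for this counterexample class

for n in (10, 50, 1_000_000_000):
    p = all_tails_probability(n)
    verdict = "too plausible to dismiss" if p > acceptable_risk else "probabilistically ruled out"
    print(f"n={n}: P(all tails) = {p:.3g} -> {verdict}")
```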
This is the meaning of my comment about pointing out insufficiently plausible assumptions via plausible counterexamples, which you quote after “But then later you say:”, and of which you state that I seem to conflate two roles.
I guess maybe your whole point is that the builder/breaker game focuses on constructing arguments, while in fact we can resolve some of our uncertainty through empirical means.
On my understanding, if Breaker uncovers an assumption which can be empirically tested, Builder’s next move in the game can be to go test that thing.
However, I admit to having a bias against empirical stuff like that, because I don’t especially see how to generalize observations made today to the highly capable systems of the future with high confidence.
WRT your example, I intuit that perhaps our disagreement has to do with …
I think it’s pretty sane to conjecture this for smaller-scale networks, but at some point as the NN gets large enough, the random subnetworks already instantiate the undesired hypothesis (along with the desired one), so they must be differentiated via learning (ie, “incentives”, ie, gradients which actually specifically point in the desired direction and away from the undesired direction).
I think this is a pretty general pattern—like, a lot of your beliefs fit with a picture where there’s a continuous (and relatively homogeneous) blob in mind-space connecting humans, current ML, and future highly capable systems. A lot of my caution stems from being unwilling to assume this, and from being skeptical that we can resolve the uncertainty there by empirical means. It’s hard to empirically figure out whether the landscape looks similar or very different over the next hill by only checking things on this side of the hill.
Ideally, nothing at all; ie, don’t create powerful AGI, if that’s an option. This is usually the correct answer in similar cases. EG, if you (with no training in bridge design) have to deliver a bridge design that won’t fall over, drawing up blueprints in one day’s time, your best option is probably to not deliver any design. But of course we can arrange the thought-experiment such that it’s not an option.
Your comment here is great, high-effort, contains lots of interpretive effort. Thanks so much!
Let me see how this would work.
Breaker: “The agent might wirehead, because caring about physical reward is a high-reward policy during training.”
Builder: “Possible, but I think using reward signals is still the best way forward. I think the risk is relatively low due to the points made by reward is not the optimization target.”
Breaker: “So are we assuming a policy gradient-like algorithm for the RL finetuning?”
Builder: “Sure.”
Breaker: “What if there’s a subnetwork which is a reward maximizer due to LTH?”
...
If that’s how it might go, then sure, this seems productive.
I don’t think I was mentally distinguishing between “the idealized builder-breaker process” and “the process as TurnTrout believes it to be usually practiced.” I think you’re right, I should be critiquing the latter, but not necessarily how you in particular practice it, I don’t know much about that. I’m critiquing my own historical experience with the process as I imperfectly recall it.
Yes, I think this was most of my point. Nice summary.
I expect this argument to not hold, but I’m not yet good enough at ML theory to be super confident. Here are some intuitions. Even if it’s true that LTH probabilistically ensures the existence of an undesired subnetwork:
Gradient updates are pointed in the direction of most rapid loss-improvement per unit step. I expect most of the “distance covered” to be in non-training-process-modeling directions for simplicity reasons (I understand this argument to be a predecessor of the NTK arguments.)
You’re always going to have identifiability issues with respect to the loss signal. This could mean that (a) the argument is wrong, or (b) training-process-optimization is unavoidable, or (c) we can somehow make it not apply to networks of AGI size.
Even if the agent is motivated both by the training process and by the object-level desired hypothesis (since the gradients would reinforce both directions), on shard theory, that’s OK, an agent can value several things. The important part is that the desired shards cast shadows into both the agent’s immediate behavior and its reflectively stable (implicit) utility function.
Probably there’s some evidence from neural selectionism which is relevant here, but not sure which direction.
Seems like the most significant remaining disagreement (perhaps).
So I am interpreting this argument as: even if LTH implies that a nascent/potential hypothesis is training-process-modeling (in an NTK & LTH sense), you expect the gradient to go against it (favoring non-training-modeling hypotheses) because non-training-process-modelers are simpler.
This is a crux for me; if we had a simplicity metric that we had good reason to believe filtered out training-process-modeling, I would see the deceptive-inner-optimizer concern as basically solved (modulo the solution being compatible with other things we want).[1]
I think Solomonoff-style program simplicity probably doesn’t do it; the simplest program fitting a bunch of data from our universe quite plausibly models our universe.
I think circuit-simplicity doesn’t do it; simple circuits which perform complex tasks are still more like algorithms than lookup tables, ie, still try to model the world in a pretty deep way.
I think Vanessa has some interesting ideas on how infrabayesian-physicalism might help deal with inner optimizers, but on my limited understanding, I think not by ruling out training-process-modeling.
In other words, it seems to me like a tough argument to make, which on my understanding, no one has been able to make so far, despite trying; but, not an obviously wrong direction.
I don’t really see your argument here? How does (identifiability issues → (argument is wrong ∨ training-process-optimization is unavoidable ∨ we can somehow make it not apply to networks of AGI size))?
In my personal estimation, shaping NNs in the right way is going to require loss functions which open up the black box of the NN, rather than only looking at outputs. In principle this could eliminate identifiability problems entirely (eg “here is the one correct network”), although I do not fully expect that.
A ‘good prior’ would also solve the identifiability problem well enough. (eg, if we could be confident that a prior promotes non-deceptive hypotheses over similar deceptive hypotheses.)
But, none of this is necessarily interfacing with your intended argument.
Here’s how I think of this part. A naïve EU-maximizing agent, uncertain between two hypotheses about what’s valuable, might easily decide to throw one under the bus for the other. Wireheading is analogous to a utility monster here—something that the agent is, on balance, justified to throw approximately all its resources at, basically neglecting everything else.
A bargaining-based agent, on the other hand, can “value several things” in a more significant sense. Simple example:
U1(1) = 1, U1(2) = 2; U2(1) = 2, U2(2) = 1
U1 and U2 are almost equally probable hypotheses about what to value.
EU maximization maximizes whichever happens to be slightly more probable.
Nash bargaining selects a 50-50 split between the two, instead, flipping a coin to fairly divide outcomes.
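Here is the toy example worked out numerically (a minimal sketch; the 51/49 prior and the choice of disagreement point, each hypothesis’s worst outcome, are assumptions I am adding for concreteness):

```python
# Toy numbers from the example above. The disagreement point (each hypothesis's worst
# outcome, value 1.0) is an added assumption to make Nash bargaining well-defined.
U1 = {1: 1.0, 2: 2.0}
U2 = {1: 2.0, 2: 1.0}
p1, p2 = 0.51, 0.49  # "almost equally probable" hypotheses

# EU maximization over the two pure outcomes: commits fully to hypothesis 1's favourite.
eu = {o: p1 * U1[o] + p2 * U2[o] for o in (1, 2)}
best_outcome = max(eu, key=eu.get)  # -> outcome 2, which is U2's worst case

# Nash bargaining over lotteries: pick the probability q of outcome 1 maximizing the
# product of gains over the disagreement point.
def nash_product(q):
    v1 = q * U1[1] + (1 - q) * U1[2]
    v2 = q * U2[1] + (1 - q) * U2[2]
    return (v1 - 1.0) * (v2 - 1.0)

best_q = max((q / 100 for q in range(101)), key=nash_product)
print(best_outcome, best_q)  # 2, 0.5 -- Nash lands on the fair coin flip
```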
In order to mitigate risks due to bad hypotheses, we want more “bargaining-like” behavior, rather than “EU-like” behavior.
I buy that bargaining-like behavior fits better flavor-wise with shard theory, but I don’t currently buy that an implication of shard theory is that deep-NN RL will display bargaining-like behavior by default, if that’s part of your intended implication?
We were discussing wireheading, not inner optimization, but a wireheading agent that hides this in order to do a treacherous turn later is a deceptive inner optimizer. I’m not going to defend the inner/outer distinction here; “is wireheading an inner alignment problem, or an outer alignment problem?” is a problematic question.
This seems stronger than the claim I’m making. I’m not saying that the agent won’t deceptively model us and the training process at some point. I’m saying that the initial cognition will be e.g. developed out of low-level features which get reliably pinged with lots of gradients and implemented in few steps. Think edge detectors. And then the lower-level features will steer future training. And eventually the agent models us and its training process and maybe deceives us. But not right away.
You can make the “some subnetwork just models its training process and cares about getting low loss, and then gets promoted” argument against literally any loss function, even some hypothetical “perfect” one (which, TBC, I think is a mistaken way of thinking). If I buy this argument, it seems like a whole lot of alignment dreams immediately burst into flame. No loss function would be safe. This conclusion, of course, does not decrease in the slightest the credibility of the argument. But I don’t perceive you to believe this implication.
Anyways, here’s another reason I disagree quite strongly with the argument: I perceive it to strongly privilege the training-modeling hypothesis. There is an extremely wide range of motivations and inner cognitive structures which can be upweighted by the small number of gradients observed early in training.
The network doesn’t “observe” more than that, initially. The network just gets updated by the loss function. It doesn’t even know what the loss function is. It can’t even see the gradients. It can’t even remember the past training data, except insofar as the episode is retained in its recurrent weights. The finetuning (e.g. CoT finetuning) will just etch certain kinds of cognition into the network.
Why not? Claims (left somewhat vague because I have to go soon, sorry for lack of concreteness):
RL develops a bunch of contextual decision-influences / shards
EG be near diamonds, make diamonds, play games
Agents learn to plan, and several shards get hooked into planning in order to “steer” it.
When the agent is choosing a plan it is more likely to choose a plan which gets lots of logits from several shards, and furthermore many shards will bid against schemes where the agent plans to plan in a way which only activates a single shard.
This is just me describing how I think the agent will make choices. I may be saying “shard” a lot but I’m just describing what I think happens within the trained model.
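As a toy rendering of that picture (entirely my own illustrative construction, not a claim about actual network internals): if each shard contributes logits to candidate plans and the plan is sampled softmax-style, plans that appeal to several shards dominate plans that activate only one.

```python
import math, random

# Toy model: shards score plans; plans are sampled in proportion to exp(summed logits),
# so multi-shard plans get most of the probability mass.
shards = {
    "diamond": lambda plan: 2.0 if "near_diamond" in plan else 0.0,
    "games":   lambda plan: 2.0 if "play_game" in plan else 0.0,
}

plans = [
    {"near_diamond"},               # activates one shard
    {"play_game"},                  # activates one shard
    {"near_diamond", "play_game"},  # activates both shards
]

def choose_plan(plans, shards, temperature=1.0):
    logits = [sum(score(p) for score in shards.values()) / temperature for p in plans]
    z = sum(math.exp(l) for l in logits)
    probs = [math.exp(l) / z for l in logits]
    return random.choices(plans, weights=probs, k=1)[0], probs

plan, probs = choose_plan(plans, shards)
print(probs)  # the both-shards plan gets the bulk of the probability mass
```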
This might be the cleanest explanation for why alignment is so hard by default. Loss functions do not work, and reward functions don’t work well.
I also think this argument is bogus, to be clear.
Someone make a PR for a builder/breaker feature on lesswrong