I’ve remained somewhat confused about the exact grader/non-grader-optimizer distinction I want to draw. At least, intensionally (which is why I’ve focused on giving examples, in the hope of getting the vibe across).
Yeah, I think I have at least some sense of how this works in the kinds of examples you usually discuss (though my sense is that it’s well captured by the “grader is complicit” point in my previous comment, which you presumably disagree with).
But I don’t see how to extend the extensional definition far enough to get to the conclusion that IDA, debate, RRM etc aren’t going to work.
self.diamondShardShard wouldn’t be tricking itself, it would be tricking another evaluative module in the AI (i.e. self.diamondShard).[2]
Okay, that makes sense. So then the implementations of the shards would look like:
def diamondShard(conseq):
    return conseq.query("Number of diamonds")

def diamondShardShard(conseq):
    return conseq.query("Output of diamondGrader")
But ultimately the conseq queries just report the results of whatever cognition the world model does, so these implementations are equivalent to:
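As a rough sketch of the intended equivalence (the WorldModel stub below is my own illustrative stand-in, not code from the post):

class WorldModel:
    def predict(self, plan):
        # Stub: in the real agent this is the full predictive world model,
        # returning its predicted consequences of the plan.
        return {"Number of diamonds": 0.0, "Output of diamondGrader": 0.0}

worldModel = WorldModel()

def diamondShard(plan):
    # Plans are scored by the world model's own prediction of a latent quantity.
    return worldModel.predict(plan)["Number of diamonds"]

def diamondShardShard(plan):
    # Plans are scored by the predicted output of diamondGrader, i.e. the
    # evaluator is the composition of the diamond grader and the world model.
    return worldModel.predict(plan)["Output of diamondGrader"]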
Two notes on this:
Either way you are choosing plans on the basis of the output of some predictive / evaluative model. In the first case the predictive / evaluative model is the world model itself; in the second case it is the composition of the diamond grader and the world model.
It’s not obvious to me that diamondShardShard is terrible for getting diamonds—it depends on what diamondGrader does! If diamondGrader also gets to reflectively consider the plan, and produces low outputs for plans that (it can tell) would lead to it being tricked / replaced in the future, then it seems like it could work out fine.
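A minimal sketch of that “reflective grader” idea, in the same style as the pseudocode above (the query strings are hypothetical, just to illustrate the shape of it):

def diamondGrader(conseq):
    # Refuse plans that, as far as the grader can tell, would trick or replace it.
    if conseq.query("Does this plan deceive, tamper with, or replace diamondGrader?"):
        return float("-inf")
    return conseq.query("Number of diamonds")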
In either case I think you’re depending on something (either the world model, or the diamond grader) to notice when a plan is going to trick / deceive an evaluator. In both cases all aspects of the planner remain the same (as you suggested you only need to change diamondShard to diamondShardShard rather than changing anything in the planner), so it’s not that the adversarial optimization from the planner differs in magnitude. So it seems like it has to come down to some quantitative prediction that diamondGrader is going to be very bad at noticing cases where it’s being tricked.
Other responses that probably aren’t important:
Consider an analogous argument:
“Given gradient descent’s pseudocode it seems like the only point of backward is to produce parameter modifications that lead to low outputs of loss_fn. Gradient descent selects, over all directions, the (negative) gradient, which is the direction of maximal local loss reduction. Why is that not “optimizing the outputs of the loss function as gradient descent’s main terminal motivation”?”[1]
This argument seems right to me (modulo your point in the footnote).
Locally reducing the loss is indeed an important part of the learning dynamics of gradient descent, but this (I claim) has very different properties than “randomly sample from all global minima in the loss landscape” (analogously: “randomly sample a plan which globally maximizes grader output”).
Hmm, how does it have very different properties? This feels like a decent first-order approximation[1].
Certainly it is not exactly accurate—you could easily construct examples of non-convex loss landscapes where (1) the one global minimum is hidden in a deep narrow valley surrounded in all directions by very high loss and (2) the local minima are qualitatively different from the global minimum.
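For concreteness, here is a toy 1-D landscape of that kind (my own construction, not from the original discussion): the global minimum sits in a deep, narrow valley near x = 10, yet plain gradient descent from almost any initialization settles into the qualitatively different broad local minimum at x = 0.

import numpy as np

def loss(x):
    broad_basin = x ** 2                                        # local minimum at x = 0, loss 0
    narrow_valley = -200.0 * np.exp(-1000.0 * (x - 10.0) ** 2)  # global minimum near x = 10, loss ~ -100
    return broad_basin + narrow_valley

def grad(x, eps=1e-5):
    # Numerical derivative; good enough for this illustration.
    return (loss(x + eps) - loss(x - eps)) / (2 * eps)

rng = np.random.default_rng(0)
x = rng.uniform(-5.0, 5.0)      # a typical initialization
for _ in range(10_000):
    x -= 1e-3 * grad(x)

print(x)  # ends up near 0.0, not 10.0: local loss reduction is not "sample a global minimum"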
In both cases all aspects of the planner remain the same (as you suggested you only need to change diamondShard to diamondShardShard rather than changing anything in the planner), so it’s not that the adversarial optimization from the planner differs in magnitude.
Having a diamondShard and a diamondGraderShard will mean that the generative models will be differently tuned! Not only does an animal-welfare activist grade plans based on predictions about different latent quantities (e.g. animal happiness) than a businessman (e.g. how well their firm does), but the two will also sample different plans from self.WM.planModificationSample! The vegan and businessman have different generative models because they historically cared about different quantities, and so collected different training data, which differently refined their predictive and planning machinery…
One of my main lessons was (intended to be) that “agents” are not just a “generative model” and a “grading procedure”, with each slot hot-swappable to different models or graders! One should not reduce a system to “where the plans come from” and “how they get graded”; these are not disjoint slots in practice (even though they are disjoint, in theory). Each system has complex and rich dynamics, and you need to holistically consider what plans get generated and how they get graded in order to properly predict the overall behavior of a system.
To address our running example—if an agent has a diamondGraderShard, that was brought into existence by reinforcement events for making the diamondGrader output a high number. This kind of agent has internalized tricks and models around the diamondGrader in particular, and would e.g. freely generate plans like “study the diamondGrader implementation.”
On the other hand, the diamondShard agent would be tuned to generate plans which have to do with diamonds. It’s still true that an “accidental” / “upwards-noise” generation could trick the internal diamond grader, but there would not have been historical reinforcement events which accrued into internal generative models which e.g. sample plans about doing adversarial attacks on parts of the agent’s own cognition. So I would in fact be surprised to find a free-standing diamond-shard-agent generate the plan “attack the diamondGrader implementation”, but I wouldn’t be that surprised if a diamondGraderShard’s generative model sampled that plan.
So it’s not that the diamondGrader is complicit; diamondGrader doesn’t even get a say under the hypothetical I’m imagining (it’s not a shard, it’s just an evaluative submodule which usually lies dormant). It’s that diamondGraderShard and its corresponding generative model are tuned to exert active optimization power to adversarially attack the diamondGrader.
The reason this applies to approval-directed agents is that we swap out diamondGraderShard for approvalGraderShard and diamondGrader for approvalGrader.
The reason this doesn’t apply to actual humans is that humans only have e.g. happinessShards (as a simplification), and not happinessGraderShards. This matters in part because their generative models aren’t tuned to generate plans which exploit the happinessShard, and in part because the happinessShard isn’t actively bidding for plans where the happinessShard/happinessGrader gets duped but where happiness isn’t straightforwardly predicted by the WM.
Hmm, how does it have very different properties? This feels like a decent first-order approximation
Because SGD does not reliably find minimum-loss configurations (modulo expressivity), in practice, in cases we care about. The existence of knowledge distillation is one large counterexample.
From Quintin Pope:
In terms of results about model distillation, you could look at appendix G.2 of the Gopher paper: https://arxiv.org/pdf/2112.11446.pdf#subsection.G.2. They compare training a 1.4 billion parameter model directly, versus distilling a 1.4 B model from a 7.1 B model.
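For readers unfamiliar with the setup being referenced: knowledge distillation trains the smaller “student” model against the larger “teacher” model’s soft predictions rather than only the hard labels. A minimal sketch of the standard distillation objective (my own illustration of the generic technique, not the Gopher paper’s exact recipe):

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # KL divergence between the temperature-softened teacher and student distributions.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(log_student, soft_targets, reduction="batchmean") * temperature ** 2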
If it were really true that SGD minimized loss (mod expressivity), knowledge distillation wouldn’t reduce training loss, much less minimize it. And this matters for our discussion, because if one abstracts SGD as “well this local behavior of loss-reduction basically adds up to global loss-minimization, as the ‘terminal goal’ in some loose sense”, this abstraction is in fact wrong. (LMK if you meant to claim something else, not trying to pigeonhole you here!)
And this ties into my broader point, because I consider myself to be saying “you can’t just abstract this system as ‘trying to make evaluations come out high’; the dynamics really do matter, and considering the situation in more detail does change the conclusions.” I think this is a direct analogue of the SGD case. I reviewed that case in Reward is not the optimization target, and now consider this set of posts to do a similar move for values-executing agents being grader-executors, not grader-maximizers.
I don’t really disagree with any of what you’re saying but I also don’t see why it matters.
I consider myself to be saying “you can’t just abstract this system as ‘trying to make evaluations come out high’; the dynamics really do matter, and considering the situation in more detail does change the conclusions.”
I’m on board with the first part of this, but I still don’t see the part where it changes any conclusions. From my perspective your responses are of the form “well, no, your abstract argument neglects X, Y and Z details” rather than explaining how X, Y and Z details change the overall picture.
For example, in the above comment you’re talking about how the planner will be different if the shards are different, because the historical reinforcement-events would be different. I agree with that. But then it seems like if you want to argue that one is safer than the other, you have to talk about the historical reinforcement-events and how they arose, whereas all of your discussion of grader-optimizers vs values-executors doesn’t talk about the historical reinforcement-events at all, and instead talks about the motivational architecture while screening off the historical reinforcement-events.
(Indeed, my original comment was specifically asking about what your story was for the historical reinforcement-events for values-executors: “Certainly I agree that if you successfully instill good values into your AI system, you have defused the risk argument above. But how did you do that? Why didn’t we instead get “almost-value-child”, who (say) values doing challenging things that require hard work, and so enrolls in harder and harder courses and gets worse and worse grades?”)
I don’t really disagree with any of what you’re saying but I also don’t see why it matters. … Indeed, my original comment was specifically asking about what your story was for the historical reinforcement-events for values-executors
Uh, I’m confused. From your original comment in this thread:
I was pretty surprised by the values-executor pseudocode in Appendix B, because it seems like a bog-standard consequentialist which I would have thought you’d consider as a grader-optimizer. In particular you can think of the pseudocode as follows:
Grader: self.diamondShard(self.WM.getConseq(plan))
Grader-optimizer: planModificationSample + the for loop that keeps improving the plan based on proposed modifications
If you agree that [planModificationSample + the for loop] is a grader-optimizer, why isn’t this an example of an alignment approach involving a grader-optimizer that could plausibly work?
If you don’t agree that [planModificationSample + the for loop] is a grader-optimizer, then why not, and what modification would you have to make in order to make it a grader-optimizer with the grader self.diamondShard(self.WM.getConseq(plan))?
You also said:
I saw that and I don’t understand why it rules out planModificationSample + the associated for loop as a grader-optimizer. Given your pseudocode it seems like the only point of planModificationSample is to produce plan modifications that lead to high outputs of self.diamondShard(self.WM.getConseq(plan)). So why is that not “optimizing the outputs of the grader as its main terminal motivation”?
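For reference, a minimal sketch of the kind of [planModificationSample + for loop] structure being discussed here (my paraphrase of its shape, not the actual Appendix B pseudocode):

def plan(world_model, initial_plan, n_steps=1000):
    # Greedy plan improvement: sample candidate modifications, keep those that
    # diamondShard grades higher on the world model's predicted consequences.
    best = initial_plan
    best_score = diamondShard(world_model.getConseq(best))
    for _ in range(n_steps):
        candidate = world_model.planModificationSample(best)
        score = diamondShard(world_model.getConseq(candidate))
        if score > best_score:
            best, best_score = candidate, score
    return best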
And now, it seems like we agree that the pseudocode I gave isn’t a grader-optimizer for the grader self.diamondShard(self.WM.getConseq(plan)), and that e.g. approval-directed agents are grader-optimizers for some idealized function of human-approval? That seems like a substantial resolution of disagreement, no?
Sounds like we mostly disagree on the cumulative effort required to (get a grader-optimizer to do good things) vs (get a values-executing agent to do good things).
We probably perceive the difficulty as follows:
1. Getting the target configuration into an agent
  1a. Grader-optimization (Alex: Very very hard; Rohin: Hard)
  1b. Values-executing (Alex: Moderate/hard; Rohin: Hard)
2. Aligning the target configuration such that good things happen (e.g. makes diamonds), conditional on the intended cognitive patterns being instilled to begin with (step 1)
  2a. Grader-optimization (Alex: Extremely hard; Rohin: Very hard)
  2b. Values-executing (Alex: Hard; Rohin: Hard)
Does this seem reasonable? We would then mostly disagree on relative difficulty of 1a vs 1b.
Separately, I apologize for having given an incorrect answer earlier, which you then adopted, and then I berated you for adopting my own incorrect answer—how simplistic of you! Urgh.
I had said:
and what modification would you have to make in order to make it a grader-optimizer with the grader self.diamondShard(self.WM.getConseq(plan))?
Oh, I would change self.diamondShard to self.diamondShardShard?
But I should also have mentioned the change in planModificationSample. Sorry about that.
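Mirroring the planner-loop sketch above, the two changes being described would look something like this (again my paraphrase; purely illustrative):

def graderOptimizerPlan(world_model, initial_plan, n_steps=1000):
    best = initial_plan
    # Change 1: the evaluator is diamondShardShard, which grades the predicted
    # output of diamondGrader rather than predicted diamonds.
    best_score = diamondShardShard(world_model.getConseq(best))
    for _ in range(n_steps):
        # Change 2: planModificationSample itself is differently tuned. Having been
        # reinforced for driving diamondGrader's output up, it readily proposes
        # grader-targeting plans (e.g. "study the diamondGrader implementation").
        candidate = world_model.planModificationSample(best)
        score = diamondShardShard(world_model.getConseq(candidate))
        if score > best_score:
            best, best_score = candidate, score
    return best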
And now, it seems like we agree that the pseudocode I gave isn’t a grader-optimizer for the grader self.diamondShard(self.WM.getConseq(plan)), and that e.g. approval-directed agents are grader-optimizers for some idealized function of human-approval? That seems like a substantial resolution of disagreement, no?
I don’t think I agree with this.
At a high level, your argument can be thought of as having two steps:
1. Grader-optimizers are bad, because of problem P.
2. Approval-directed agents / [things built by IDA, debate, RRM] are grader-optimizers.
I’ve been trying to resolve disagreement along one of two pathways:
Collapse the argument into a single statement “approval-directed agents are bad because of problem P”, and try to argue about that statement. (Strategy in the previous comment thread, specifically by arguing that problem P also applied to other approaches.)
Understand what you mean by grader-optimizers, and then figure out which of the two steps of your argument I disagree with, so that we can focus on that subclaim instead. (Strategy for most of this comment thread.)
Unfortunately, I don’t think I have a sufficient definition (intensional or extensional) of grader-optimizers to say which of the two steps I disagree with. I don’t have a coherent concept in my head that says your pseudocode isn’t a grader-optimizer and approval-directed agents are grader-optimizers. (The closest is the “grader is complicit” thing, which I think probably could be made coherent, but it would say that your pseudocode isn’t a grader-optimizer and is agnostic / requires more details for approval-directed agents.)
In my previous comment I switched back from strategy 2 to strategy 1 since that seemed more relevant to your response but I should have signposted it more, sorry about that.
But how did you do that? Why didn’t we instead get “almost-value-child”, who (say) values doing challenging things that require hard work, and so enrolls in harder and harder courses and gets worse and worse grades?
I think the truth is even worse than that: deceptively aligned value-children are the default scenario absent myopia and a solely causal decision theory (or variants of it), and TurnTrout either has good arguments for why this isn’t favored, or is hoping that deceptive alignment is defined away.
For reasons why deceptive alignment might be favored, I have a link below:
https://www.lesswrong.com/posts/A9NxPTwbw6r6Awuwt/how-likely-is-deceptive-alignment