TurnTrout comments on Alignment allows “nonrobust” decision-influences and doesn’t require robust grading

TurnTrout Feb 13, 2023, 11:27 PM
LW: 2 AF: 2
0
AF
I don’t really disagree with any of what you’re saying but I also don’t see why it matters. … Indeed, my original comment was specifically asking about what your story was for the historical reinforcement-events for values-executors
Uh, I’m confused. From your original comment in this thread:
I was pretty surprised by the values-executor pseudocode in Appendix B, because it seems like a bog-standard consequentialist which I would have thought you’d consider as a grader-optimizer. In particular you can think of the pseudocode as follows:
Grader-optimizer: planModificationSample + the for loop that keeps improving the plan based on proposed modifications
Grader-being-optimized: self.diamondShard(self.WM.getConseq(plan))
So my question is:
- If you agree that [planModificationSample + the for loop] is a grader-optimizer, why isn’t this an example of an alignment approach involving a grader-optimizer that could plausibly work?
- If you don’t agree that [planModificationSample + the for loop] is a grader-optimizer, then why not, and what modification would you have to make in order to make it a grader-optimizer with the grader self.diamondShard(self.WM.getConseq(plan))?
You also said:
I saw that and I don’t understand why it rules out planModificationSample + the associated for loop as a grader-optimizer. Given your pseudocode it seems like the only point of planModificationSample is to produce plan modifications that lead to high outputs of self.diamondShard(self.WM.getConseq(plan)). So why is that not “optimizing the outputs of the grader as its main terminal motivation”?
And now, it seems like we agree that the pseudocode I gave isn’t a grader-optimizer for the grader self.diamondShard(self.WM.getConseq(plan)), and that e.g. approval-directed agents are grader-optimizers for some idealized function of human-approval? That seems like a substantial resolution of disagreement, no?
Sounds like we mostly disagree on cumulative effort to: (get a grader-optimizer to do good things) vs (get a values-executing agent to do good things).
We probably perceive the difficulty as follows:
1. Getting the target configuration into an agent
  1. Grader-optimization
    Alex: Very very hard
    Rohin: Hard
  2. Values-executing
    Alex: Moderate/hard
    Rohin: Hard
2. Aligning the target configuration such that good things happen (e.g. makes diamonds), conditional on the intended cognitive patterns being instilled to begin with (step 1)
  1. Grader-optimization
    Alex: Extremely hard
    Rohin: Very hard
  2. Values-executing
    Alex: Hard
    Rohin: Hard
Does this seem reasonable? We would then mostly disagree on relative difficulty of 1a vs 1b.
Separately, I apologize for having given an incorrect answer earlier, which you then adopted, and then I berated you for adopting my own incorrect answer—how simplistic of you! Urgh.
I had said:
and what modification would you have to make in order to make it a grader-optimizer with the grader self.diamondShard(self.WM.getConseq(plan))?
Oh, I would change self.diamondShard to self.diamondShardShard?
But I should also have mentioned the change in planModificationSample. Sorry about that.
- Rohin Shah Feb 15, 2023, 10:24 AM
  LW: 5 AF: 3
  0
  AF Parent
  And now, it seems like we agree that the pseudocode I gave isn’t a grader-optimizer for the grader self.diamondShard(self.WM.getConseq(plan)), and that e.g. approval-directed agents are grader-optimizers for some idealized function of human-approval? That seems like a substantial resolution of disagreement, no?
  I don’t think I agree with this.
  At a high level, your argument can be thought of as having two steps:
  Grader-optimizers are bad, because of problem P.
  Approval-directed agents / [things built by IDA, debate, RRM] are grader-optimizers.
  I’ve been trying to resolve disagreement along one of two pathways:
  1. Collapse the argument into a single statement “approval-directed agents are bad because of problem P”, and try to argue about that statement. (Strategy in the previous comment thread, specifically by arguing that problem P also applied to other approaches.)
  2. Understand what you mean by grader-optimizers, and then figure out which of the two steps of your argument I disagree with, so that we can focus on that subclaim instead. (Strategy for most of this comment thread.)
  Unfortunately, I don’t think I have a sufficient definition (intensional or extensional) of grader-optimizers to say which of the two steps I disagree with. I don’t have a coherent concept in my head that says your pseudocode isn’t a grader-optimizer and approval-directed agents are grader-optimizers. (The closest is the “grader is complicit” thing, which I think probably could be made coherent, but it would say that your pseudocode isn’t a grader-optimizer and is agnostic / requires more details for approval-directed agents.)
  In my previous comment I switched back from strategy 2 to strategy 1 since that seemed more relevant to your response but I should have signposted it more, sorry about that.