Rohin Shah comments on Alignment allows “nonrobust” decision-influences and doesn’t require robust grading

Rohin Shah 15 Feb 2023 10:24 UTC
LW: 5 AF: 3
> And now, it seems like we agree that the pseudocode I gave isn’t a grader-optimizer for the grader self.diamondShard(self.WM.getConseq(plan)), and that e.g. approval-directed agents are grader-optimizers for some idealized function of human-approval? That seems like a substantial resolution of disagreement, no?
I don’t think I agree with this.
At a high level, your argument can be thought of as having two steps:
1. Grader-optimizers are bad, because of problem P.
2. Approval-directed agents / [things built by IDA, debate, RRM] are grader-optimizers.
I’ve been trying to resolve disagreement along one of two pathways:
1. Collapse the argument into a single statement “approval-directed agents are bad because of problem P”, and try to argue about that statement. (Strategy in the previous comment thread, specifically by arguing that problem P also applied to other approaches.)
2. Understand what you mean by grader-optimizers, and then figure out which of the two steps of your argument I disagree with, so that we can focus on that subclaim instead. (Strategy for most of this comment thread.)
Unfortunately, I don’t think I have a sufficient definition (intensional or extensional) of grader-optimizers to say which of the two steps I disagree with. I don’t have a coherent concept in my head that says your pseudocode isn’t a grader-optimizer and approval-directed agents are grader-optimizers. (The closest is the “grader is complicit” thing, which I think probably could be made coherent, but it would say that your pseudocode isn’t a grader-optimizer and is agnostic / requires more details for approval-directed agents.)
In my previous comment I switched back from strategy 2 to strategy 1, since that seemed more relevant to your response, but I should have signposted the switch more clearly. Sorry about that.