Steven Byrnes comments on Alignment allows “nonrobust” decision-influences and doesn’t require robust grading

Steven Byrnes 30 Nov 2022 18:07 UTC
LW: 7 AF: 4
−1
AF
I mostly just want to repeat my comment on your last post.
I think your opposition to graders is really opposition to simple graders, that are never updated, that can’t account for non-consequentialist aspects of plans (e.g. “sketchiness”), and that are facing an extremely large search space of possibilities including out-of-the-box ones. And I think your value-vs-evaluation distinction is kinda different from graders-vs-non-graders.
So…
- For “nonrobust decision-influences can be OK”—I don’t think that’s a unique feature of not-having-a-grader. If there is a grader, but the grader is of the form “Here are a billion patterns with corresponding grades, try to pattern-match your plan to all billion of those patterns and do a weighted average”, then probably you can throw out a few of those billion patterns and the grader will still work the same.
- For “values steer optimization; they are not optimized against”—I think you’re comparing apples and oranges. Let’s say I’m a human. I want “diamonds (as understood by me)”. So I attempt to program an AGI to want “diamonds (as understood by me)”.
  - In the framework you advocate, the AGI winds up “directly” “valuing” “diamonds (as understood by the AGI)”. And this can go wrong because “diamonds (as understood by me)” may differ from “diamonds (as understood by the AGI)”. If that’s what happens, then from my perspective, the AGI “was looking for, and found, an edge-case exploit”. From the AGI’s own perspective, all it was doing was “finding an awesome out-of-the-box way to make lots of diamonds”.
  - Whereas in the grader-optimizer framework, I delegate to a grader, and the AGI does the things that increase “diamonds (as understood by the grader)”. And this can go wrong because “diamonds (as understood by me)” may differ from “diamonds (as understood by the grader)”. From my perspective, the AGI is again “looking for edge-case exploits”.
  - It’s really the same problem, but in the first case you can temporarily forget the fact that I, the programmer, exist, and then there seems not to be any conflict / exploits / optimizing-against in the system. But the conflict is still there! It’s just off-stage.
- For “Since values steer cognition, reflective agents try to avoid adversarial inputs to their own values”—Again, first of all, it’s the AGI itself that is deciding what is or isn’t adversarial, and the things that are adversarial from the perspective of the programmer might be just a great clever out-of-the-box idea from the perspective of the AGI. Second of all, I don’t think the things you’re saying are incompatible with graders, they’re just incompatible with “simple static graders”.
- TurnTrout 26 Dec 2022 18:59 UTC
  LW: 2 AF: 2
  0
  AF Parent
  Steve and I talked more, and I think the perceived disagreement stemmed from unclear writing on my part. I recently updated Don’t design agents which exploit adversarial inputs to clarify:
  ETA 12/26/22: When I write “grader optimization”, I don’t mean “optimization that includes a grader”, I mean “the grader’s output is the main/only quantity being optimized by the actor.”
  Therefore, if I consider five plans for what to do with my brother today and choose the one which sounds the most fun, I’m not a grader-optimizer relative my internal plan-is-fun? grader.
  However, if my only goal in life is to find and execute the plan which I would evaluate as being the most fun, then I would be a grader-optimizer relative to my fun-evaluation procedure.