ETA 12/26/22: When I write “grader optimization”, I don’t mean “optimization that includes a grader”, I mean “the grader’s output is the main/only quantity being optimized by the actor.”
Therefore, if I consider five plans for what to do with my brother today and choose the one which sounds the most fun, I’m not a grader-optimizer relative to my internal plan-is-fun? grader.
However, if my only goal in life is to find and execute the plan which I would evaluate as being the most fun, then I would be a grader-optimizer relative to my fun-evaluation procedure.
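To make the distinction concrete, here is a toy sketch (all names and the scoring rule are illustrative inventions, not from the post). In the first case, the grader merely picks among a few plans the actor generated for its own reasons; in the second, the actor's search process directly optimizes the grader's output over plan space, which is what invites adversarial inputs:

```python
def grader(plan: str) -> int:
    """Stand-in fun-evaluation procedure: assigns a score to a plan."""
    return len(plan)  # toy scoring rule, purely for illustration

# Not grader optimization: the actor considers a handful of plans and
# uses the grader only to choose among them.
candidate_plans = ["watch a movie", "go sledding", "build a fort"]
chosen = max(candidate_plans, key=grader)

# Grader optimization: the actor's search directly pushes the grader's
# output as high as possible, here by repeatedly mutating a plan and
# keeping any mutation that raises the score.
def grader_optimize(seed_plan: str, steps: int = 50) -> str:
    plan = seed_plan
    for _ in range(steps):
        mutated = plan + "!"  # toy "plan mutation"
        if grader(mutated) > grader(plan):
            plan = mutated
    return plan

optimized = grader_optimize("go sledding")
```

Under this toy scoring rule, the grader-optimizer drives the score far beyond anything the honest candidate plans achieve, without the resulting plan being any more fun.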
Steve and I talked more, and I think the perceived disagreement stemmed from unclear writing on my part. I recently updated Don’t design agents which exploit adversarial inputs to clarify: