Consider two common alignment design patterns: [...] (2) Fixing a utility function and then argmaxing over all possible plans.
Wait: fixing a utility function and then argmaxing over all possible plans is not an alignment design pattern, it is the bog-standard operational definition of what an optimal-policy MDP agent should do. This is what Stuart Russell calls the ‘standard model’ of AI. This is an agent design pattern, not an alignment design pattern. To be an alignment design pattern in my book, you have to be adding something extra or doing something different that is not yet in the bog-standard agent design.
I think you are showing that an actor-grader is just a utility maximiser in a fancy linguistic dress. Again, not an alignment design pattern in my book.
Though your use of the word doomed sounds too absolute to me, I agree with the main technical points in your analysis. But I would feel better if you change the terminology from alignment design pattern to agent design pattern.
Wait: fixing a utility function and then argmaxing over all possible plans is not an alignment design pattern, it is the bog-standard operational definition of what an optimal-policy MDP agent should do. This is what Stuart Russell calls the ‘standard model’ of AI. This is an agent design pattern, not an alignment design pattern. To be an alignment design pattern in my book, you have to be adding something extra or doing something different that is not yet in the bog-standard agent design.
I think you are showing that an actor-grader is just a utility maximiser in a fancy linguistic dress. Again, not an alignment design pattern in my book.
Though your use of the word doomed sounds too absolute to me, I agree with the main technical points in your analysis. But I would feel better if you change the terminology from alignment design pattern to agent design pattern.