In section 4.6, you described an “unnatural” reward function splintering, and went on to advocate for more natural ones. I would agree with your argument as a general principle, but on the other hand I can think of situations where an exceptional case should be accounted for. Suppose that the manager of the rube-blegg factory keeps a single rube and a single blegg in a display case on the factory floor to present to touring visitors. A rube classifier which physically sorts rubes and bleggs should be able to recognize that these displayed examples are not to be sorted with the others, even though this requires making an unnatural extension of its internal reward function. I think your examples in Section 6 of suitably deferring to human values upon model splintering could resolve this, but to me it highlights that a naive approach to model splintering could result in problems if the AI is not keeping track of enough features of the world to identify when an automatic “natural” extension of its model is inappropriate.
In section 4.6, you described an “unnatural” reward function splintering, and went on to advocate for more natural ones. I would agree with your argument as a general principle, but on the other hand I can think of situations where an exceptional case should be accounted for. Suppose that the manager of the rube-blegg factory keeps a single rube and a single blegg in a display case on the factory floor to present to touring visitors. A rube classifier which physically sorts rubes and bleggs should be able to recognize that these displayed examples are not to be sorted with the others, even though this requires making an unnatural extension of its internal reward function.
I think your examples in Section 6 of suitably deferring to human values upon model splintering could resolve this, but to me it highlights that a naive approach to model splintering could result in problems if the AI is not keeping track of enough features of the world to identify when an automatic “natural” extension of its model is inappropriate.