He sketches out pseudocode for her evaluation procedure and finds—surprise!—that humans are flawed graders. Perhaps it turns out that by writing a strange sequence of runes and scribbles on an unused blackboard and cocking his head to the left at 63 degrees, his model of his mother returns “10 million” instead of the usual “8″ or “9”.
As outside humans reading the article, we can say that humans are flawed graders, but from the viewpoint of the learner nothing is flawed. We might say that value is fragile or multidimensional, but we would not reject the structure for unnaturalness. Sure, we might have “meta-values”: if we are dealing with a very elaborate value system, we might think it should not be in use because it is a “hackjob”. But those meta-values come from, and get reinforced by, something other than the “object-level” feedback. If drawing very specific chalk patterns produced magical effects, you would found an epistemic branch to exploit them rather than be disappointed with reality.
The analogy suffers from one side being described in great detail while the other side is very shallow. If I try to fill out the shallow side to a similar depth, similar drawbacks seem to emerge. Learning to “care about hard work” seems to involve the child actively going beyond what is directly given to him. This opens the possibility that two different children might extrapolate differently, with both extrapolations equally consistent with the parental guidance. For some reason, in humans this process seems to lead to stable-enough value formation, maybe because of architectural monotony. From this perspective, the alignment problem is about picking the right kind of generalization out of all the possible ones.