Corrigibility and actual human values are both heavily reflective concepts. If you master a requisite level of the prerequisite skill of noticing when a concept definition has a step where its boundary depends on your own internals rather than on pure facts about the environment (which of course most people can't do, because they project the category boundary onto the environment, though I give some credence that John Wentworth might be able to do it to some extent), and then you start mapping out concept definitions for corrigibility or values or, god help you, CEV, that might help highlight where some of my concern about unnatural abstractions comes in.
I agree with something like the claim that the definitions of concepts like human values depend on human internals and are reflective, and that the environment doesn't contain an objective morality or set of values (I'm a moral relativist, and sympathetic to moral anti-realism). But I wouldn't go as far as saying that data on human values in the environment is entirely uninformative. More importantly, without other background assumptions, this observation can't get us to a state where the alignment problem for AI is plausibly hard; it excludes too few models to be useful.