Oops, I think the confusion is about what counts as “in-distribution”, probably because I myself used it inconsistently just now. In my other comment, I’d referred to training on a single society as “in-distribution”, but in the previous comment in this thread, “displace the human to a different society” was supposed to be part of the training.
Suppose that, as above, we're trying to train a utilitarian.
Imagine that instead of a single environment, we have a set of environments, e.g. a set of societies with different norms. Each society represents a different distribution, such that if we train an AI on a single society's norms, every other society would be OOD for it.
If we train on a single society, then gradient starvation would set in as you’re describing: the AI would adopt a bunch of shallow heuristics and have no incentive to develop the value-compilation setup.
But suppose instead that we train on many different societies, often throwing in ones that are OOD relative to the previous training data. The AI would then need to learn some setup for re-aligning its heuristics towards U even in completely unfamiliar circumstances, which I hypothesize to be value compilation as described here.
Thus, gradient starvation would never actually set in at the level of shallow heuristics.
(Instead, it’d set in at the level of value compilation — once that process consistently spits out a good proxy of U, the SGD would have no incentive to further align it; and that U-proxy may be quite far from utilitarianism.)
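To gesture at the training setup more concretely, here's a rough Python sketch of the curriculum shape I mean: sample from a pool of societies, and keep injecting ones that are OOD relative to everything trained on so far. Everything in it (`Society`, `make_novel_society`, `agent_update`, `novelty_rate`) is an illustrative placeholder, not an actual implementation.

```python
import random

# Minimal sketch of the curriculum described above, not a real training setup:
# a growing pool of "societies" (environments with different norms), with
# genuinely novel ones mixed in so that society-specific shallow heuristics
# keep getting invalidated. All names here are illustrative placeholders.

class Society:
    def __init__(self, norms):
        self.norms = norms  # whatever parameterizes this society's norms

    def sample_episode(self):
        # Stand-in for generating a training episode under these norms.
        return {"norms": self.norms, "situation": random.random()}

def make_novel_society(seen):
    # Stand-in for constructing a society that is OOD relative to everything
    # the agent has been trained on so far.
    return Society(norms=len(seen) + random.random())

def train(agent_update, base_societies, steps, novelty_rate=0.2):
    seen = list(base_societies)
    for _ in range(steps):
        # Periodically throw in a society that is OOD w.r.t. prior training
        # data, pushing the agent towards some general re-aligning machinery
        # rather than memorized per-society heuristics.
        if random.random() < novelty_rate:
            society = make_novel_society(seen)
            seen.append(society)
        else:
            society = random.choice(seen)
        agent_update(society.sample_episode())  # gradient step against U's feedback
```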