My argument is that they wouldn’t actually be a good cross-context approximation of U, in part because of gradient starvation.
Ah, bad phrasing. Where you quoted me (the arguments-against part), I meant to say:
Heuristic-driven-GPS is a very good approximation of U only within in-distribution contexts
… and this is happening at a phase where SGD is still the dominant force
… and Heuristic-driven-GPS is doing all this without being explicitly aimed at Bs and Ds; rather, the GPS is just part of the “implicit” M → A procedures/modules
… therefore “gradient starvation would imply SGD won’t have incentives to represent Bs and Ds as part of the WM”
(I’m intentionally ignoring value compilation and focusing on 5D because (1) having Bs and Ds represented seems like a necessary precursor for everything that comes after, which makes it a useful part to zoom in on, and (2) I haven’t really settled my thoughts/confusions on value compilation)
Do my arguments for/against in the original comment capture your thoughts on how likely it is that Bs and Ds get represented in the WM? And is your positive conclusion because one of them seems more likely to matter?
Oops, I think the confusion is about what counts as “in-distribution”, probably because I myself used it inconsistently just now. In my other comment, I’d referred to training on a single society as “in-distribution”, but in the previous comment in this thread, “displace the human to a different society” was supposed to be part of the training.
Suppose that, as above, we’re trying to train a utilitarian.
Imagine that instead of a single environment, we have a set of environments, e.g. a set of societies with different norms. Every society represents a different distribution, such that if we train an AI on a single society’s norms, every other society would be OOD for it.
If we train on a single society, then gradient starvation would set in as you’re describing: the AI would adopt a bunch of shallow heuristics and have no incentive to develop the value-compilation setup.
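A toy sketch of that mechanism (purely illustrative; the two features and all numbers below are made up, not anything from the post): a logistic regression where one feature plays the role of a society-specific shallow heuristic and another, noisier feature would generalize better. Once the shallow feature drives the residuals near zero, the gradient on the more general feature vanishes as well, so it never gets properly learned.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
y = rng.integers(0, 2, n).astype(float)
x_shallow = 2.0 * y - 1.0                                    # "shallow heuristic": perfect in this one society
x_general = 0.3 * (2.0 * y - 1.0) + rng.normal(0.0, 1.0, n)  # weaker but more general signal
X = np.stack([x_shallow, x_general], axis=1)

w = np.zeros(2)
lr = 0.5
for step in range(3001):
    p = 1.0 / (1.0 + np.exp(-X @ w))        # predicted probabilities
    grad = X.T @ (p - y) / n                # gradient of the mean logistic loss
    w -= lr * grad
    if step % 1000 == 0:
        # w[0] keeps growing while the residual (p - y) shrinks, so the gradient
        # on w[1] is starved and the general feature stays mostly unlearned.
        print(f"step {step}: w = {w.round(3)}, |grad| = {np.abs(grad).round(5)}")
```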
But imagine we’re instead training on many different societies, often throwing in societies that are OOD relative to the previous training data. The AI would need to learn some setup for re-aligning its heuristics towards U even in completely unfamiliar circumstances, which I hypothesize to be value compilation as described here.
Thus, gradient starvation would never actually set in at the level of shallow heuristics.
(Instead, it’d set in at the level of value compilation — once that process consistently spits out a good proxy of U, the SGD would have no incentive to further align it; and that U-proxy may be quite far from utilitarianism.)
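A minimal sketch of the multi-society curriculum above (purely illustrative: the classes and functions are hypothetical stand-ins, and the agent and SGD update are stubbed out). The only point is the curriculum structure: keep injecting societies that are OOD relative to everything trained on so far, so society-specific shallow heuristics keep breaking.

```python
import random

class Society:
    """A toy environment: a norm vector stands in for a society's local conventions."""
    def __init__(self, norms):
        self.norms = norms

    def sample_episode(self):
        return {"norms": self.norms, "situation": random.random()}

def make_society(rng, shift=0.0):
    # `shift` controls how far this society's norms sit from the original pool,
    # i.e. how OOD it is relative to the previous training data.
    return Society([rng.gauss(shift, 1.0) for _ in range(4)])

def sgd_step(agent_params, episode):
    # Stub for an SGD update of the agent against the utilitarian objective U.
    return agent_params

rng = random.Random(0)
pool = [make_society(rng) for _ in range(8)]    # initial "in-distribution" societies
agent_params = {}

for step in range(10_000):
    if step > 0 and step % 500 == 0:
        # Periodically add a society drawn from a progressively shifted distribution,
        # so the agent's current heuristics are guaranteed to go out of distribution.
        pool.append(make_society(rng, shift=step / 1000))
    society = rng.choice(pool)
    agent_params = sgd_step(agent_params, society.sample_episode())
```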