I think the argument beyond 5D hinges on the claim that the Bs and Gs will be represented in the WM such that the GPS can take them as part of the problem specification.
Arguments in favor:
The GPS has been part of the heuristics (shards), so it needs to be able to use their communication channel. This implies that the GPS has reverse-engineered the heuristics. Since the GPS has write access to the WM, that implies the Bs and Gs might end up included there.
Once they're included in the WM, it isn't too hard for the SGD to point the GPS towards them. At that point, there's a positive feedback loop that incentivizes both (1) pointing the GPS even more towards the Bs and Gs and (2) making the B/G representation even more explicit in the WM.
Arguments against:
By the time 5D happens, the GPS should already be well-developed and part of the heuristics, which means they would be a very good approximation of U. This implies strong gradient starvation, so there just might not be any incentive for the SGD to do any of this.
If the GPS becomes critically reflective before sufficient B/G representation in the WM, then gradient hacking locks in the heuristic-driven GPS forever.
So, it's either (1) they do get represented and the arguments after 5D hold, or (2) they don't get represented and the heuristics end up dominating, which is basically the shard theory picture.
I think this might be one of the lines that divide your model from the Shard Theory model, and as of now I'm very uncertain as to which picture is more likely.
By the time 5D happens, the GPS should already be well-developed and part of the heuristics, which means they would be a very good approximation of U.
My argument is that they wouldn't actually be a good cross-context approximation of U, in part because of gradient starvation.
E.g., suppose we're training a human to be a utilitarian, and we're doing it on a dataset of the norms of a particular society. By default, the human would learn said norms and then stop there, because following the norms is good enough for making people happy in-distribution. If we then displace them to a different society, they'd try to act on their previous society's norms, and that's not going to make people in the new society happy.
To handle such distribution shifts, we instill in the human a desire to do value compilation: to figure out what their current values are for, then care about the output of value compilation and ignore its inputs (the initial norms).
So we get someone who starts out with local norms, figures out they're for making people happy, and, when they move, figures out what makes people happy in the new society and re-derives all the new heuristics for that. It's a sort of hack to avoid gradient starvation.
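To make the failure mode concrete, here's a toy sketch (everything in it — the societies, the actions, the "happiness" scores — is invented purely for illustration): an agent that just memorizes which actions were rewarded in its home society does fine there, but the same cached norms transfer badly to a society where different things make people happy.

```python
import random

# Hypothetical toy setup: each "society" maps actions to how happy they make people (a stand-in for U).
SOCIETY_A = {"bow": 1.0, "handshake": 0.2, "hug": 0.1}
SOCIETY_B = {"bow": 0.1, "handshake": 0.2, "hug": 1.0}

def learn_norms(society, steps=1000):
    """Memorize the average observed happiness per action -- the 'shallow heuristics'."""
    totals, counts = {}, {}
    for _ in range(steps):
        action = random.choice(list(society))
        totals[action] = totals.get(action, 0.0) + society[action]
        counts[action] = counts.get(action, 0) + 1
    return {a: totals[a] / counts[a] for a in totals}

def act(norms):
    """Follow the learned norms: pick whichever action was rewarded during training."""
    return max(norms, key=norms.get)

norms = learn_norms(SOCIETY_A)
print("happiness in society A:", SOCIETY_A[act(norms)])  # ~1.0: the norms are a fine proxy for U here
print("happiness in society B:", SOCIETY_B[act(norms)])  # ~0.1: same norms, but now people are unhappy
```

In-distribution there's no error signal left for the SGD to work with, so nothing pushes the agent past "follow the local norms"; the displacement to society B is what creates the pressure for something like value compilation.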
My argument is that they wouldn't actually be a good cross-context approximation of U, in part because of gradient starvation.
Ah, bad phrasing. Where you quoted me (the arguments-against part), I meant to say:
The heuristic-driven GPS is a very good approximation of U, but only within the in-distribution context
… and this is happening at a phase where the SGD is still the dominant force
… and the heuristic-driven GPS is doing all this without being explicitly aimed towards the Bs and Ds; rather, the GPS is just part of the “implicit” M → A procedures/modules
… therefore “gradient starvation would imply the SGD won’t have incentives to represent the Bs and Ds as part of the WM”
(I’m intentionally ignoring value compilation and focusing on 5D because (1) having the Bs and Ds represented seems like a necessary precursor for all the stuff that comes after, which makes it a useful part to zoom in on, and (2) I haven’t really settled my thoughts/confusions on value compilation)
Do my arguments in favor/against in the original comment capture your thoughts on how likely it is that the Bs and Ds get represented in the WM? And is your positive conclusion because any one of them seems more likely to matter?
Oops, I think the confusion is about what counts as “in-distribution”, probably because I myself used it inconsistently just now. In my other comment, I’d referred to training on a single society as “in-distribution”, but in the previous comment in this thread, “displace the human to a different society” was supposed to be part of the training.
Suppose that, as above, we're trying to train a utilitarian.
Imagine that instead of a single environment, we have a set of environments, e.g., a set of societies with different norms. Every society represents a different distribution, such that if we train an AI on a single society's norms, every other society would be OOD for it.
If we train on a single society, then gradient starvation would set in as you’re describing: the AI would adopt a bunch of shallow heuristics and have no incentive to develop the value-compilation setup.
But imagine that we're training on many different societies, often throwing in societies that are OOD relative to the previous training data. The AI would need to learn some setup for re-aligning its heuristics towards U even in completely unfamiliar circumstances, which I hypothesize to be value compilation as described here.
Thus, gradient starvation would never actually set in at the level of shallow heuristics.
(Instead, it’d set in at the level of value compilation — once that process consistently spits out a good proxy of U, the SGD would have no incentive to further align it; and that U-proxy may be quite far from utilitarianism.)
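As a very rough sketch of that contrast (the societies, the "happiness" feedback, and the regret bookkeeping below are all invented for illustration, and the update rule is a caricature of training, not a model of SGD): an agent that only caches the previous society's norms keeps making mistakes whenever a new society is thrown in, so the error signal never dries up; an agent that re-derives its norms from the underlying feedback in each new society stops making mistakes, and whatever pressure remains lands on that re-derivation machinery rather than on any fixed set of heuristics.

```python
import random

def random_society(actions=("bow", "handshake", "hug", "gift")):
    """A fresh 'society': a random mapping from actions to how happy they make people."""
    return {a: random.random() for a in actions}

def norm_follower(memory, society):
    """Acts on cached norms from past societies; only updates after acting in the new one."""
    action = max(memory, key=memory.get) if memory else random.choice(list(society))
    memory.update(society)  # the 'correction' arrives only after the OOD mistake is made
    return society[action]

def value_compiler(_, society):
    """Re-derives its norms in each new society from the underlying feedback (stand-in for U)."""
    action = max(society, key=society.get)
    return society[action]

def run(policy, n_societies=50):
    memory, regrets = {}, []
    for _ in range(n_societies):
        society = random_society()
        happiness = policy(memory, society)
        regrets.append(max(society.values()) - happiness)  # the error signal training would act on
    return sum(regrets) / len(regrets)

print("norm follower  avg regret:", round(run(norm_follower), 3))   # stays high: the signal never dries up
print("value compiler avg regret:", round(run(value_compiler), 3))  # ~0: starvation sets in at this level
```

That last line is the sense in which gradient starvation moves up a level: once the re-derivation process consistently produces a good proxy of U, there's nothing left for it to be corrected on.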