Alternatively, perhaps your galaxy-brain example is set at a point where the GPS has already overpowered SGD and can pursue whatever mesa-objective it had earlier on?
Yup. It doesn’t necessarily have to respect in-distribution behavior. To re-use an example:
Suppose that you have a shard that looks for conditions like “you’re at night in a forest and you heard rustling in the grass behind you”, then floods your system with adrenaline and gets you ready for a fight. Via self-reflection, you realize that this implements a heuristic that tries to warn you about a potential attack. You reason that this means you value not getting attacked. You adopt “not getting attacked” as your goal.
Then, you switch contexts. You’re in a different forest now, and you’re ~100% sure there are no dangerous animals or people around. You hear rustling in the grass behind you. You know it’s not a threat, so you suppress the shard’s insistence to turn around and get ready for a fight.
You do not respect your in-distribution behavior.
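To make the dynamic concrete, here's a minimal toy sketch of it in Python: the shard as a hardcoded condition-reaction rule, and the compiled value that suppresses it once the trigger is known to be benign. Every name, predicate, and threshold here is a hypothetical illustration of mine, not a claim about how shards are actually implemented.

```python
# Toy sketch (hypothetical, illustration only): a "shard" as a blind
# condition -> reaction rule, vs. the value the agent compiles out of it.

def rustling_shard(ctx):
    """In-distribution heuristic: fires on its trigger conditions, regardless of actual threat."""
    if ctx["night"] and ctx["forest"] and ctx["rustling_behind"]:
        return "flood adrenaline, turn around, prepare to fight"
    return None

def compiled_value(ctx):
    """Post-reflection goal the shard was reinterpreted as serving: don't get attacked."""
    if ctx["believed_threat_prob"] > 0.01:
        return "flood adrenaline, turn around, prepare to fight"
    return "ignore the rustling"  # suppress the shard when the trigger is known-benign

in_distribution = {"night": True, "forest": True, "rustling_behind": True,
                   "believed_threat_prob": 0.3}
out_of_distribution = {"night": True, "forest": True, "rustling_behind": True,
                       "believed_threat_prob": 0.0}

for name, ctx in [("in-distribution", in_distribution),
                  ("out-of-distribution", out_of_distribution)]:
    print(name, "| shard:", rustling_shard(ctx), "| compiled value:", compiled_value(ctx))

# In the second context the shard still fires, but an agent acting on the
# compiled value suppresses it: it does not respect its in-distribution behavior.
```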
The same can go for, e.g., moral principles. Suppose you grew up in a society with a lot of weird norms, like “it’s disrespectful to shake people awake”, and then did reflection on these learned norms and figured out that they’re optimized for making people happy. You adopt “make people happy” as your value, and then end up moving. In the new society, there are different norms. Instead of being confused by the norm-incompatibility, you just figure out what behaviors in this society make people happy, and do them.
Or, again, deontology to utilitarianism. “Don’t kill people” and similar rules are optimized for advancing human welfare, but if you adopt “human welfare” as your explicit goal, you may sometimes violate that initial rule to, e.g., kill a serial killer.
The concern here is that the AGI may do something like this with “keep humanity around”: that it’s just a local instantiation of some higher-level principle, one that can be served better by killing humans off and replacing them with something else. Like there’s no need to respect “be afraid of rustling” if the forest has no predators, or “don’t shake people awake” if the person you’re shaking awake doesn’t mind.
What even would a good OOD extrapolation of human values look like?
No idea. I mean, it presumably involves building an immortal eudaimonic utopia, whatever “eudaimonia” means, but no, I haven’t solved the entirety of moral philosophy. I’ve just developed a model for describing the process of moral philosophy.
Why do you think the galaxy-brained merger is bad?
See above.
Is my understanding correct that the GPS’s tendency to value-compile initially forms while SGD is still a dominant force in the training dynamic?
Yup.
And when SGD loses control and GPS’s value-compilation becomes the dominant force, how would that value-compilation generalize? Would it do so in a way that respects earlier-in-training in-distribution samples? (like not killing humans)
The concern is that it wouldn’t respect in-distribution samples, because of inner misalignment: it wouldn’t generalize the values into the actual outer objective; it’d generalize them towards some not-that-good correlate of the actual training objective (the way human values are a proxy goal for “maximize inclusive genetic fitness”).
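As a rough illustration of “generalizing towards a correlate”: here's a toy numeric sketch in which a learned proxy agrees with the outer objective everywhere on the training range and comes apart beyond it. The functions and numbers are made up for illustration, not a model of any real training setup.

```python
# Toy sketch (made-up functions): outer objective vs. a learned correlate that
# matches it in-distribution and diverges out of distribution.

def outer_objective(x):
    return x  # what the training signal was "actually" selecting for

def compiled_proxy(x):
    return x if x <= 10 else 20 - x  # agrees on the training range, anti-correlates past it

train_range = range(0, 11)  # "in-distribution samples"
print(all(outer_objective(x) == compiled_proxy(x) for x in train_range))  # True: looks aligned
print([(x, outer_objective(x), compiled_proxy(x)) for x in (15, 20)])     # diverges badly OOD
```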