I just wanted to say thanks for writing this. It is important and interesting, and it is helping to shape and clarify my views.
I would love to hear a training story where a good outcome for humanity is plausibly achieved using these ideas. I guess it'd rely heavily on interpretability to verify what shards / values are being formed early in training, plus regular changes to the training scenario and reward function to steer them before the agent is capable enough to subvert attempts to change it.
Edit: I forgot you also wrote A shot at the diamond-alignment problem, which is basically this, though it assumes only simple training techniques (no advanced interpretability) to solve a simpler problem.