I agree that one major mitigation is to keep training your policy online, but that doesn’t necessarily prevent a misaligned policy from taking over the world before the training has time to fix its mistakes. In particular, if the policy is reasoning “I’ll behave well until the moment I strike”, and your reward function can’t detect that (it only detects whether the output was good), then the policy will look great until the moment it takes over.