(Here’s a possibly arcane remark. It’s worth noting that I think always-correct reward models are sufficient for high stakes alignment via runtime filtering (technically you just need to never give a very bad output decent reward). So, always-correct reward models would be great if you could get them. Note that “always correct” could look pretty different from current adversarial robustness; in particular, you also need to worry about collusion and things like seizing physical control of the reward channel. I also think that avoiding high stakes failure via adversarial training and/or other robustness-style tech seems generally good. I’ve been assuming we’re talking about the low stakes alignment problem (aka outer alignment).)
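(For concreteness, here’s a minimal sketch of what “runtime filtering against a reward model” could mean; the names `generate_candidates`, `reward_model`, and the reward floor are made-up stand-ins for illustration, not anything claimed in the remark above.)

```python
# Toy sketch of runtime filtering with a reward model. All names here are
# hypothetical: we assume some candidate sampler and some reward-scoring
# function are passed in by the caller.

from typing import Callable, List, Optional


def runtime_filter(
    prompt: str,
    generate_candidates: Callable[[str, int], List[str]],
    reward_model: Callable[[str, str], float],
    reward_floor: float = 0.0,  # rewards below this count as "very bad"
    num_candidates: int = 8,
) -> Optional[str]:
    """Return the highest-reward candidate that clears the floor, else refuse.

    The point of the remark: if the reward model never assigns decent reward
    to a catastrophically bad output, then any output that clears the floor
    is safe to emit, and the fallback is simply to refuse (return None).
    """
    candidates = generate_candidates(prompt, num_candidates)
    scored = [(reward_model(prompt, c), c) for c in candidates]
    acceptable = [(r, c) for r, c in scored if r >= reward_floor]
    if not acceptable:
        # Refuse rather than emit an output the reward model scores as very bad.
        return None
    return max(acceptable)[1]
```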