How do you think about empirical work on scalable oversight? A lot of scalable oversight methods do result in capabilities improvements if they work well. A few concrete examples where this might be the case:
Learning from Human Feedback
Debate
Iterated Amplification
Imitative Generalization
I’m curious which of the above you think it’s net good/bad to get working (or working better) in practice. I’m pretty confused about how to think about work on the above methods; they’re on the mainline path for some alignment agendas, but they also advance capabilities and reduce the serial time available to work on other alignment agendas.