Thanks for this post, it’s clear and insightful about RLHF.
From an alignment perspective, would you say that your work gives evidence that we should focus most of our energy on finding guarantees about the target distribution we're aiming for and debugging problems there, rather than on guarantees about the inference?
(I still expect we'll want to understand inference better and how it can break, but your post seems to push towards putting less focus on that part.)
I’m glad you found our post insightful!
I'm not sure what the best energy allocation between modelling and inference is here. I think, however, that the modelling part is more neglected (the target distribution is rarely even considered as something that can be written down and analysed). Moreover, designing good target distributions can be quite alignment-specific, whereas designing algorithms for inference in probabilistic graphical models is an extremely generic research problem, so we can expect progress there anyway.
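To make "written down and analysed" concrete, here is a minimal sketch of the target distribution in the KL-regularized RLHF setting, i.e. p*(x) ∝ π₀(x) · exp(r(x)/β), where π₀ is the pretrained model, r the reward model, and β the KL coefficient. The sequences, prior probabilities, and reward values below are toy numbers chosen purely for illustration, not anything from the post:

```python
import numpy as np

# Toy illustration: the KL-regularized RLHF objective has a closed-form
# target distribution p*(x) ∝ pi0(x) * exp(r(x) / beta), where pi0 is the
# prior (pretrained) model and beta is the KL coefficient.
# The sequences, prior, and rewards here are made-up toy values.

sequences = ["a", "b", "c"]          # stand-ins for full text sequences
pi0 = np.array([0.7, 0.2, 0.1])      # prior probabilities under the pretrained model
reward = np.array([0.0, 1.0, 2.0])   # hypothetical reward-model scores
beta = 0.5                           # KL penalty coefficient

unnormalised = pi0 * np.exp(reward / beta)
p_target = unnormalised / unnormalised.sum()  # the distribution we can write down and analyse

for seq, p in zip(sequences, p_target):
    print(f"p*({seq}) = {p:.3f}")
```

Once the target is an explicit object like `p_target`, questions such as where it places mass relative to the prior, or how sensitive it is to β and to errors in the reward, become modelling questions that can be studied separately from how well the RL training procedure approximates the inference.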