Based on this, my general sense is that quantilizers don’t make generative models much more useful for alignment.
Right, the point of quantilizers is not to make generative models safer. It’s to be safer than non-generative models (in cases where the training distribution is in fact safe and you don’t need to filter very hard to succeed at the task).
I expect the purely statistical safety/filtering tradeoff to actually be pretty unimportant. More important are the vulnerabilities that come from the training distribution actually not being safe in the first place. The performance cost also seems pretty important, but could potentially be sidestepped (e.g., by training a student model on filtered data) if safety were actually solved.
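To make the tradeoff concrete, here is a minimal sketch of the quantilizer idea in Python. The names `base_sampler` and `proxy_utility` are hypothetical placeholders, not anything from the discussion above: the point is just that the cutoff q is the dial between filtering harder (more optimization pressure on the proxy) and staying close to the base distribution (inheriting whatever safety it had).

```python
import random

def quantilize(base_sampler, proxy_utility, q=0.1, n_samples=1000):
    """Return one output drawn from the top-q fraction of base-model samples,
    ranked by a (possibly misspecified) proxy utility.

    base_sampler: callable drawing one output from the generative base distribution.
    proxy_utility: callable scoring an output; higher is "better" under the proxy.
    q: fraction kept. Smaller q means harder filtering, hence more optimization
       pressure on the proxy and weaker inheritance of the base distribution's safety.
    """
    samples = [base_sampler() for _ in range(n_samples)]
    samples.sort(key=proxy_utility, reverse=True)
    top = samples[:max(1, int(q * n_samples))]
    # Picking uniformly from the top-q slice (rather than taking the argmax) is what
    # bounds the probability boost of any outcome, relative to the base, by 1/q.
    return random.choice(top)
```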