The problem with RLHF/DPO is not that they don’t work, period; the problem is that we don’t know whether they work. I can imagine that we just scale to superintelligence, apply RLHF, and get an aligned ASI, but this would imply a bunch of things about reality, like “even at high levels of capability, reasonable RLHF data contains overwhelmingly good value-shaped thought patterns”, and I just don’t think we know enough about reality to make such statements.
I think this might be a crux, actually. I think it’s surprisingly common in history for things to work well empirically even though we either don’t understand how they work or took a long time to understand how they work.
AI development is the most central example, but I’d argue the invention of steel is another good example.
To put it another way, I’m relying on the fact that there have been empirically successful interventions where we either simply don’t know why they work, or it takes a long time to get a useful theory out of them.