Well, they did this with o3's deliberative alignment paper. The results seem promising, but they used an “easy” OOD test for LLMs (language), and didn't compare it to the existing baseline of RLHF. Still an interesting paper.