They had access to and tested the base un-RLHF’d model. It doesn’t change much. The RLHF’d model has slightly higher misalignment and deception rates (which is a bit notable) but otherwise similar behavior.
Huh.