interesting. I’m very curious to what degree this behavior originates from emulating humans vs. RLHF; I expect it to come almost entirely from RLHF, but others seem to find emulating humans more plausible, I guess?
They had access to and tested the base, un-RLHF’d model. Doesn’t change much. The RLHF’d model has slightly higher misalignment and deception rates (which is a bit notable) but otherwise similar behavior.
Huh.