If I'm parsing this right, the initial state is something like 1⁄3 “I’m Luigi”, 1⁄3 “I’m Bowser”, and 1⁄3 “I’m Waluigi”, and RLHF eliminates the Bowser belief while having no effect on the other two.
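To make the arithmetic concrete, here is a minimal sketch of that reading. It assumes "eliminates" means setting the Bowser weight to zero and renormalizing, so the Luigi and Waluigi weights keep their ratio to each other; the `eliminate` helper and the dictionary of weights are illustrative only, not something taken from the post.

```python
from fractions import Fraction


def eliminate(prior, removed):
    """Drop one hypothesis and renormalize the rest, preserving their ratios.

    Assumption: this is a toy stand-in for "RLHF eliminates one belief and
    leaves the others alone", not a model of how RLHF actually works.
    """
    kept = {name: p for name, p in prior.items() if name != removed}
    total = sum(kept.values())
    return {name: p / total for name, p in kept.items()}


# Uniform starting weights over the three characters.
prior = {
    "Luigi": Fraction(1, 3),
    "Bowser": Fraction(1, 3),
    "Waluigi": Fraction(1, 3),
}

posterior = eliminate(prior, "Bowser")
for name, p in posterior.items():
    print(name, p)
```

Under that assumption the remaining weight splits evenly, 1⁄2 Luigi and 1⁄2 Waluigi: removing Bowser does nothing to separate the other two.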