During the training process, we observe that CoT often exhibits language mixing, particularly when RL prompts involve multiple languages. To mitigate the issue of language mixing, we introduce a language consistency reward during RL training, which is calculated as the proportion of target language words in the CoT. Although ablation experiments show that such alignment results in a slight degradation in the model’s performance, this reward aligns with human preferences, making it more readable.
I also found this trade-off between human readability and performance noteworthy.
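For concreteness, here is a minimal sketch of how such a reward could be computed. The paper only says it is the proportion of target-language words in the CoT; the whitespace splitting and the language check below are my own assumptions.

```python
# Minimal sketch of a language-consistency reward, assuming whitespace
# word splitting and a caller-supplied language check. The exact
# tokenization and detection method are assumptions, not the paper's.

def language_consistency_reward(cot_text: str, is_target_language) -> float:
    """Fraction of words in the chain of thought accepted by the language check."""
    words = cot_text.split()
    if not words:
        return 0.0
    return sum(1 for w in words if is_target_language(w)) / len(words)

# Toy usage: treat ASCII-only words as "target language" purely for illustration.
reward = language_consistency_reward(
    "Solve for x 然后 substitute",
    is_target_language=lambda w: w.isascii(),
)
print(round(reward, 2))  # 0.8 (4 of the 5 words pass the check)
```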
Yes, “fair” here means that their subjective EVs are equal. The post referenced in the sibling comment calls this “Even Odds”, which is probably a better name.
I did not realize that. Thank you for the reference!
If Alice thinks X happens with a probability of 20% while Bob thinks it’s 40%, what would be a fair bet between them?
I created a Claude Artifact, which calculates a bet such that the expected value is the same for both.
In this case, Bob wins if X happens (he thinks it’s more likely). If Alice bets $100, he should bet $42.86, and the EV of such a bet for both players (according to their beliefs) is $14.29.

EDIT: I updated the calculator to correctly handle the case when A’s probability is higher than B’s.
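For reference, here is a minimal sketch of the calculation (my reconstruction, not the artifact’s actual code): setting both players’ subjective EVs equal gives the stake of the higher-probability player as stake_a * (p_a + p_b) / (2 - p_a - p_b).

```python
def fair_bet(p_a: float, p_b: float, stake_a: float = 100.0):
    """Stake for B (the player with the higher probability) that equalizes
    both players' subjective EVs. Convention: B wins stake_a if X happens,
    A wins stake_b otherwise. Assumes p_b > p_a."""
    # Equalize: p_b*stake_a - (1-p_b)*stake_b == (1-p_a)*stake_b - p_a*stake_a
    stake_b = stake_a * (p_a + p_b) / (2 - p_a - p_b)
    ev = p_b * stake_a - (1 - p_b) * stake_b  # identical from A's point of view
    return stake_b, ev

print(fair_bet(0.20, 0.40))  # (42.857..., 14.285...), i.e. $42.86 and $14.29
```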
Jakub Halmeš's Shortform
The Inner Alignment Problem
I wrote this mostly for personal purposes. I wanted to organize my thoughts about the problem while reading the paper, and publishing the notes, even if no one reads them, forces me to write more clearly and precisely.
I would like some feedback on whether posts like this one are valuable to other people. Please let me know! Thank you.
I wonder if you could take the R1-Zero training regime, penalize or restrict the use of existing words from any language (perhaps only in the scratchpad, not the final response), and obtain a model that can solve math problems by reasoning in a non-existent language.
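A rough sketch of what such a penalty could look like, assuming a word list covering existing languages is available; both the word list and the word-level granularity are assumptions (in practice this would probably operate on tokens):

```python
# Hypothetical penalty term for an R1-Zero-style RL setup: punish scratchpad
# words that appear in a dictionary of existing languages. `known_words` is a
# stand-in for such a dictionary; how to build it is left open.

def existing_word_penalty(scratchpad: str, known_words: set) -> float:
    """Negative reward proportional to the fraction of recognized words."""
    words = scratchpad.lower().split()
    if not words:
        return 0.0
    recognized = sum(1 for w in words if w in known_words)
    return -recognized / len(words)

# Toy usage with a tiny stand-in dictionary.
print(existing_word_penalty("blorp zantu therefore", {"the", "and", "therefore"}))
# -0.333... (one of the three words is an existing English word)
```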