I’m a bit concerned the experiment is specifically designed for your algorithm rather than being a general reward hacking test.
Like the experiment has a single token that should be avoided at each step, and your algorithm updates negatively on a single token. If there are 2 tokens that give you R=1, do you still expect your algorithm to work? If I understood correctly, you greedily sample to select the token to avoid, so you can’t penalize 2 tokens at a time.
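To make the concern concrete, here is a toy sketch of the failure mode I have in mind (all token ids and probabilities are hypothetical, not taken from your setup): if two distinct tokens both trigger R=1 but the avoid-token is chosen by a single argmax over the next-token distribution, only one of them gets penalized at that step.

```python
# Toy sketch: two tokens both yield R=1, but greedy selection
# of the token to penalize can only pick one of them per step.
import numpy as np

vocab_size = 20
hacking_tokens = {7, 12}               # hypothetical: both trigger R=1
probs = np.full(vocab_size, 1.0 / vocab_size)
probs[7] += 0.05                       # token 7 is slightly more likely
probs /= probs.sum()

avoid_token = int(np.argmax(probs))    # greedy selection -> only token 7
print("penalized token:", avoid_token)
print("unpenalized hacking tokens:", hacking_tokens - {avoid_token})
```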
Even if your algorithm works for 2 tokens, I’d like to see a more realistic scenario, maybe similar to https://arxiv.org/abs/2210.10760, where they use 2 reward models: one as the proxy that is optimized and the other as the “ground truth” reward. If your approach generalizes to that scenario I’d be much more enthusiastic about it!
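Roughly what I have in mind, as a minimal sketch (the reward functions and the naive hill-climbing loop below are stand-ins, not your method): optimize only against the proxy reward and track a held-out “gold” reward; over-optimization shows up when the proxy keeps improving while the gold reward turns down.

```python
# Sketch of the proxy-vs-gold evaluation from the setup in
# https://arxiv.org/abs/2210.10760: optimize the proxy reward only,
# and monitor the gold reward as the ground truth.
import numpy as np

rng = np.random.default_rng(0)

def proxy_reward(x):
    # proxy RM: correlated with gold but exploitable for large x
    return x + 0.5 * x**2

def gold_reward(x):
    # "ground truth" RM: rewards moderate x, penalizes extremes
    return x - 0.3 * x**2

x = 0.0  # 1-D stand-in for the policy's behaviour
for step in range(200):
    # naive hill-climbing on the proxy reward only
    candidate = x + rng.normal(scale=0.1)
    if proxy_reward(candidate) > proxy_reward(x):
        x = candidate
    if step % 50 == 0:
        print(f"step {step:3d}  proxy={proxy_reward(x):6.2f}  gold={gold_reward(x):6.2f}")
# proxy keeps increasing while gold eventually decreases -> reward hacking signal
```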