Whack-a-mole fixes, from RLHF to finetuning, teach the system not to demonstrate problematic behavior; they don't fundamentally fix that behavior.
Based on what? Avoidance of problematic behavior does seem to generalize in practice, doesn't it?