I think the answer turns out to be: “No, the sample efficiency and generalization are better than normal training.”
From my understanding of your results, this isn’t true for removing sycophancy, the original task I was talking about? My core claim was that removing blatant sycophancy like in this Anthropic dataset is pretty easy in practice.