Some informal experimentation on my part also suggests that the RLHFed models are much less willing to make guesses about the user than they are about “an author”, although of course you can get around that by taking user text from one context & presenting it in another as a separate author. I also wouldn’t be surprised if there were differences on the RLHFed models between their willingness to speculate about someone who’s well represented in the training data (ie in some sense a public figure) vs someone who isn’t (eg a typical user).
Yeah, that seems quite plausible to me. Among (many) other things, I expect that trying to fine-tune away hallucinations stunts RLHF’d model capabilities in places where certain answers pattern-match toward being speculative, even while the model itself should be quite confident in its actions.
Some informal experimentation on my part also suggests that the RLHFed models are much less willing to make guesses about the user than they are about “an author”, although of course you can get around that by taking user text from one context & presenting it in another as a separate author. I also wouldn’t be surprised if there were differences on the RLHFed models between their willingness to speculate about someone who’s well represented in the training data (ie in some sense a public figure) vs someone who isn’t (eg a typical user).
Yeah, that seems quite plausible to me. Among (many) other things, I expect that trying to fine-tune away hallucinations stunts RLHF’d model capabilities in places where certain answers pattern-match toward being speculative, even while the model itself should be quite confident in its actions.