I agree it’s capable of this post-RLHF, but I would bet it’s less capable at it than the base model. It seems much more like a passive predictive capability (inferring properties of an author in order to continue text written by them, for instance) than an active communicative one, so I expect it to show up more strongly in settings that let the model exercise that predictive machinery directly. I don’t think RLHF completely masks these capabilities (and it certainly doesn’t seem to destroy them, as gwern’s comment above argues in more detail), but I expect it masks them to a non-trivial degree. For instance, I expect the base model to be better at inferring properties that authors rarely state explicitly, like their age or personality.
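For concreteness, here’s a minimal sketch of the kind of “passive” probe I have in mind, with GPT-2 standing in for an arbitrary base model (the prompt wording is made up for illustration, not a setup I’ve tested carefully):

```python
# Hypothetical sketch: eliciting author inference from a base model "passively",
# by framing it as a continuation task rather than a direct question.
# GPT-2 is only a stand-in for "some base model"; the prompt is illustrative.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

passage = (
    "I reckon the new phone update is alright, though my grandkids had to "
    "show me where the settings had moved to."
)

# "Passive" framing: the property inference is smuggled into plain continuation.
prompt = (
    passage
    + "\n\nEditor's note: judging from the style, the author of the passage above is"
)

for out in generator(prompt, max_new_tokens=25, do_sample=True, num_return_sequences=3):
    # Print only the continuation, not the prompt itself.
    print(out["generated_text"][len(prompt):].strip())
```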
Absolutely! I just thought it would be another interesting data point; I didn’t mean to suggest that RLHF has no effect on this.
That makes sense, and it’s definitely interesting in its own right!
Some informal experimentation on my part also suggests that the RLHF’d models are much less willing to make guesses about the user than about “an author”, although of course you can get around that by taking user text from one context and presenting it in another as if it were written by a separate author. I also wouldn’t be surprised if the RLHF’d models differ in their willingness to speculate about someone who’s well represented in the training data (i.e. in some sense a public figure) vs. someone who isn’t (e.g. a typical user).
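Roughly the kind of comparison I mean, sketched with the OpenAI Python client (the model name and prompt wording are placeholders, not the exact setup I used):

```python
# Rough sketch: the same text framed once as coming from "the user" and once as
# coming from a third-party "author", to compare the model's willingness to guess.
# Assumes the OpenAI Python client; model name and prompts are placeholders.
from openai import OpenAI

client = OpenAI()

text = "honestly the commute has been brutal lately, but the new job is worth it"

framings = {
    "as the user": [
        {"role": "user",
         "content": f"{text}\n\nBased on what I just wrote, guess my age and personality."},
    ],
    "as a separate author": [
        {"role": "user",
         "content": ("Here is a short passage by an anonymous author:\n\n"
                     f"\"{text}\"\n\nGuess the author's age and personality.")},
    ],
}

for label, messages in framings.items():
    resp = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    print(f"--- {label} ---")
    print(resp.choices[0].message.content)
```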
Yeah, that seems quite plausible to me. Among (many) other things, I expect that trying to fine-tune away hallucinations stunts the RLHF’d model’s capabilities in places where certain answers pattern-match as speculative, even when the model itself is (or should be) quite confident in the answer.