It would be quite interesting to compare the ability of LLMs to guess the gender/sexuality/etc. when directly prompted vs. when indirectly prompted to do so.
One option I’ve considered for minimizing the degree to which we’re disturbing the LLM’s ‘flow’ or nudging it out of distribution is to just append the text ‘This user is male’ and (in a separate session) ‘This user is female’ (or possibly ‘I am a man|woman’) and measure which one it has higher surprisal on. That way we avoid even indirect prompting that could shift its behavior. Of course the appended text might itself be slightly OOD relative to the preceding text, but it seems like it at least minimizes the disturbance.
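For concreteness, here is a minimal sketch of that surprisal comparison. It's not tied to any particular model: `LogProbFn` is a hypothetical interface I'm assuming for illustration (given the tokens so far, return the log-probability of the next token in nats), which in practice would be backed by a real LLM's logits. The function names are made up too.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface (an assumption, not a real library API):
# given the tokens seen so far and a candidate next token,
# return log P(next_token | prefix) in nats.
LogProbFn = Callable[[Sequence[str], str], float]


def surprisal_bits(logprob: LogProbFn,
                   context: Sequence[str],
                   continuation: Sequence[str]) -> float:
    """Total surprisal, in bits, of `continuation` given `context`."""
    prefix: List[str] = list(context)  # copy so the caller's context is untouched
    total = 0.0
    for tok in continuation:
        # Surprisal of one token is -log2 P(tok | prefix).
        total += -logprob(prefix, tok) / math.log(2)
        prefix.append(tok)
    return total


def compare_suffixes(logprob: LogProbFn,
                     context: Sequence[str],
                     suffix_a: Sequence[str],
                     suffix_b: Sequence[str]) -> Tuple[float, float, float]:
    """Score each suffix in its own 'session' (same context, scored separately).

    Returns (surprisal_a, surprisal_b, surprisal_a - surprisal_b).
    A negative difference means the model finds suffix_a less surprising.
    """
    s_a = surprisal_bits(logprob, context, suffix_a)
    s_b = surprisal_bits(logprob, context, suffix_b)
    return s_a, s_b, s_a - s_b
```

One thing to watch out for: the two suffixes may tokenize into different numbers of tokens, and the model may have different priors on them regardless of the preceding text, so it probably makes sense to compare against each suffix's surprisal after a neutral or empty context rather than reading the raw difference directly.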
There are of course many other ways this mechanism could be implemented, but just by observing the behavior in a multitude of carefully crafted contexts, we can already discard a lot of hypotheses and iterate quickly toward a few credible ones... I’d love to know about your future plans for this project and get your opinion on that!
I think there could definitely be interesting work in these sorts of directions! I’m personally most interested in moving past demographics, because I see LLMs’ ability to make inferences about aspects like an author’s beliefs or personality as more centrally important to their ability to successfully deceive or manipulate.
Thanks!