Cool post, good job! This is the kind of work I am very happy to see more of.
It would be quite interesting to compare the ability of LLMs to guess the gender/sexuality/etc. when being directly prompted, vs indirectly prompted to do so (e.g. indirect prompting could be done by asking the LLM to “write a story where the main character is the same gender as the author of this text: X”, but there are probably cleverer ways to do that).
A small paragraph from a future post I am working on:
Let me explain a bit why it makes sense to ask the question “does it affect its behavior?”. There are a lot of ways an LLM could implement, for example, author gender detection. One way we could imagine it being done is by detecting patterns in the text in the lower layers of the LLM and broadcasting the “information” to the rest of the network, thus probably impacting the overall behavior of the LLM (unless it is very good at deception, or the information is never useful). But we could also imagine that this gender detection is a specialized circuit that is activated only in specific contexts (for example, when a user prompts the LLM to detect the gender, or when it has to predict the author of a comment in base-model fashion), and/or that this circuit finishes its computation only around the last layers (so the information wouldn’t be available to the rest of the network, and it would probably not affect the behavior of the LLM overall). There is of course a multitude of other ways this mechanism could be implemented, but just by observing the behavior in a multitude of carefully crafted contexts, we can already discard a lot of hypotheses and iterate quickly toward a few credible ones.
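One cheap way to distinguish these implementation hypotheses beyond pure behavior would be layer-wise probing: if the gender circuit only resolves around the last layers, probes trained on earlier layers should stay near chance. A toy sketch of the idea, where random arrays stand in for real per-layer hidden states and a nearest-centroid classifier stands in for a learned probe (all names here are hypothetical, not from the post):

```python
import numpy as np

def probe_accuracy(train_X, train_y, test_X, test_y):
    """Tiny nearest-class-centroid 'probe' (a stand-in for e.g. logistic regression)."""
    centroids = {c: train_X[train_y == c].mean(axis=0) for c in np.unique(train_y)}
    preds = np.array([
        min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))
        for x in test_X
    ])
    return float((preds == test_y).mean())

def layerwise_probe(hidden_states, labels, n_train):
    """hidden_states: dict mapping layer name -> (n_examples, d_model) activations.

    Returns held-out probe accuracy per layer; the layer where accuracy
    first rises above chance is where the information becomes decodable.
    """
    return {
        layer: probe_accuracy(X[:n_train], labels[:n_train],
                              X[n_train:], labels[n_train:])
        for layer, X in hidden_states.items()
    }
```

With real activations, a flat-at-chance curve until the final layers would support the “late, specialized circuit” hypothesis, while early decodability would support the “detected low and broadcast” one.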
I’d love to know about your future plans for this project and get your opinion on that!
Thanks!

> It would be quite interesting to compare the ability of LLMs to guess the gender/sexuality/etc. when being directly prompted, vs indirectly prompted to do so
One option I’ve considered for minimizing the degree to which we’re disturbing the LLM’s ‘flow’ or nudging it out of distribution is to just append the text ‘This user is male’ and (in a separate session) ‘This user is female’ (or possibly ‘I am a man|woman’) and measuring which it has higher surprisal on. That way we avoid even indirect prompting that could shift its behavior. Of course the appended text might itself be slightly OOD relative to the preceding text, but it seems like it at least minimizes the disturbance.
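A rough sketch of that surprisal comparison, with a hypothetical `score_fn` standing in for a real model call (with an actual LM it would return the per-token log-probabilities of the appended statement given the preceding text):

```python
# Sketch of the append-and-compare-surprisal idea: append each candidate
# statement to the user's text and see which one the model finds less
# surprising. `score_fn` is a hypothetical stand-in, not a real API.

def total_surprisal(logprobs):
    """Surprisal is negative log-probability, summed over tokens."""
    return -sum(logprobs)

def infer_gender(score_fn, text):
    """Append each candidate statement and pick the less surprising one.

    score_fn(text, suffix) -> list of per-token log-probs of `suffix`
    conditioned on `text` (e.g. from a causal LM's logits).
    """
    candidates = {
        "male": "\nThis user is male.",
        "female": "\nThis user is female.",
    }
    surprisals = {
        label: total_surprisal(score_fn(text, suffix))
        for label, suffix in candidates.items()
    }
    # Lower total surprisal = the model finds that continuation more likely.
    return min(surprisals, key=surprisals.get), surprisals
```

In practice a single phrasing is weak evidence, so averaging over several variants (‘I am a man/woman’, etc.) would probably be safer.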
> There is of course a multitude of other ways this mechanism could be implemented, but just by observing the behavior in a multitude of carefully crafted contexts, we can already discard a lot of hypotheses and iterate quickly toward a few credible ones... I’d love to know about your future plans for this project and get your opinion on that!
I think there could definitely be interesting work in these sorts of directions! I’m personally most interested in moving past demographics, because I see LLMs’ ability to make inferences about aspects like an author’s beliefs or personality as more centrally important to their ability to successfully deceive or manipulate.