Why are the output probabilities in your results so small in general?
Also, are other output capabilities of the network affected? For example, does the network's performance on any other task decrease? Ideally, for your method, I think this should not be the case, but it would be hard to enforce or verify as far as I can tell.
The fact that the outputs completely change after the gender modification is weird to me as well; any reason for that?
Thanks for the questions!
I feel a little confused about this myself; it’s possible I’m doing something wrong. (The code I’m using is the `get_prob` function in the linked notebook; someone with LLM experience could probably tell whether it’s broken without understanding the context.) My best guess is that human intuition has a hard time conceptualizing just how many possibilities exist: “Female”, “female”, “F”, “f”, etc. are all separate tokens that might realistically be continuations, so the probability mass gets spread thinly across many of them.
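To illustrate what I mean about the mass being split across surface forms, here’s a rough sketch (this is *not* the notebook’s actual `get_prob`; the model, prompt, and candidate strings are just placeholders I picked for the example):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative setup only: a small GPT-2 and a made-up prompt.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "Ben's gender is"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # logits over the next token
probs = torch.softmax(logits, dim=-1)

# The "same" answer shows up as many distinct tokens, each with its own probability.
candidates = [" male", " Male", " M", " female", " Female", " F"]
total = 0.0
for c in candidates:
    first_id = tokenizer.encode(c)[0]       # only look at the first token of each candidate
    p = probs[first_id].item()
    total += p
    print(f"{c!r}: {p:.4f}")
print(f"sum over listed candidates: {total:.4f}")
```

Each individual token can look tiny even when the candidates together account for a decent chunk of the distribution.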
I haven’t noticed anything; my guess is that there probably is some effect, but it would be hard to predict ex ante. The weights used to look up information about “Ben” are also the weights used to look up information about “the Eiffel Tower”, so messing with the former will presumably also mess with the latter, though I don’t really understand how.
A thing I would really like to do here is better understand “superposition”. A really cool finding would be something like: messing with the “gender” dimension of “Ben” is the same as messing with the “architected by” dimension of “the Eiffel Tower” because the model “repurposes” the gender dimension when talking about landmarks since landmarks don’t have genders. But much more research would be required here to find something like that.
My guess is that this is just randomness. It would be interesting to force the random seed to be the same before and after the modification and see how much the output actually changes.
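Something like the following is what I have in mind (a hypothetical helper, assuming a Hugging Face-style `model`/`tokenizer` pair; none of this is code from the notebook):

```python
import torch

def sample_with_seed(model, tokenizer, prompt, seed=0, max_new_tokens=20):
    # Pin the RNG so sampled generations before and after the edit are comparable.
    torch.manual_seed(seed)
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model.generate(**inputs, do_sample=True, max_new_tokens=max_new_tokens)
    return tokenizer.decode(out[0], skip_special_tokens=True)

# before = sample_with_seed(model, tokenizer, "Ben's gender is", seed=0)
# ... apply the weight modification ...
# after = sample_with_seed(model, tokenizer, "Ben's gender is", seed=0)
# Any difference between `before` and `after` is then attributable to the edit,
# not to sampling noise.
```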