What’s up with the <pad> token (<pad>==<bos>==<eos> in Pythia) in the attention diagram? I assume that doesn’t need to be there?
I’m not sure! My guess is that some athlete names were two tokens and others were three tokens (or longer), so we left-padded to make all prompts the same length (and masked the attention so it couldn’t attend to the padding tokens). We definitely didn’t need to do this and could have just filtered for two-token names; it’s not an important detail.
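
For concreteness, here is a minimal sketch of what left padding with an attention mask looks like. It is not the code we actually ran: the `left_pad` helper, the `PAD_ID` value, and the token ids are all made up for illustration.

```python
from typing import List, Tuple

# Hypothetical placeholder id standing in for the shared <pad>/<bos>/<eos> token.
PAD_ID = 0


def left_pad(
    seqs: List[List[int]], pad_id: int = PAD_ID
) -> Tuple[List[List[int]], List[List[int]]]:
    """Left-pad token-id lists to a common length and build a 0/1 attention mask."""
    max_len = max(len(s) for s in seqs)
    padded, mask = [], []
    for s in seqs:
        n_pad = max_len - len(s)
        # Padding goes on the left so the real tokens stay right-aligned.
        padded.append([pad_id] * n_pad + s)
        # 0 marks padding positions that should not be attended to, 1 marks real tokens.
        mask.append([0] * n_pad + [1] * len(s))
    return padded, mask


# Made-up token ids: one prompt with a two-token name, one with a three-token name.
prompts = [[11, 12, 21, 22], [11, 12, 31, 32, 33]]
padded, mask = left_pad(prompts)
print(padded)  # [[0, 11, 12, 21, 22], [11, 12, 31, 32, 33]]
print(mask)    # [[0, 1, 1, 1, 1], [1, 1, 1, 1, 1]]
```

With left padding the final token of every prompt sits at the same position, which is presumably why the pad token shows up at the start of the shorter prompts in a diagram. Filtering to two-token names would have avoided the padding entirely.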
Ok thanks!