Looking at the end of the token list for anomalous tokens seems like a good place to start, but the “ petertodd” token was actually about 3⁄4 of the way through the list (index 37,444 on the 50k model, which corresponds to roughly 74,888 on the 100k model). If anomalous tokens follow a similar “typology” regardless of the tokenizer used, then their locations in the overall list might correlate in meaningful ways. Maybe worth looking into.
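As a rough sketch of the arithmetic behind that estimate, the snippet below scales a token index proportionally from one vocabulary to another and builds a window of nearby indices to inspect. The round vocabulary sizes (50,000 and 100,000) are assumptions taken from the "50k" and "100k" shorthand; the real vocabularies are slightly larger. The function names and the window half-width of 500 are hypothetical, chosen only for illustration.

```python
def relative_position(index: int, vocab_size: int) -> float:
    """Fraction of the way through the token list (0.0 = start, 1.0 = end)."""
    return index / vocab_size


def scaled_index(index: int, src_vocab_size: int, dst_vocab_size: int) -> int:
    """Proportionally map an index from one vocabulary onto another."""
    return round(index * dst_vocab_size / src_vocab_size)


def candidate_window(center: int, vocab_size: int, half_width: int = 500) -> range:
    """Indices around `center` worth inspecting, clipped to the vocabulary."""
    lo = max(0, center - half_width)
    hi = min(vocab_size, center + half_width)
    return range(lo, hi)


# Assumed round sizes; substitute the real vocabulary sizes if known.
SRC_VOCAB = 50_000
DST_VOCAB = 100_000
PETERTODD_INDEX = 37_444  # index quoted in the comment above

fraction = relative_position(PETERTODD_INDEX, SRC_VOCAB)
target = scaled_index(PETERTODD_INDEX, SRC_VOCAB, DST_VOCAB)
window = candidate_window(target, DST_VOCAB)

print(f"relative position: {fraction:.3f}")
print(f"scaled index: {target}")
print(f"window: {window.start}..{window.stop - 1}")
```

This only says where to look first. Whether anomalous tokens actually cluster at matching relative positions across tokenizers is the open question, and it would need checking against the real token lists.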