It would also be odd as a glitch token. These are space-separated names, so most tokenizers will tokenize them separately rather than as a single token. And glitch tokens appear to be due to undertraining, but how could that possibly be the case for a phrase like “David Mayer”, which has so many instances across the Internet, and which data-curation processes have no apparent reason to filter out the way they often filter out the text behind glitch tokens?
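To illustrate the tokenization point, here is a minimal sketch of why a space-separated name is unlikely to be a single token. It assumes a GPT-2-style byte-pair-encoding setup in which text is first split into chunks by a regular expression and merges never cross chunk boundaries. The pattern below is a simplified, ASCII-only stand-in for such a pre-tokenization rule, and the function names are hypothetical; it is not the pattern of any particular production tokenizer.

```python
import re

# Simplified stand-in for a GPT-2-style pre-tokenization pattern.
# Assumption: real tokenizers use Unicode-aware classes and extra rules
# (e.g. for contractions); this ASCII-only version only shows the idea.
PRETOKEN_PATTERN = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+")


def pretokenize(text: str) -> list[str]:
    """Split text into the chunks inside which BPE merges are allowed."""
    return PRETOKEN_PATTERN.findall(text)


def can_be_single_token(phrase: str) -> bool:
    """Under this scheme, a phrase can be one token only if it is one chunk."""
    return len(pretokenize(phrase)) == 1


print(pretokenize("David Mayer"))          # ['David', ' Mayer']
print(can_be_single_token("David Mayer"))  # False
print(can_be_single_token(" Mayer"))       # True
```

Under these assumptions, “David Mayer” always arrives as at least two chunks, so no single vocabulary entry could cover the whole name, and there would be no one undertrained token to blame.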