Doesn’t strike me as inevitable at all, just a result of OpenAI using a similar method to build their tokenizer both times. (In both cases, that led to a few long strings being included as tokens even though they don’t actually appear frequently in large corpora.)
They presumably had already made the GPT-4 tokenizer long before SolidGoldMagikarp was discovered in the GPT-2/GPT-3 one.
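You can see the difference between the two tokenizer generations directly. A minimal sketch, assuming the tiktoken package (encoding names are tiktoken's; the exact token splits are what I'd expect, run it to confirm):

```python
import tiktoken

# Compare how the two tokenizer generations handle the best-known glitch string.
# The leading space matters, since BPE merges typically include it.
gpt2_enc = tiktoken.get_encoding("r50k_base")    # GPT-2 / GPT-3 era vocabulary
gpt4_enc = tiktoken.get_encoding("cl100k_base")  # GPT-3.5 / GPT-4 era vocabulary

s = " SolidGoldMagikarp"
print(gpt2_enc.encode(s))  # a single token id: the whole string is one vocabulary entry
print(gpt4_enc.encode(s))  # expected: several shorter token ids in the newer vocabulary
```

The point being that the newer vocabulary no longer has that particular string as a single token, but the same construction method produced its own rarely-seen long tokens.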
I interpret “under development” and “currently training” as having significantly different meanings.