An issue there is that you would be eating further into your context window by expanding it out: each of those words will take 1 or more BPEs, whereas I’m reasonably sure the letter-by-letter approach is guaranteed to be 1 letter = 1 BPE. You also make it more likely that decoding the answer will screw up: the more BPEs it takes to express an answer, the more likely top-k or top-p sampling will stochastically derail an otherwise-obviously-correct answer. (You can see the stochasticity at play in the completions, e.g. “shame” vs “shames”.)
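A quick sketch of the token-count point, assuming the `tiktoken` package with GPT-2’s BPE vocabulary as a stand-in for whatever tokenizer the model actually uses, and with the example words picked purely for illustration: space-separated letters encode as one BPE apiece, while whole words can cost one or more BPEs each.

```python
import tiktoken

# GPT-2 BPE vocabulary as a stand-in; the actual model's tokenizer may differ.
enc = tiktoken.get_encoding("gpt2")

# Letter-by-letter rendering: each space-prefixed letter is a single BPE.
letters = " s h a m e"
letter_ids = enc.encode(letters)
print(len(letter_ids), [enc.decode([t]) for t in letter_ids])

# Whole words (illustrative examples): each costs 1 or more BPEs.
for word in [" shame", " shames", " stochasticity"]:
    ids = enc.encode(word)
    print(repr(word), len(ids), [enc.decode([t]) for t in ids])
```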