This seems like it’d only work if the LM doesn’t generalize the supposed Waluigi Effect to include this token, i.e. a token that specifies “definitely true and factual for reals”. If some of the text ends up being wrong, for instance, it may quickly switch to “ah, now it is time for me to be sneakily wrong!”. And it always keeps around some probability that it’s now meant to be sneakily wrong, because “this token always specifies 100% true and factual for reals” is an incredibly unlikely hypothesis to hold about the token a priori, and there are other, far more plausible hypotheses which predict basically the same token dynamics.
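To make the prior-probability point concrete, here is a toy Bayesian sketch. Everything in it is an assumption for illustration: the two hypotheses ("honest" vs. "sneaky"), the 1% prior on the honest reading, the per-statement error rates, and the function name `posterior_honest`. It is not a claim about how an actual LM represents the token.

```python
def posterior_honest(prior_honest, observations,
                     p_err_honest=0.001, p_err_sneaky=0.2):
    """Track P(token really means 'always true') across observed statements.

    observations: iterable of bools, True meaning the statement was correct.
    The error rates are made-up illustrative numbers, not measurements.
    """
    p = prior_honest
    history = []
    for correct in observations:
        # Likelihood of this observation under each hypothesis.
        like_honest = (1 - p_err_honest) if correct else p_err_honest
        like_sneaky = (1 - p_err_sneaky) if correct else p_err_sneaky
        # Bayes' rule over the two competing hypotheses.
        p = p * like_honest / (p * like_honest + (1 - p) * like_sneaky)
        history.append(p)
    return history


# Ten correct statements after the token, then a single wrong one.
history = posterior_honest(0.01, [True] * 10 + [False])
for step, p in enumerate(history, start=1):
    print(f"after statement {step}: P(honest) = {p:.4f}")
```

Under these assumed numbers, correct text raises the honest hypothesis only slowly, the sneaky hypothesis never reaches zero, and one wrong statement pushes the honest hypothesis back below its starting prior.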