What is quite interesting about that dataset is that it contains strings of the form “*number*|*weirdstring*|*number*”, which I remember seeing in some methods of training LLMs, i.e. “|” being used as a delimiter between tokens. They could be poisoned training examples, or they could have some odd effect on retrieval.
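
For anyone who wants to check how common these strings are, here is a rough sketch of how one might scan for them. The regex is only my guess at the shape (digits, a pipe, a run of non-space and non-pipe characters, a pipe, digits), and `find_delimited_strings` is a made-up helper name, not something from the dataset's tooling.

```python
import re

# Assumed shape: digits, "|", a run of non-space, non-pipe characters, "|", digits.
# This is a guess at the format described above, not a documented specification.
PATTERN = re.compile(r"\d+\|[^|\s]+\|\d+")


def find_delimited_strings(texts):
    """Return (index, matched_string) pairs for every pattern hit in an iterable of strings."""
    hits = []
    for i, text in enumerate(texts):
        for match in PATTERN.finditer(text):
            hits.append((i, match.group(0)))
    return hits


if __name__ == "__main__":
    # Hypothetical rows, made up for illustration only.
    sample = ["normal sentence", "prefix 12|xQz#@!|907 suffix", "a|b"]
    print(find_delimited_strings(sample))
```

Counting hits per row, or looking at which rows they cluster in, would at least show whether these are a handful of oddities or a systematic part of the data.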