Similar question: Let’s start with an easier but I think similarly shaped problem.
We have two next-token predictors. Both are trained on English text, but each was trained on a slightly different corpus (let’s say the first one was trained on all arxiv papers and the other on all public domain literature), and each uses a different tokenizer (let’s say the arxiv one used a BPE tokenizer and the literature one used some unknown tokenization scheme).
Unfortunately, the tokenizer for the second corpus has been lost. You still have the tokenized dataset for the second corpus, and you still have the trained sequence predictor, but you’ve lost the token <-> word mapping. Also, due to lobbying, the public domain is no longer a thing, so you don’t have access to the original dataset to try to piece things back together.
You can still feed a sequence of integers encoding tokens to the literature next-token predictor, and it will spit out an integer corresponding to its prediction of the next token, but you don’t know what English words those tokens correspond to.
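To make the setup concrete, here’s a minimal sketch of the interface we’re stuck with. The `BlackBoxPredictor` class and `predict_next` method are hypothetical names I’m inventing for illustration; the toy bigram table stands in for the real trained model, since all that matters here is that integers go in and an integer comes out, with no word mapping in sight.

```python
# Hypothetical interface to the lost-tokenizer model: all we can do is
# pass integer token IDs in and get an integer token ID back. The bigram
# table below is a toy stand-in for the real trained predictor.
from collections import defaultdict

class BlackBoxPredictor:
    def __init__(self, bigram_counts):
        # bigram_counts: {prev_token_id: {next_token_id: count}}
        self.bigrams = bigram_counts

    def predict_next(self, token_ids):
        """Return the most likely next token ID given a sequence of IDs."""
        last = token_ids[-1]
        candidates = self.bigrams.get(last)
        if not candidates:
            return 0  # arbitrary fallback token
        return max(candidates, key=candidates.get)

# Toy tokenized corpus: we only ever see integers, never words.
tokenized_corpus = [[5, 17, 3, 17, 9], [5, 17, 9, 3]]
counts = defaultdict(lambda: defaultdict(int))
for seq in tokenized_corpus:
    for a, b in zip(seq, seq[1:]):
        counts[a][b] += 1

model = BlackBoxPredictor(counts)
print(model.predict_next([5, 17]))  # an integer whose word we don't know
```

The point of the sketch is the opacity: we can probe the predictor all day and build statistics over token IDs, but nothing in the interface tells us what English word any ID stands for.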
I expect that, in this situation, you could do something like “train a new sequence predictor on the tokenized versions of both corpora, so that the new predictor will hopefully use some shared machinery for next-token prediction across the two datasets, and then do the whole sparse autoencoder thing to try to tease apart what those shared abstractions are and build hypotheses”.
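To make the sparse-autoencoder step concrete, here’s a minimal numpy sketch. Everything in it is a stand-in: the activations are random data where the real pipeline would use hidden states collected by running both tokenized corpora through the joint predictor, and the dimensions, learning rate, and L1 coefficient are arbitrary choices, not tuned values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in activations: in the real setup these would be hidden states
# collected from the jointly trained predictor on both corpora.
d_model, d_dict, n = 16, 64, 400
acts = rng.normal(size=(n, d_model))

# One-layer sparse autoencoder: ReLU encoder with an L1 sparsity penalty
# on the codes, linear decoder, trained by plain gradient descent.
W_enc = 0.1 * rng.normal(size=(d_model, d_dict))
W_dec = 0.1 * rng.normal(size=(d_dict, d_model))
b_enc = np.zeros(d_dict)
lr, l1 = 1e-2, 1e-3

losses = []
for step in range(300):
    pre = acts @ W_enc + b_enc
    h = np.maximum(pre, 0.0)          # sparse feature activations
    err = h @ W_dec - acts            # reconstruction error
    losses.append(0.5 * (err ** 2).sum() / n + l1 * h.sum() / n)
    gh = (err @ W_dec.T + l1) * (pre > 0)   # backprop through ReLU + L1
    W_dec -= lr * (h.T @ err) / n
    W_enc -= lr * (acts.T @ gh) / n
    b_enc -= lr * gh.mean(axis=0)

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

The hope in the real version is that individual dictionary features fire on interpretable shared abstractions (say, “this token starts a sentence”) across both corpora, which gives you hypotheses about what the mystery tokens mean.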
Even in that “easy” case, though, I think it’s a bit harder than “just ask the LLM”, but the easy case is, I think, viable.