Given that the model eventually outputs the next token, shouldn’t the final embedding matrix be exactly your linear fit matrix multiplied by the probability of each state to output a given token? Could you use that?
Given that the model eventually outputs the next token, shouldn’t the final embedding matrix be exactly your linear fit matrix multiplied by the probability of each state to output a given token? Could you use that?