Very interesting. I’ll play around with the code next time I get the chance.
2.1)
Being able to solve crosswords requires you to know how long words are. I have no idea how common they were in the training data though. Aligning things in text files is sometimes desirable, Python files are supposed to limit line lengths to 80 characters, some Linux system files store text data in tables with whitespace to make things line up. ASCII art also uses specific line lengths.
My guess for linearity would be so that the sum of the vectors has the length of their concatenation e.g. to work out the lengths of sentences.
I wonder if “number of syllables” is a feature, and whether this is consistent between languages?
2.2)
If the language model has finished outputting a word, it needs to be able to guarantee a space comes next to avoid writingtextlikethis. I guess one would expect tokens to be close in the embedding to their copies with spaces in front, so to control the position of spaces in text the model would like a separate direction to encode that information.
Very interesting. I’ll play around with the code next time I get the chance.
2.1)
Being able to solve crosswords requires you to know how long words are. I have no idea how common they were in the training data though. Aligning things in text files is sometimes desirable, Python files are supposed to limit line lengths to 80 characters, some Linux system files store text data in tables with whitespace to make things line up. ASCII art also uses specific line lengths.
My guess for linearity would be so that the sum of the vectors has the length of their concatenation e.g. to work out the lengths of sentences.
I wonder if “number of syllables” is a feature, and whether this is consistent between languages?
2.2)
If the language model has finished outputting a word, it needs to be able to guarantee a space comes next to avoid writingtextlikethis. I guess one would expect tokens to be close in the embedding to their copies with spaces in front, so to control the position of spaces in text the model would like a separate direction to encode that information.