Learning Written Hindi From Scratch

An exercise in learning written Hindi from written text alone. I did this a year ago and copied it from my notes, so apologies if it’s a bit rough.
I recently had a discussion about languages with an Indian friend in my dorm. He argued that logographic scripts like Chinese are less prone to meaning drift, that their symbols are more fundamentally attached to their meaning, and that you could basically bootstrap an understanding of written Chinese without ever learning the spoken language. Up to that point I basically agreed, but said I would expect the same (at least in theory) to be true for any written language, thinking of Huffman coding, Solomonoff induction, and LLMs in particular.
I argued that the longer the chains of text/n-grams you learn to predict, the better you get at actually understanding things. Knowing the frequency of individual letters seems pointless, but knowing the probabilities of 5-letter sequences already makes you pretty proficient at spelling. 20-letter sequences make you decent at grammar. Finally, at 100-letter sequences you get something resembling coherence.
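To make the n-gram picture concrete, here is a minimal sketch in Python of the kind of character n-gram statistics I had in mind (the toy corpus and names are mine for illustration, not from the experiment):

```python
from collections import Counter

def ngram_counts(text: str, n: int) -> Counter:
    """Count all overlapping character n-grams in a text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

# Toy corpus; a real experiment would use a whole book.
corpus = "the cat sat on the mat and the dog sat on the log"
for n in (1, 3, 5):
    print(n, ngram_counts(corpus, n).most_common(3))
```

Predicting from the 1-gram counts only tells you which letters are common; predicting from longer n-grams starts to pin down spelling, then grammar, then meaning.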
I noticed I didn’t have a good intuition for how far this method gets you in practice if you can’t read the entire internet like an LLM. But testing this is cheap! So I said he should send me a book in Hindi, and I would try to find the most impressive thing I could learn from just the text in 30 minutes. As it turns out, the task was pretty hard, but also fun! I ended up agreeing with my friend that this is probably slightly easier for logographic languages.
Exercise (Optional): Grab this text. Now set a 10-minute timer and try to learn as much Chinese as you can. Take notes and put your takeaways in the comments inside spoiler tags. For comparison, repeat the exercise on this text in Hindi, whose script is syllable-based. If this felt rushed but fun, try another 30 minutes on each text. Consider the rest of this post a spoiler.
Here are my notes from when I did this exercise:
Actually, going over data multiple times seems like the most efficient way to get the most out of it? I vaguely remember lots of AI researchers on podcasts saying something like “our methods are not good enough, they need to go over data multiple times!”, and I’m not sure this is actually a problem. I went over the book multiple times just to look for features in the text I could easily exploit for future learning, and each time I extracted different information that I could only get after having gone through it once already.
The more pieces you have, the easier it gets. Everything that was already familiar was the most helpful to exploit; I cheated all the way:
recognizing the structure of books
knowing that spaces probably separate words
knowing that frequent and short words tend to be more basic
knowing the structure of how humans reason: the word for “AND” probably sits between the things it connects, not at the end (serving as glue as well as a boundary marker), so seeing a word at the end of a sentence makes it less likely to be the word for “AND” (see the sketch after this list)
tradeoff: sometimes I focused on a thing that did not occur often in the text, but a particular squiggle (like a sideways figure eight) was really easy to spot when scanning the whole page, so such features can be exploited very quickly and are worth attention early on.
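Here is a minimal sketch of what the “short, frequent, not sentence-final” heuristics might look like as code, assuming the text marks sentence ends with the Devanagari danda ‘।’ or Western punctuation (the function name and thresholds are my own):

```python
import re
from collections import Counter

def word_stats(text: str, top: int = 20) -> None:
    """Rank words by frequency and report how often each ends a sentence."""
    # Assumption: sentences end with the danda '।' or Western punctuation.
    sentences = re.split(r"[।?!.]", text)
    freq, at_end = Counter(), Counter()
    for s in sentences:
        words = s.split()
        freq.update(words)
        if words:
            at_end[words[-1]] += 1
    for w, n in freq.most_common(top):
        # Short, frequent words that rarely end a sentence are candidate
        # connectives like "and". len() counts code points, which is crude
        # for Devanagari but fine as a rough length signal.
        print(f"{w}: count={n}, length={len(w)}, ends a sentence {at_end[w] / n:.0%}")
```

A word near the top of the frequency list that almost never closes a sentence is a decent bet for a connective or particle.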
There were so many avenues for progress on this, and a lot of the effort went into deciding what to try next. It was fun.
I didn’t get much further than recognizing a question word like “what”, which was pretty easy to do because the text I used also used question marks.
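In code terms, that trick reduces to ranking words by how often their occurrences fall in question-marked sentences; a sketch under the same punctuation assumptions as before (names and cutoffs are mine):

```python
import re
from collections import Counter

def question_words(text: str, top: int = 10, min_count: int = 5) -> list[str]:
    """Find words over-represented in sentences that end with '?'."""
    # Keep each sentence's terminator so we know whether it was a question.
    sentences = re.findall(r"[^।?!.]+[।?!.]", text)
    in_question, overall = Counter(), Counter()
    for s in sentences:
        words = s[:-1].split()
        overall.update(words)
        if s.endswith("?"):
            in_question.update(words)
    # Rank by the fraction of a word's occurrences that fall in questions,
    # ignoring rare words to keep the ratios meaningful.
    ranked = sorted(
        (w for w in in_question if overall[w] >= min_count),
        key=lambda w: in_question[w] / overall[w],
        reverse=True,
    )
    return ranked[:top]
```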
Unfortunately, I can’t check how far I would get with Chinese, because I have already learned Japanese. I considered trying Egyptian hieroglyphs, but couldn’t find any long texts that weren’t accompanied by spoiler material teaching me how to read them. If you have any links handy, let me know!