My current outlook on LLMs is that they are some combination of bullshit to fool people who are looking to be fooled, and a modest but potentially very important improvement in the capacity to search large corpuses of text in response to uncontroversial natural-language queries and automatically summarize the results. Beyond this, I think they’re massively overhyped. The most aggressive hype is that they are an AGI development project—in other words, that they’re close to being conscious, generative minds on the same order as ours, which can do as wide a range of tasks as a human. This is clearly false. The more moderate hype is that they can do meaningful generative work within the domain where they were trained: written language content (which can of course be converted to and from audio language content pretty well). For instance, they might in some limited sense be able to internally represent the content of the language they’re indexing and reproducing. This would necessarily entail the capacity for “regular expressions for natural language.” I believe that even this much more limited characterization is false, but I am less confident in this case, and there are capacities they could demonstrate that would change my mind. Language learning software seems like a good example. It seems to me that if LLMs contain anything remotely like the capacity of regular expressions for natural language that take into account the semantic values of words, they should make it relatively easy to create a language learning app that is strictly better than the best existing automated resources for smartphone users trying to learn the basics of a new-to-them language.
The consensus recommendations for a way to learn the very basics of a spoken language with relatively low time investment—filling the gap that another audiobook or podcast might fill—seem to be the Pimsleur or Paul Noble audio courses, both of which I’ve tried. They satisfy the following desiderata:
Not a phrasebook: New words and grammatical forms are introduced and explained in a logical series, so that later learning builds on earlier learning, and each incremental package of information is as small as possible.
No nonsense: Words are combined into sentences that make sense, and sentences are eventually combined in ways that are contextually appropriate. For example, the user should never be asked to form the sentence “the elephant is taking a shower,” except in specific contexts that make that sentence an exceptionally likely one. (Duolingo fails this criterion.)
Reuse: Already-learned words are repeated in new contexts and combinations (flashcards fail this criterion), which helps with:
Spaced repetition: At first, a new word is used several times in a relatively short interval. Then it’s occasionally brought up again, often enough to make it easy to retain material at minimal review cost.
Prioritization: Common and simple words come first, as do the ones that the user is most likely to need even as a very basic speaker (e.g. times of day, and the words a tourist needs for meals and hotels).
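The spaced-repetition criterion above can be made concrete with a small scheduling sketch. The function name `review_positions`, the three closely spaced early repeats, and the doubling gap afterwards are all assumptions chosen for illustration; nothing here is taken from how Pimsleur or Paul Noble actually sequence their material.

```python
def review_positions(intro_index, total_items, dense_repeats=3,
                     first_gap=2, growth=2.0):
    """Return the positions in a lesson sequence at which a word first
    introduced at intro_index should reappear.

    Hypothetical scheme: the word comes back every first_gap items for
    the first dense_repeats reviews, then the gap is multiplied by
    growth after each later review. Assumes first_gap >= 1.
    """
    positions = []
    gap = first_gap
    pos = intro_index
    repeats = 0
    while True:
        pos += int(gap)
        if pos >= total_items:
            break
        positions.append(pos)
        repeats += 1
        if repeats >= dense_repeats:
            gap *= growth  # widen the interval once the dense phase is over
    return positions


if __name__ == "__main__":
    # A word introduced at item 0 of a 100-item course.
    print(review_positions(0, 100))
```

A static audio course effectively bakes one such schedule into the recording for every listener; the interesting question for an interactive app is how to change it per user.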
The main limit of the Pimsleur and Paul Noble courses is that they are static. This means that they can’t adapt to the learner’s particular needs or conditions. Making an app interactive increases its complexity and thus the difficulty of producing it at a given level of quality. Most popular interactive language app developers have responded to this problem by reducing the complexity of the material presented to the user, so their apps frequently do not even satisfy all of the above criteria. My friend Micah and his cofounder Ofir created a program, LanguageZen, that satisfies the above desiderata, and additionally uses automation to generate new material with these additional virtues:
Automatic adaptive prioritization: The program evaluates the learner’s responses, identifies which specific words or grammatical concepts they’re having trouble with, and prioritizes these for more frequent review.
Specialized content libraries: They built a variety of libraries of topic-specific material that the user can select from depending on their needs and interests (e.g. ordering in restaurants, business language, etc.), which are then integrated with what the user has already learned.
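As a rough illustration of what automatic adaptive prioritization involves, here is a minimal sketch that tracks per-item error rates and surfaces the weakest items for earlier review. The class `ErrorTracker`, its methods, and the smoothing rule are assumptions made for this example; this is not a description of LanguageZen's actual algorithm.

```python
from collections import defaultdict


class ErrorTracker:
    """Hypothetical per-item record of how often the learner gets a
    word or grammatical concept wrong."""

    def __init__(self):
        self.attempts = defaultdict(int)
        self.errors = defaultdict(int)

    def record(self, item, correct):
        self.attempts[item] += 1
        if not correct:
            self.errors[item] += 1

    def error_rate(self, item):
        # Smoothed estimate, (errors + 1) / (attempts + 2), so that a
        # single miss on a rarely seen item does not dominate the ranking.
        return (self.errors.get(item, 0) + 1) / (self.attempts.get(item, 0) + 2)

    def priorities(self, top_n=5):
        """Items seen so far, highest estimated error rate first
        (ties broken alphabetically)."""
        ranked = sorted(self.attempts, key=lambda i: (-self.error_rate(i), i))
        return ranked[:top_n]


if __name__ == "__main__":
    tracker = ErrorTracker()
    for item, correct in [("ser", False), ("ser", False),
                          ("casa", True), ("casa", True),
                          ("estar", False), ("estar", True)]:
        tracker.record(item, correct)
    print(tracker.priorities(top_n=2))
```

The bookkeeping is the easy part. Deciding which word or concept a wrong answer should be attributed to is where the linguistic expertise goes.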
LanguageZen was initially developed on a scrappy startup budget, and the team built two excellent products: Spanish for English speakers, and English for Portuguese speakers. But their development effort necessarily involved the up-front capital cost of hiring skilled linguists to shape the material, and because not everyone wants to learn the same language, two language offerings were simply not enough to take off virally, since friends could only effectively recommend LanguageZen to friends who wanted to learn the same language. (By contrast, someone who likes Duolingo for German can recommend it to their friend who wants to learn French or Hebrew or Chinese, not just their friend who wants to learn German.) So while their product was good enough to attract and retain a significant user base, the project won’t take off until and unless investors step up to help them over that hurdle.
But if LLMs can meaningfully and usefully generate new structured language material, they should make it much easier not only to extend the capacities of LanguageZen into new languages and expand its static content libraries, but to implement the following improvements:
Adapting spaced repetition to interruptions in usage: Even without parsing the user’s responses (which would make this robust to difficult audio conditions), if the user rewinds or pauses on some answers, the app should be able to infer that the user is having some difficulty with the relevant material, and dynamically generate new content that repeats those words or grammatical forms sooner than the default. Likewise, if the user takes a break for a few days, weeks, or months, the ratio of old to new material should automatically adjust accordingly, as forgetting is more likely, especially of relatively new material. (And of course with speech to text, an interactive app that interpreted responses from the user could and should be able to replicate LanguageZen’s ability to specifically identify (and explain) which part of a user’s response was incorrect, and why, and use this information to adjust the schedule on which material is reviewed or introduced.)
Automatic customization of content through passive listening: I should be able to turn the app onto “listen” mode during a conversation with speakers of a foreign language. For instance, I study Tai Chi with some Chinese speakers, few of whom speak much English. So my teacher has limited ability to instruct me verbally, and I can’t follow much of the conversation when I break for lunch. I should be able to set the app to “listen” mode, and it should be able to identify words and concepts that come up frequently in such conversations, and related words and concepts, in order to generate new material that introduces these, with timing and context that satisfies all the above criteria, without retaining a transcript or recording of those conversations (to satisfy privacy concerns).
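The adjustment for breaks in usage described above is the kind of thing ordinary code can handle. The following is a minimal sketch under stated assumptions: the names `review_share` and `split_session`, the exponential forgetting curve, and the default half-life and share values are all hypothetical, chosen only to show the shape of the rule. It covers the break case only, not the inference from pauses and rewinds.

```python
def review_share(days_since_last_session, base_share=0.3,
                 max_share=0.9, half_life_days=7.0):
    """Fraction of the next session to spend on already-seen material.

    Assumes retention halves every half_life_days, and moves the
    review share from base_share towards max_share as more is
    presumed forgotten.
    """
    if days_since_last_session < 0:
        raise ValueError("days_since_last_session must be non-negative")
    forgotten = 1.0 - 0.5 ** (days_since_last_session / half_life_days)
    return base_share + (max_share - base_share) * forgotten


def split_session(n_items, days_since_last_session):
    """Return (number of review items, number of new items)."""
    n_review = round(n_items * review_share(days_since_last_session))
    return n_review, n_items - n_review


if __name__ == "__main__":
    for days in (0, 7, 30, 180):
        print(days, split_session(20, days))
```

A fuller version would presumably use a separate half-life per item, since recently introduced material is forgotten faster than material that has already survived several reviews.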
Specifically, a rules-based system tracking the above considerations could detect the need to insert additional content into the sequence, and instruct an LLM to generate that content within well-specified parameters. For instance, it might give the LLM a prompt equivalent to “generate twenty sentences, limited to [range of grammatical forms] and [list of already-learned vocabulary], each of which uses at least one word from [list of prioritized words].” Then it could implement some mixture of asking the user to form those sentences in the target language, and asking the user to translate those sentences from the target language. More complex requests like constructing short conversations may also be feasible.
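To make the division of labor concrete, here is a minimal sketch of the rules-based side: it fills in the prompt template and then checks whatever comes back against the same constraints. The function names `build_prompt`, `tokenize`, and `check_sentence` are assumptions for illustration, and no particular LLM API is called. The check matches surface word forms only, so it would wrongly reject inflected forms of known words; a real system would need lemmatization and some way to verify the grammatical-form constraint, which is not attempted here.

```python
import re


def build_prompt(n, grammar_forms, known_vocab, priority_words):
    """Fill in the prompt template from the rules-based system's state.
    known_vocab and priority_words are sets of lowercase words."""
    allowed = sorted(known_vocab | priority_words)
    return (
        f"Generate {n} sentences. "
        f"Use only these grammatical forms: {', '.join(grammar_forms)}. "
        f"Use only words from this list: {', '.join(allowed)}. "
        f"Every sentence must contain at least one of: "
        f"{', '.join(sorted(priority_words))}."
    )


def tokenize(sentence):
    # Runs of letters (including accented ones), lowercased.
    return re.findall(r"[^\W\d_]+", sentence.lower())


def check_sentence(sentence, known_vocab, priority_words):
    """Return (ok, unknown_words) for one generated sentence."""
    words = tokenize(sentence)
    allowed = known_vocab | priority_words
    unknown = [w for w in words if w not in allowed]
    has_priority = any(w in priority_words for w in words)
    return (not unknown) and has_priority, unknown


if __name__ == "__main__":
    known = {"el", "la", "está", "dónde", "casa"}
    priority = {"hotel"}
    print(build_prompt(20, ["present tense"], known, priority))
    print(check_sentence("¿Dónde está el hotel?", known, priority))
    print(check_sentence("El elefante está en la casa", known, priority))
```

Rejecting and regenerating sentences that fail the check keeps the vocabulary constraints enforceable without trusting the model, but it does nothing for the “no nonsense” criterion, which is the part that currently needs a human.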
My impression is that current AI technology is simply not good enough to implement a high-quality version of this product, between two commonly spoken languages with large text corpuses, without a huge time investment from experts carefully shaping and vetting its material and effectively curating static topic libraries within which the automation could at best make minor or highly supervised, human-in-the-loop variations. Someone might be able to make a lot of money changing my mind.
ETA:
I think a lot of people mistaking LLMs for minds are simply underestimating the potential of a deeply literate culture, for which LLMs are a substitute. The Hávamál gets the approximate magnitude of the value of knowing runes correct, though it—as a poem from a nonliterate culture—naturally doesn’t get the details correct. Here are the “songs” Odin knows immediately after learning the runes:
145.
Those songs I know, which nor sons of men
nor queen in a king’s court knows;
the first is Help which will bring thee help
in all woes and in sorrow and strife.
146.
A second I know, which the son of men
must sing, who would heal the sick.
147.
A third I know: if sore need should come
of a spell to stay my foes;
when I sing that song, which shall blunt their swords,
nor their weapons nor staves can wound.
148.
A fourth I know: if men make fast
in chains the joints of my limbs,
when I sing that song which shall set me free,
spring the fetters from hands and feet.
149.
A fifth I know: when I see, by foes shot,
speeding a shaft through the host,
flies it never so strongly I still can stay it,
if I get but a glimpse of its flight.
150.
A sixth I know: when some thane would harm me
in runes on a moist tree’s root,
on his head alone shall light the ills
of the curse that he called upon mine.
151.
A seventh I know: if I see a hall
high o’er the bench-mates blazing,
flame it ne’er so fiercely I still can save it, --
I know how to sing that song.
152.
An eighth I know: which all can sing
for their weal if they learn it well;
where hate shall wax ’mid the warrior sons,
I can calm it soon with that song.
153.
A ninth I know: when need befalls me
to save my vessel afloat,
I hush the wind on the stormy wave,
and soothe all the sea to rest.
154.
A tenth I know: when at night the witches
ride and sport in the air,
such spells I weave that they wander home
out of skins and wits bewildered.
155.
An eleventh I know: if haply I lead
my old comrades out to war,
I sing ’neath the shields, and they fare forth mightily
safe into battle,
safe out of battle,
and safe return from the strife.
156.
A twelfth I know: if I see in a tree
a corpse from a halter hanging,
such spells I write, and paint in runes,
that the being descends and speaks.
157.
A thirteenth I know: if the new-born son
of a warrior I sprinkle with water,
that youth will not fail when he fares to war,
never slain shall he bow before sword.
158.
A fourteenth I know: if I needs must number
the Powers to the people of men,
I know all the nature of gods and of elves
which none can know untaught.
159.
A fifteenth I know, which Folk-stirrer sang,
the dwarf, at the gates of Dawn;
he sang strength to the gods, and skill to the elves,
and wisdom to Odin who utters.
160.
A sixteenth I know: when all sweetness and love
I would win from some artful wench,
her heart I turn, and the whole mind change
of that fair-armed lady I love.
161.
A seventeenth I know: so that e’en the shy maiden
is slow to shun my love.
162.
These songs, Stray-Singer, which man’s son knows not,
long shalt thou lack in life,
though thy weal if thou win’st them, thy boon if thou obey’st them
thy good if haply thou gain’st them.
163.
An eighteenth I know: which I ne’er shall tell
to maiden or wife of man
save alone to my sister, or haply to her
who folds me fast in her arms;
most safe are secrets known to but one-
the songs are sung to an end.
I use ChatGPT and Claude to try to learn Macedonian, because there is very little learning material available for that language. For example, they can (with occasional errors) explain grammatical concepts or give me sentences to translate. I have not found a good way of storing a description of my abilities and weaknesses across conversations, but within a conversation they are good at adapting the difficulty of the questions to the quality of my answers.
Unfortunately I’m not aware of any tools that can pronounce or transcribe Macedonian.
Edit: I still do most of my language learning using Anki, Google Translate and a book for learning Macedonian, probably because using LLMs is not sufficiently gamified and because of the small inconvenience of having to ask for questions instead of simply doing one exercise after another.
Seems like this one is mostly a matter of schlep rather than capability. The abilities you would need to make this happen are:
1. Have a highly granular curriculum for what vocabulary and what skills are required to learn the language, and a plan for what order to teach them in / what spaced repetition schedule to aim for
2. Have a granular and continuously updated model of the user’s current knowledge of vocabulary, rules of grammar and acceptability, idioms, and any phonemes or phoneme sequences they have trouble with
3. Given specific highly granular learning goals (e.g. “understanding when to use preterite vs imperfect when conjugating saber” in Spanish) within the curriculum and the model of the user’s knowledge and abilities, produce exercises which teach / evaluate those specific skills
4. Determine whether the user had trouble with the exercise, and if so what the trouble was
5. Based on the type of trouble the user had, describe what updates should be made to the model of the user’s knowledge and vocabulary
6. Correctly apply the updates from (5)
7. Adapt to deviations from the spaced repetition plan (tbh this seems like the sort of thing you would want to do with normal code)
I expect that the hardest things here will be 1, 2, and 6, and I expect them to be hard because of the volume of required work rather than the technical difficulty. But I also expect the LanguageZen folks have already tried this and could give you a more detailed view about what the hard bits are here.
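For a sense of what the “normal code” parts might look like, here is a minimal sketch of the user model, the step that applies updates to it, and the selection of the next learning goals from the curriculum. Everything in it is an assumption for illustration: the names `SkillState` and `LearnerModel`, the single mastery number per skill, and the exponential-moving-average update. Producing exercises and diagnosing what went wrong in a response are deliberately left out, since those are the parts that would need an LLM or a linguist.

```python
from dataclasses import dataclass, field


@dataclass
class SkillState:
    mastery: float = 0.0  # 0.0 = unseen or failing, 1.0 = reliable
    attempts: int = 0


@dataclass
class LearnerModel:
    """Hypothetical model of what the user currently knows."""
    skills: dict = field(default_factory=dict)

    def update(self, skill, had_trouble, rate=0.3):
        """Move the mastery estimate for one skill towards 0 after
        trouble and towards 1 after a clean answer."""
        state = self.skills.setdefault(skill, SkillState())
        target = 0.0 if had_trouble else 1.0
        state.mastery += rate * (target - state.mastery)
        state.attempts += 1

    def next_goals(self, curriculum, n=3, threshold=0.8):
        """Walk the curriculum in teaching order and return the first
        n skills whose mastery is still below threshold."""
        goals = []
        for skill in curriculum:
            state = self.skills.get(skill, SkillState())
            if state.mastery < threshold:
                goals.append(skill)
            if len(goals) == n:
                break
        return goals


if __name__ == "__main__":
    curriculum = ["present tense of ser",
                  "present tense of estar",
                  "preterite vs imperfect of saber"]
    model = LearnerModel()
    for _ in range(10):
        model.update("present tense of ser", had_trouble=False)
    model.update("present tense of estar", had_trouble=True)
    print(model.next_goals(curriculum, n=2))
```

Even in this toy form, the volume problem is visible: the code is short, but the curriculum list it walks has to be written and ordered by someone who knows the language.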
This sounds like either a privacy nightmare or a massive battery drain. The good language models are quite compute intensive, so running them on a battery-powered phone will drain the battery very fast. Especially since this would need to hook into the “granular model of what the user knows” piece.