The approach of the linked article tries to match words that mean the same thing across languages by separately building a vector embedding of each language’s corpus and then looking for structural (neighborhood) similarity between the embeddings, with an extra global ‘rotation’ step mapping the two vector spaces onto one another.
So if both languages have a word for “cat”, and many other words related to cats, and the relationship between these words is the same in both languages (e.g. ‘cat’ is close to ‘dog’ in a different way than it is close to ‘food’), then these words can be successfully translated.
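To make the ‘rotation’ step concrete, here is a minimal sketch of the core mapping computation (my own toy code, not the article’s actual pipeline, and the function names are mine). It assumes we somehow already have a set of candidate word pairs to fit the rotation on; the unsupervised methods have to guess those pairs and then refine them by iterating a step like this.

```python
import numpy as np

def fit_rotation(X, Y):
    """Orthogonal Procrustes: find the rotation W minimizing ||X @ W - Y||_F.

    X and Y are (n_pairs, dim) arrays of embeddings for words assumed to
    translate to each other. Constraining W to be orthogonal preserves the
    neighborhood structure of the source space.
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def translate(source_vec, W, target_words, target_vecs):
    """Map one source-language vector into the target space and return the
    nearest target word by cosine similarity."""
    mapped = source_vec @ W
    sims = target_vecs @ mapped / (
        np.linalg.norm(target_vecs, axis=1) * np.linalg.norm(mapped) + 1e-9
    )
    return target_words[int(np.argmax(sims))]
```

The interesting question is where those candidate pairs come from when there is no dictionary at all, which is exactly where the overlap issues below start to bite.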
But if one language has a tiny vocabulary compared to the other one, and the vocabulary isn’t even a subset of the other language’s (dolphins don’t talk about cats), then you can’t get far. Unless you have an English training dataset that only uses words that do have translations in Dolphin. But we don’t know what dolphins talk about, so we can’t build this dataset.
Also, this is machine learning on text with distinct words; do we even have a ‘separate words’ parser for dolphin signals?
If Language A and Language B have word embeddings that partially overlap and partially don’t, that doesn’t necessarily mean it’s impossible to match the part that does overlap. After all, that always happens to some extent, even between English and French (i.e. not every English word corresponds to a single French word or vice-versa), but the matching is still possible. It would obviously be a much much more extreme non-overlap for English vs dolphin, and that certainly makes it less likely to work, but doesn’t prove it impossible. (It might require changing the algorithm somewhat, or generating hundreds of different candidate translations and narrowing them down, etc.)
I don’t think “distinct words” is necessarily an unsolvable problem. Many languages have complicated boundaries between words, so there are more flexible tokenization methods that don’t rely on language-specific structure like “words”: basically, they just keep merging the most common recurring patterns of consecutive phonemes / letters into tokens until they reach a predetermined vocabulary size, or something like that, if I recall. You would also need some sort of clustering algorithm to lump the sounds into discrete phonemes in the first place (assuming, of course, that dolphins are communicating with discrete phonemes).
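For what it’s worth, the merging scheme I have in mind is roughly byte-pair encoding. A toy sketch (mine, with a made-up function name), which assumes we have already discretized the audio into some symbol sequence; getting to that discrete sequence is the genuinely hard part:

```python
from collections import Counter

def learn_merges(sequences, target_vocab_size):
    """Toy byte-pair-encoding-style tokenizer learner.

    `sequences` is a list of symbol lists (e.g. cluster labels standing in
    for phoneme-like units, as strings). We repeatedly merge the most
    frequent adjacent pair into a new composite token until the vocabulary
    reaches the target size.
    """
    vocab = {sym for seq in sequences for sym in seq}
    merges = []
    while len(vocab) < target_vocab_size:
        pair_counts = Counter(
            (seq[i], seq[i + 1]) for seq in sequences for i in range(len(seq) - 1)
        )
        if not pair_counts:
            break
        (a, b), _ = pair_counts.most_common(1)[0]
        merged = a + b                       # new composite token
        merges.append((a, b))
        vocab.add(merged)
        # Rewrite every sequence with the new token.
        rewritten = []
        for seq in sequences:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                    out.append(merged)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            rewritten.append(out)
        sequences = rewritten
    return merges, sequences
```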
So my feeling is “very unlikely to work, but not impossible” … conditioned on dolphins having a complex representational language vaguely analogous to human language (which I don’t know anything about either way).
I think the disparity in number of words is proportionally so large that this method won’t work. The (small) hypothetical set of dolphin words wouldn’t match to a small subset of English words, because what’s being matched is really the (embedded) structure of the relationship between the words, and any sufficiently small subset of English words loses most of its interesting structure because its ‘real’ structure relates it to many words outside that subset.
Suppose that dolphins (hypothetically! counterfactually! not realistically!) use only 10 words to talk about fish, while humans use 100 words to do the same. I expect you can’t match the relationship structure of the 10 dolphin words to the much more complex structure of the 100 human words. And no subset of ~10 of those 100 English words is a meaningful vocabulary that humans could actually use to talk about fish on its own.
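To illustrate the structural point with a toy example (made-up random vectors, not real data): when you restrict attention to a small subset of a vocabulary, each word’s nearest neighbor within the subset is generally not its nearest neighbor in the full space, because most of the structure that identifies the word lives in its relations to words outside the subset.

```python
import numpy as np

rng = np.random.default_rng(0)
full_vecs = rng.normal(size=(100, 50))             # pretend: 100 'fish words', 50-dim embeddings
subset = rng.choice(100, size=10, replace=False)   # pretend: the 10 concepts dolphins also have

def nearest_neighbor(i, candidates, vecs):
    """Index of the vector most similar (by dot product) to vecs[i] among `candidates`, excluding i."""
    sims = vecs[candidates] @ vecs[i]
    sims[candidates == i] = -np.inf
    return candidates[int(np.argmax(sims))]

changed = sum(
    nearest_neighbor(i, np.arange(100), full_vecs) != nearest_neighbor(i, subset, full_vecs)
    for i in subset
)
print(f"{changed}/10 subset words have a different nearest neighbor inside the subset "
      "than in the full vocabulary")
```

Random vectors make this almost trivially true, but the point is the same for real embeddings: the 10-word subset’s internal geometry isn’t the geometry the matching algorithm would need.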
So that’s the blocker I mentioned. OK, thanks. Well, maybe we could make a translator between whales and dolphins then.
Or we could make a translator between a corpus of scuba diver conversations and dolphins.
We might be able to parse dolphin signals into separate words using ordinary unsupervised learning, no?
Why does the relative size of the vocabularies matter? I’d guess it would be irrelevant; the main factor would be how much overlap the two languages have. Maybe the absolute (as opposed to relative) sizes would matter.
Thanks, I found that explanation very helpful.