It’s conceivable that the ways characters/words are used in English and Alienese correspond strongly enough that you could guess matching words much better than chance. But I’m not confident that you’d have high accuracy.
Consider encryption. If you encrypted messages by mapping each character to the same substitute every time, e.g. ‘d’ always gets mapped to ‘6’, then this can be broken with decent accuracy by comparing the frequency statistics of characters in your messages with the frequency statistics of characters in the English language.
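To make the character-frequency idea concrete, here is a minimal sketch in Python. The function names and the letter ordering in ENGLISH_BY_FREQUENCY are my own assumptions for illustration (published frequency tables differ slightly), and a pure rank-matching guess like this will only be roughly right, and only on reasonably long messages.

```python
from collections import Counter

# Approximate ordering of English letters, most to least frequent.
# Assumption for illustration: published tables differ slightly.
ENGLISH_BY_FREQUENCY = "etaoinshrdlcumwfgypbvkjxqz"


def guess_substitution(ciphertext: str) -> dict:
    """Guess a cipher-symbol -> English-letter mapping by matching frequency ranks."""
    # Count every non-whitespace symbol (punctuation, if any, is counted too).
    counts = Counter(ch for ch in ciphertext if not ch.isspace())
    # Most frequent symbol first; ties are broken by the symbol itself so the
    # result is deterministic.
    ranked = sorted(counts, key=lambda ch: (-counts[ch], ch))
    # Pair the i-th most common cipher symbol with the i-th most common letter.
    return dict(zip(ranked, ENGLISH_BY_FREQUENCY))


def decode(ciphertext: str, mapping: dict) -> str:
    """Apply a guessed mapping, leaving unmapped symbols (e.g. spaces) unchanged."""
    return "".join(mapping.get(ch, ch) for ch in ciphertext)
```

In practice you would treat the output as a first guess and then fix it up by hand or with a search, since letters with similar frequencies are easily swapped.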
If you mapped whole words to strings instead of character to character, you could use frequency statistics for whole words in the English language.
Then, between languages, this mostly gets way harder, but you might be able to make some informed guesses, based on:
how often you expect certain concepts to be referred to (frequency statistics, although even between human languages, there are probably very important differences)
guesses about extremely common words like ‘a’, ‘the’, ‘of’
possible grammars
similar words being written similarly, like verb tenses of the same verb, noun and verb forms of the same word, etc.
(EDIT) Fine-grained associations between words, e.g. if a given word is used in a random sentence, how often another given word is used in that same sentence. Do this for all ordered pairs of words.
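The last item, the pairwise association statistics, could be computed roughly like this. This is only a sketch under my own assumptions: the function name is hypothetical, sentences are split on whitespace, and words are lowercased. It only collects the statistics for one language; matching them up across two languages is the hard part and isn't shown.

```python
from collections import Counter
from itertools import permutations


def cooccurrence_rates(sentences):
    """Return {(w, v): fraction of sentences containing w that also contain v}."""
    word_count = Counter()  # number of sentences each word appears in
    pair_count = Counter()  # number of sentences each ordered pair appears in
    for sentence in sentences:
        # Use a set so a word repeated within one sentence is counted once.
        words = set(sentence.lower().split())
        word_count.update(words)
        # All ordered pairs of distinct words in this sentence.
        pair_count.update(permutations(words, 2))
    return {(w, v): n / word_count[w] for (w, v), n in pair_count.items()}
```

Note that the table is asymmetric: with the sentences "the cat sat", "the dog sat" and "a cat", every sentence containing ‘dog’ also contains ‘the’, but only half of the sentences containing ‘the’ contain ‘dog’.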
An AI might use similar facts or others, and many more, about much more fine-grained and specific uses of words and associations, to guess, but I’m not sure an LLM token predictor trained mostly just on the two languages would do a good job.
EDIT: Unsupervised machine translation, as Steven Byrnes pointed out, seems to be on a better track.
Also, I would add that LLMs trained without perception of things other than text don’t really understand language. The meanings of the words aren’t grounded, and I imagine it could be possible to swap some in a way that would mostly preserve the associations (nearly isomorphic), but I’m not sure.