I’m glad others are trying this out. I crossposted this over on the Voynich Ninja forum:
https://www.voynich.ninja/thread-3977.html
and user MarcoP already noticed that Bing AI’s “Voynichese” doesn’t follow VMS statistics in one obvious respect: “The continuation includes 56 tokens: in actual Voynichese, an average of 7 of these would be unique word-types that don’t appear elsewhere” whereas “The [Bing AI] continuation is entirely made up of words from Takahashi’s transliteration.” So it’s no wonder that all of the “vords” in the AI’s continuation seemed to pass the “sniff test” as valid Voynich vords: Bing AI only reused existing Voynich vords! That’s an easy way to guarantee valid vords without having a clue about what makes a Voynichese vord valid or how to construct a new one. So my initial optimism that Bing AI understood something deep about Voynichese was probably mistaken.
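For anyone who wants to repeat this kind of check, here is a minimal sketch of how I understand it, not MarcoP’s actual procedure. It assumes the transliteration has already been split into a flat list of vord tokens; the function names and the toy token lists are made up for illustration and are not real Takahashi data.

```python
def novel_types(window, reference):
    """Word-types that occur in `window` but nowhere in `reference`."""
    return sorted(set(window) - set(reference))


def holdout_novel_types(tokens, start, size):
    """Baseline for real text: take a window of `size` tokens and list the
    word-types that appear nowhere else in the rest of the corpus."""
    window = tokens[start:start + size]
    rest = tokens[:start] + tokens[start + size:]
    return novel_types(window, rest)


# Toy stand-in for a transliteration (hypothetical, for illustration only).
corpus = "daiin chedy qokeedy shedy ol daiin chedy qokal".split()

# A real 4-token passage held out from the toy corpus.
baseline = holdout_novel_types(corpus, start=4, size=4)

# A generated continuation that only reuses existing vords.
continuation = "daiin ol chedy".split()
generated = novel_types(continuation, corpus)

print(len(baseline), baseline)    # word-types unique to the held-out passage
print(len(generated), generated)  # word-types the continuation invented
```

On the real corpus one would average the baseline over many 56-token windows and compare that with the count for the AI’s continuation, which by MarcoP’s observation is zero.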
That said, would it be possible to train a new LLM in a more targeted way, on just English (so that we can interact with it) and Voynichese, so that Voynichese would be a more salient part of its training corpus? Is there enough Voynichese (~170,000 characters, or 38,000 “vords”) for current LLMs to get somewhere with that?